Installing the library

  pip install beautifulsoup4

Testing

  >>> import requests
  >>> r = requests.get("http://python123.io/ws/demo.html")
  >>> r.text  # echoes the raw HTML of the page (output omitted here)
  >>> demo = r.text
  >>> from bs4 import BeautifulSoup
  >>> soup = BeautifulSoup(demo, "html.parser")
  >>> print(soup.prettify())
  <html>
   <head>
    <title>
     This is a python demo page
    </title>
   </head>
   <body>
    <p class="title">
     <b>
      The demo python introduces several python courses.
     </b>
    </p>
    <p class="course">
     Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
     <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
      Basic Python
     </a>
     and
     <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
      Advanced Python
     </a>
     .
    </p>
   </body>
  </html>

Typical usage pattern

  from bs4 import BeautifulSoup
  soup = BeautifulSoup(demo, 'html.parser')
  print(soup.prettify())
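
As a rough, offline-friendly sketch of the same pattern, the snippet below parses an inline HTML string instead of a downloaded page. The `DEMO_HTML` constant is a shortened stand-in for the demo page, assumed here for illustration only; it is not the page's exact source.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the demo page (not its exact source).
DEMO_HTML = (
    "<html><head><title>This is a python demo page</title></head>"
    '<body><p class="title"><b>The demo python introduces several '
    "python courses.</b></p></body></html>"
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")  # build the tag tree
pretty = soup.prettify()                        # one tag or string per line
print(pretty)
```

Keeping the markup in a string makes the example reproducible without network access; with a live page, `demo = r.text` from the session above would take its place.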

Understanding the Beautiful Soup library

  • Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" of a document

    Importing the library

    Understanding the BeautifulSoup class

    Beautiful Soup parsers
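
The second argument to `BeautifulSoup` names the parser. Here is a minimal sketch, assuming the commonly documented parser names: `"html.parser"` from the standard library, plus `"lxml"`, `"xml"`, and `"html5lib"`, which need extra packages installed. It tries a preferred parser and falls back to the built-in one. The helper name `make_soup` is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises this when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>demo</b></p>")
print(soup.b.string)
```

Different parsers may repair broken markup differently, so pinning one parser per project is usually the safer choice.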

    Basic elements of Beautiful Soup

    Tag

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, 'html.parser')
    >>> soup.title
    <title>This is a python demo page</title>
    >>> tag = soup.a
    >>> tag  # returns the first <a> tag
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
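
Because `soup.a` returns only the first match, a small sketch may help show how to reach the remaining tags with `find_all`. The `HTML` string is a shortened stand-in for the demo page, assumed for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the demo page's course paragraph.
HTML = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    " and "
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>'
    "</p>"
)

soup = BeautifulSoup(HTML, "html.parser")
first = soup.a                    # first <a> tag only
all_links = soup.find_all("a")    # every <a> tag, in document order
link_ids = [a["id"] for a in all_links]
print(first["id"], link_ids)
```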

    Tag name

    >>> soup.a.name
    'a'
    >>> soup.a.parent.name
    'p'
    >>> soup.a.parent.parent.name
    'body'
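
Chaining `.parent` by hand works for a level or two. As a sketch of the same idea, the `.parents` iterator walks all the way to the root. The `HTML` string below is a shortened, assumed stand-in for the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the demo page.
HTML = (
    "<html><body><p class=\"course\">"
    '<a id="link1">Basic Python</a>'
    "</p></body></html>"
)

soup = BeautifulSoup(HTML, "html.parser")

# Collect the name of every ancestor of the first <a> tag, innermost first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry, `[document]`, is the name of the `BeautifulSoup` object itself, which sits at the root of the tag tree.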

    Tag attrs (attributes)

  >>> tag = soup.a
  >>> tag.attrs
  {'href': 'http://www.icourse163.org/course/BIT-268001',
  'class': ['py1'],
  'id': 'link1'}
  >>> tag.attrs['class']
  ['py1']
  >>> tag.attrs['href']
  'http://www.icourse163.org/course/BIT-268001'
  >>> type(tag)
  <class 'bs4.element.Tag'>
  >>> type(tag.attrs)
  <class 'dict'>
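
Since `attrs` is a plain dictionary, a tag can also be indexed directly. The sketch below, using an assumed one-tag stand-in for the demo page, shows direct indexing alongside `get` and `has_attr`, which avoid a `KeyError` when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the demo page.
HTML = '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'

soup = BeautifulSoup(HTML, "html.parser")
tag = soup.a

href = tag["href"]            # same as tag.attrs["href"]
classes = tag.get("class")    # multi-valued attribute, returned as a list
missing = tag.get("title")    # absent attribute: None instead of KeyError
has_id = tag.has_attr("id")
print(href, classes, missing, has_id)
```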

Tag NavigableString

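  >>> soup.p
  <p class="title"><b>The demo python introduces several python courses.</b></p>
  >>> soup.p.string
  'The demo python introduces several python courses.'
  >>> type(soup.p.string)
  <class 'bs4.element.NavigableString'>
  • The <p> tag contains a nested <b> tag, yet .string returns only the text, without the <b> markup. When a tag has exactly one child, .string descends into it, so a NavigableString can be reached across several levels of nesting.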
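
The descent only happens when each tag has a single child. As a hedged sketch, using an assumed, shortened stand-in for the demo page, the snippet below contrasts that case with a paragraph that has several children, where `.string` is `None` and `get_text()` can be used instead.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the demo page.
HTML = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is great: '
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
)

soup = BeautifulSoup(HTML, "html.parser")
title_p = soup.find("p", class_="title")
course_p = soup.find("p", class_="course")

single = title_p.string        # one child chain: <p> -> <b> -> text
multiple = course_p.string     # several children: no single string, so None
joined = course_p.get_text()   # concatenates all nested strings
print(single, multiple, joined, sep="\n")
```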

    Tag Comment

  >>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is a comment</p>", "html.parser")
  >>> newsoup.b.string
  'This is a comment'
  >>> type(newsoup.b.string)
  <class 'bs4.element.Comment'>
  >>> newsoup.p.string
  'This is a comment'
  >>> type(newsoup.p.string)
  <class 'bs4.element.NavigableString'>
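
The two `.string` values above print identically, so the type is the only way to tell a comment from real text. A minimal sketch of one way to filter comments out follows; the helper name `is_comment` is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is a comment</p>", "html.parser"
)


def is_comment(node):
    """Return True when the node is an HTML comment rather than visible text."""
    return isinstance(node, Comment)


# Keep only strings that are not comments.
visible_text = [s for s in newsoup.find_all(string=True) if not is_comment(s)]
print(is_comment(newsoup.b.string), is_comment(newsoup.p.string), visible_text)
```

`Comment` is a subclass of `NavigableString`, so the check has to test for `Comment` specifically; testing for `NavigableString` would match both.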