Web Crawlers - 02 A Simple Beautiful Soup Example - "Python Programming Digital Tutorial" (《Python程序设计数字教程》)

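Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with a parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document, which can save a great deal of work.
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract, so a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to specify the original encoding.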
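As a small illustration of the encoding behaviour described above, the sketch below passes Beautiful Soup a byte string that declares no encoding and names the encoding explicitly through the from_encoding argument. The sample markup and the choice of the html.parser parser are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: Latin-1 encoded bytes with no charset declaration.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                 # text is held internally as Unicode
utf8_bytes = soup.p.encode("utf-8")  # serialised output as UTF-8 bytes

print(text)
print(utf8_bytes)
```

When the input is already a Python str, no decoding step is needed and from_encoding should be omitted.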
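Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try different parsing strategies or trade flexibility for speed. Beautiful Soup 3 is no longer developed, and new projects should use Beautiful Soup 4. Its code lives in the bs4 package, so bs4 is the name to import.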
To install:
pip install beautifulsoup4
To import:
import bs4
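Beautiful Soup supports the HTML parser in Python's standard library (html.parser) as well as several third-party parsers such as lxml and html5lib, both of which can be installed with pip.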
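Because lxml and html5lib are optional, a script may want to fall back to the standard-library parser when they are missing. The following is a minimal sketch of that idea; the helper name make_soup and its preference order are hypothetical and not part of Beautiful Soup. It relies on the FeatureNotFound exception that bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns the soup together with the parser name used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, features=name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_name = make_soup("<p>Hello</p>")
print(parser_name, soup.p.string)
```

Different parsers can build slightly different trees from invalid markup, so pinning a single parser is usually the safer choice when results must be reproducible.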

from bs4 import BeautifulSoup
# All of the code examples below use this document
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html, features='lxml')
print(soup)

print(soup)  # the parsed document; the indentation below is added for readability
<html>
    <head>
           <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>
        <p class="story">...</p>
    </body>
</html>
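To get an indented view of the tree directly from Beautiful Soup, the prettify() method can be used. The sketch below applies it to a tiny made-up fragment (an assumption for brevity) with html.parser; prettify() puts each tag and each string on its own line.

```python
from bs4 import BeautifulSoup

# Small hypothetical fragment, used instead of the full document for brevity.
soup = BeautifulSoup("<p><b>Hi</b></p>", "html.parser")

# prettify() returns a str with one tag or string per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```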

soup = BeautifulSoup(html, features='lxml')
tag = soup.a
navstr = tag.string
print(tag)
print(navstr)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
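The two objects printed above are a Tag and a NavigableString. As a rough sketch of what can be read from them, the example below inspects the name and attributes of a single a tag; the one-line markup is copied from the test document, and html.parser is chosen here only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(snippet, "html.parser")

tag = soup.a
navstr = tag.string

print(isinstance(tag, Tag))                 # the element itself is a Tag
print(isinstance(navstr, NavigableString))  # its text is a NavigableString
print(tag.name)                             # tag name
print(tag["href"])                          # attribute access, like a dict
print(tag["class"])                         # class is multi-valued, so a list
print(tag.get("title"))                     # get() gives None if absent
```

Indexing with tag["title"] would raise KeyError for a missing attribute, which is why get() is used for the last lookup.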

print(soup.head)      # get the <head> tag
<head><title>The Dormouse's story</title></head>

print(soup.a.string)  # get the text of the <a> tag; only the first <a> is used
Elsie

print(soup.p.string)  # get the text inside the first <p> tag
The Dormouse's story
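One caveat worth illustrating: .string only yields text when a tag contains a single string, directly or through a single nested child, as the first p tag does above. For a tag with mixed content it is None, and get_text() is the usual alternative. The fragment below is made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <p> holds text, a <b> tag, and more text.
soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

single = soup.b.string        # <b> has exactly one string child
mixed = soup.p.string         # <p> has several children, so this is None
full_text = soup.p.get_text() # concatenates all text inside <p>

print(single)
print(mixed)
print(full_text)
```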

print(soup.p.b)       # get the <b> tag inside the first <p> tag
<b>The Dormouse's story</b>
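Attribute-style access such as soup.p.b walks down the tree. A brief sketch of moving in other directions follows, using a shortened, made-up version of the story paragraph; .parent, .next_sibling, and find_next_sibling() are standard Beautiful Soup navigation members.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the "story" paragraph.
html = '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
parent_name = first.parent.name            # move up to the enclosing tag
after_first = first.next_sibling           # the text node right after the tag
next_link = first.find_next_sibling("a")   # skip text, find the next <a>

print(parent_name)
print(repr(after_first))
print(next_link["id"])
```

Note that .next_sibling returns the neighbouring node of any kind, which is often a piece of text rather than a tag.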

print(soup.find('a'))                  # find returns the first matching element
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.find_all('a'))              # find_all returns a list of all matching elements
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
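In a crawler, the list returned by find_all is typically looped over to pull out link text and URLs. The sketch below shows one way to do that on the three a tags from the test document, parsed here with html.parser as an assumption to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect (link text, URL) pairs from every <a> tag in document order.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

for name, url in links:
    print(name, url)
```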

print(soup.find_all('a', id='link1'))  # returns all <a> elements whose id is 'link1'
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
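Besides id, find_all accepts several other filters. The following sketch, again on the three a tags of the test document with html.parser assumed, shows filtering by CSS class (class_ has a trailing underscore because class is a Python keyword), limiting the number of results, passing an attrs dictionary, and what find returns when nothing matches.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("a", class_="sister")        # filter by CSS class
first_two = soup.find_all("a", limit=2)               # stop after two matches
by_attrs = soup.find_all("a", attrs={"id": "link3"})  # filter via a dict
missing = soup.find("table")                          # no match gives None

print(len(by_class))
print([a["id"] for a in first_two])
print(by_attrs[0].get_text())
print(missing)
```

Since find returns None when there is no match, checking the result before accessing attributes avoids an AttributeError.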