BeautifulSoup tutorial - ZetCode Chinese tutorial series

BeautifulSoup
Installing BeautifulSoup
The HTML file
BeautifulSoup simple example
BeautifulSoup tags, name, text
BeautifulSoup traverse tags
BeautifulSoup element children
BeautifulSoup element descendants
BeautifulSoup web scraping
BeautifulSoup prettify code
BeautifulSoup find elements by ID
BeautifulSoup find all tags
BeautifulSoup CSS selectors
BeautifulSoup append element
BeautifulSoup insert element
BeautifulSoup replace text
BeautifulSoup remove element

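Original: http://zetcode.com/python/beautifulsoup/

This BeautifulSoup tutorial is an introduction to the BeautifulSoup Python library. The examples find tags, traverse the document tree, modify the document, and scrape web pages.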

BeautifulSoup

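BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as tags, navigable strings, and comments.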
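To make the object tree concrete, the following minimal sketch parses a one-line fragment and lists the type of each child node. The inline markup and the use of the built-in "html.parser" (instead of lxml) are assumptions chosen so that the snippet runs without extra files or modules.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment containing a string, a comment, and a tag.
html = "<p>Hello<!-- a note --><b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

# Each child of <p> is a different kind of object in the tree.
kinds = [type(node).__name__ for node in soup.p.children]
print(kinds)  # ['NavigableString', 'Comment', 'Tag']
```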

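Installing BeautifulSoup

We use the pip3 command to install the necessary modules.

$ sudo pip3 install lxml

This installs the lxml module, which BeautifulSoup uses as its parser in the examples.

$ sudo pip3 install bs4

The above command installs BeautifulSoup.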

The HTML file

In the examples, we will use the following HTML file:

index.html

<!DOCTYPE html>
<html>
    <head>
        <title>Header</title>
        <meta charset="utf-8">                   
    </head>
    <body>
        <h2>Operating systems</h2>
        <ul id="mylist" style="width:150px">
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>                      
            <li>NetBSD</li>           
            <li>Windows</li>         
        </ul>
        <p>
          FreeBSD is an advanced computer operating system used to 
          power modern servers, desktops, and embedded platforms.
        </p>
        <p>
          Debian is a Unix-like computer operating system that is 
          composed entirely of free software.
        </p>        
    </body>    
</html>

BeautifulSoup simple example

In the first example, we use the BeautifulSoup module to get three tags.

simple.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    print(soup.h2)
    print(soup.head)
    print(soup.li)

The code example prints the HTML code of three tags.

from bs4 import BeautifulSoup

We import the BeautifulSoup class from the bs4 module. BeautifulSoup is the main class that does the work.

with open("index.html", "r") as f:
    contents = f.read()

We open the index.html file and read its contents with the read() method.

soup = BeautifulSoup(contents, 'lxml')

A BeautifulSoup object is created; the HTML data is passed to the constructor. The second argument specifies the parser.
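If lxml is not installed, the constructor raises an error for the 'lxml' parser name. The sketch below shows one possible way to fall back to Python's built-in "html.parser"; the make_soup helper is a hypothetical name introduced only for this illustration, and the two parsers can differ in how they repair incomplete markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<h2>Operating systems</h2>")
print(soup.h2)  # <h2>Operating systems</h2>
```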

print(soup.h2)
print(soup.head)

Here we print the HTML code of two tags: h2 and head.

print(soup.li)

There are multiple li elements; this line prints the first one.

$ ./simple.py 
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>

This is the output.

BeautifulSoup tags, name, text

The name attribute of a tag gives its name, and the text attribute gives its text content.

tags_names.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    print("HTML: {0}, name: {1}, text: {2}".format(soup.h2, 
        soup.h2.name, soup.h2.text))

The code example prints the HTML code, name, and text of the h2 tag.

$ ./tags_names.py 
HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems

This is the output.
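Besides name and text, a tag also exposes its HTML attributes. The following sketch, which assumes a shortened inline copy of the ul element and the built-in "html.parser", reads them through dictionary-style access and the attrs property.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the list from index.html (an assumption for brevity).
html = '<ul id="mylist" style="width:150px"><li>Solaris</li><li>FreeBSD</li></ul>'
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul
print(ul.name)           # ul
print(ul["id"])          # mylist
print(ul.attrs)          # {'id': 'mylist', 'style': 'width:150px'}
print(ul.get_text(" "))  # Solaris FreeBSD
```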

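BeautifulSoup traverse tags

With the recursiveChildGenerator() method we traverse the HTML document. In current bs4 releases this method is a legacy alias for the descendants property.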

traverse_tree.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    for child in soup.recursiveChildGenerator():
        if child.name:
            print(child.name)

The example traverses the document tree and prints the names of all HTML tags.

$ ./traverse_tree.py 
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p

These are the tags in the HTML document.
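The same traversal can be sketched with the descendants property and an explicit type check. The inline document and the built-in "html.parser" are assumptions made to keep the snippet self-contained.

```python
from bs4 import BeautifulSoup, Tag

# Small inline document standing in for index.html.
html = (
    "<html><head><title>Header</title></head>"
    "<body><h2>Operating systems</h2></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Walk every node in document order and keep only Tag objects.
names = [node.name for node in soup.descendants if isinstance(node, Tag)]
print(names)  # ['html', 'head', 'title', 'body', 'h2']
```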

BeautifulSoup element children

With the children attribute, we can get the direct children of a tag.

get_children.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    root = soup.html
    root_childs = [e.name for e in root.children if e.name is not None]
    print(root_childs)

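The example retrieves the children of the html tag, places them into a Python list, and prints them to the console. Since the children attribute also returns the whitespace between tags as text nodes, whose name is None, we add a condition to include only tag names.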

$ ./get_children.py 
['head', 'body']

The html tag has two children: head and body.
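To see why the filter is needed, the sketch below lists every child of a small ul element, whitespace included, and then keeps only the tags. The inline markup and the built-in "html.parser" are assumptions for this illustration.

```python
from bs4 import BeautifulSoup, Tag

# The newlines between the tags become text nodes in the tree.
html = "<ul>\n<li>Solaris</li>\n<li>FreeBSD</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

all_children = [str(child) for child in soup.ul.children]
tag_children = [child.name for child in soup.ul.children if isinstance(child, Tag)]

print(all_children)  # ['\n', '<li>Solaris</li>', '\n', '<li>FreeBSD</li>', '\n']
print(tag_children)  # ['li', 'li']
```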

BeautifulSoup element descendants

With the descendants attribute we get all descendants of a tag (children at all levels).

get_descendants.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    root = soup.body
    root_childs = [e.name for e in root.descendants if e.name is not None]
    print(root_childs)

The example retrieves all descendants of the body tag.

$ ./get_descendants.py 
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']

These are all the descendants of the body tag.

BeautifulSoup web scraping

Requests is a simple Python HTTP library. It provides methods for accessing web resources via HTTP.

scraping.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
import requests as req
resp = req.get("http://www.something.com")
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title)
print(soup.title.text)
print(soup.title.parent)

The example retrieves the title of a simple web page. It also prints its parent.

resp = req.get("http://www.something.com")
soup = BeautifulSoup(resp.text, 'lxml')

We get the HTML data of the page.

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

We retrieve the HTML code of the title, its text, and the HTML code of its parent.

$ ./scraping.py 
<title>Something.</title>
Something.
<head><title>Something.</title></head>

This is the output.
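Real-world requests can time out or return an error status, and a page may lack a title. The following sketch shows one defensive variant under those assumptions; extract_title and fetch_title are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the built-in "html.parser" is used instead of lxml.

```python
import requests
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the stripped <title> text, or None if there is no usable title."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url, timeout=10):
    """Download a page and return its title; raises on HTTP errors."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return extract_title(resp.text)
```

Keeping the parsing step in its own function means it can be exercised with a plain string, without any network access; fetch_title would be called with a URL such as the one used in scraping.py.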

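BeautifulSoup prettify code

With the prettify() method, we can reformat the HTML code so that each tag and string sits on its own line, indented by nesting level.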

prettify.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
import requests as req
resp = req.get("http://www.something.com")
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.prettify())

We prettify the HTML code of a simple web page.

$ ./prettify.py 
<html>
 <head>
  <title>
   Something.
  </title>
 </head>
 <body>
  Something.
 </body>
</html>

This is the output.

BeautifulSoup find elements by ID

With the find() method we can find elements by various means, including the element ID.

find_by_id.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    #print(soup.find("ul", attrs={ "id" : "mylist"}))
    print(soup.find("ul", id="mylist"))

The code example finds the ul tag that has the mylist ID. The commented-out line is an alternative way of doing the same task.
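The sketch below checks that both forms return the same element and shows what find() gives back when nothing matches. The inline markup, the nolist ID, and the built-in "html.parser" are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

html = '<ul id="mylist"><li>Solaris</li></ul>'
soup = BeautifulSoup(html, "html.parser")

by_kwarg = soup.find("ul", id="mylist")
by_attrs = soup.find("ul", attrs={"id": "mylist"})
missing = soup.find("ul", id="nolist")  # hypothetical ID that does not exist

print(by_kwarg is by_attrs)  # True: both calls return the same Tag object
print(missing)               # None: find() returns None when nothing matches
```

Because find() returns None for a missing element, it is worth checking the result before accessing attributes on it.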

BeautifulSoup find all tags

With the find_all() method we can find all elements that meet some criteria.

find_all.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    for tag in soup.find_all("li"):
        print("{0}: {1}".format(tag.name, tag.text))

The code example finds and prints all li tags.

$ ./find_all.py 
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows

This is the output.

The find_all() method can take a list of tag names to search for.

find_all2.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    tags = soup.find_all(['h2', 'p'])
    for tag in tags:
        print(" ".join(tag.text.split()))

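The example finds all h2 and p elements and prints their text.

The find_all() method can also take a function which determines what elements should be returned.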

find_by_fun.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
def myfun(tag):
    return tag.is_empty_element
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    tags = soup.find_all(myfun)
    print(tags)

The example prints empty elements.

$ ./find_by_fun.py 
[<meta charset="utf-8"/>]

The only empty element in the document is meta.
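A filter function can test any property of a tag. As a further illustration, the sketch below keeps only li elements whose text ends with "BSD"; the is_bsd_item name, the inline markup, and the built-in "html.parser" are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Solaris</li><li>FreeBSD</li><li>Debian</li><li>NetBSD</li></ul>"
soup = BeautifulSoup(html, "html.parser")


def is_bsd_item(tag):
    """Return True for <li> tags whose text ends with 'BSD'."""
    return tag.name == "li" and tag.text.endswith("BSD")


# find_all() calls the function once per tag and keeps the tags it accepts.
bsd_items = [tag.text for tag in soup.find_all(is_bsd_item)]
print(bsd_items)  # ['FreeBSD', 'NetBSD']
```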

It is also possible to find elements by using regular expressions.

regex.py

#!/usr/bin/python3
import re
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    strings = soup.find_all(string=re.compile('BSD'))
    for txt in strings:
        print(" ".join(txt.split()))

The example prints the text strings that contain "BSD".

$ ./regex.py 
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.

This is the output.
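A search with the string argument returns text nodes rather than tags. The sketch below, which assumes a small inline list and the built-in "html.parser", shows how to get from each matching string back to its enclosing tag through the parent attribute.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>FreeBSD</li><li>Debian</li><li>NetBSD</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Each match is a text node, not a tag.
matches = soup.find_all(string=re.compile("BSD"))
texts = [str(s) for s in matches]
parents = [s.parent.name for s in matches]

print(texts)    # ['FreeBSD', 'NetBSD']
print(parents)  # ['li', 'li']
```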

BeautifulSoup CSS selectors

With the select() and select_one() methods, we can use some CSS selectors to find elements.

select_nth_tag.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    print(soup.select("li:nth-of-type(3)"))

The example uses a CSS selector to print the HTML code of the third li element.

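$ ./select_nth_tag.py 
[<li>Debian</li>]

The select() method returns a list of matches; here it contains the third li element.

The # character is used in CSS to select tags by their id attributes.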

select_by_id.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    print(soup.select_one("#mylist"))

The example prints the element that has the mylist ID.
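The difference between the two methods is the return type: select() gives a list of all matches, while select_one() gives the first match or None. A minimal sketch, assuming a shortened inline list and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>'
soup = BeautifulSoup(html, "html.parser")

as_list = soup.select("li:nth-of-type(3)")                  # list of matches
single = soup.select_one("#mylist li:nth-of-type(3)")       # first match only
nothing = soup.select_one("#mylist li:nth-of-type(9)")      # no ninth item

print([str(tag) for tag in as_list])  # ['<li>Debian</li>']
print(single)                         # <li>Debian</li>
print(nothing)                        # None
```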

BeautifulSoup append element

The append() method appends a new tag to the HTML document.

append_tag.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'
    ultag = soup.ul
    ultag.append(newtag)
    print(ultag.prettify())

The example appends a new li tag.

newtag = soup.new_tag('li')
newtag.string='OpenBSD'

First, we create a new tag with the new_tag() method and set its text through the string attribute.

ultag = soup.ul

We get a reference to the ul tag.

ultag.append(newtag)

We append the newly created tag to the ul tag.

print(ultag.prettify())

We print the ul tag in a neat format.
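Since no output is shown for this example, here is a compact sketch of the same steps on a one-item list, so the result fits on one line. The inline markup and the built-in "html.parser" are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Solaris</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Create the new element, give it text, and attach it as the last child.
newtag = soup.new_tag("li")
newtag.string = "OpenBSD"
soup.ul.append(newtag)

result = str(soup.ul)
print(result)  # <ul><li>Solaris</li><li>OpenBSD</li></ul>
```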

BeautifulSoup insert element

The insert() method inserts a tag at the specified location.

insert_tag.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'
    ultag = soup.ul
    ultag.insert(2, newtag)
    print(ultag.prettify())

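The example inserts a li tag at the third position (index 2) among the children of the ul tag. Because the whitespace between tags also counts as a child, the new tag lands directly after the first li element.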
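The index passed to insert() counts every child node, including whitespace text nodes. The sketch below makes that visible on a two-item list; the inline markup and the built-in "html.parser" are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

# Children of <ul>: '\n', <li>Solaris</li>, '\n', <li>FreeBSD</li>, '\n'
html = "<ul>\n<li>Solaris</li>\n<li>FreeBSD</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

newtag = soup.new_tag("li")
newtag.string = "OpenBSD"
soup.ul.insert(2, newtag)  # index 2 is right after the first <li>

items = [li.text for li in soup.ul.find_all("li")]
print(items)  # ['Solaris', 'OpenBSD', 'FreeBSD']
```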

BeautifulSoup replace text

The replace_with() method replaces an element, here a text node, with new content.

replace_text.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    tag = soup.find(string="Windows")
    tag.replace_with("OpenBSD")
    print(soup.ul.prettify())

The example finds a specific text node with the find() method and replaces its content with the replace_with() method.
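Since no output is shown for this example, here is a compact sketch of the replacement on a two-item list. The inline markup and the built-in "html.parser" are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Debian</li><li>Windows</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Locate the text node itself, then swap it for a new string.
text_node = soup.find(string="Windows")
text_node.replace_with("OpenBSD")

result = str(soup.ul)
print(result)  # <ul><li>Debian</li><li>OpenBSD</li></ul>
```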

BeautifulSoup remove element

The decompose() method removes a tag from the tree and destroys it.

decompose_tag.py

#!/usr/bin/python3
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    ptag2 = soup.select_one("p:nth-of-type(2)")
    ptag2.decompose()
    print(soup.body.prettify())

The example removes the second p element.
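Since no output is shown for this example, here is a compact sketch of the removal on a body with two paragraphs. The inline markup and the built-in "html.parser" are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

html = "<body><p>FreeBSD</p><p>Debian</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Select the second paragraph and remove it from the tree.
second = soup.select_one("p:nth-of-type(2)")
second.decompose()

result = str(soup.body)
print(result)  # <body><p>FreeBSD</p></body>
```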

In this tutorial, we have worked with the Python BeautifulSoup library.
