HTTP：解析 HTML 和 XHTML

Beautiful Soup：用于解析 HTML 和 XML 的 python 包
PyQuery：一个类似 jquery 的 Python 库
HTMLParser：简单的 HTML 和 XHTML 解析器

原文： https://pythonspot.com/http-parse-html-and-xhtml

在本文中，您将学习如何解析网站的 HTML（超文本标记语言）。有几个 Python 库可以实现这一目标。我们将演示一些流行的例子。

Beautiful Soup：用于解析 HTML 和 XML 的 python 包

该库非常流行，甚至可以处理格式错误的标记。要获取单个div的内容，可以使用以下代码：

from BeautifulSoup import BeautifulSoup
import urllib2
# get the contents
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
parsed_html = BeautifulSoup(html)
print parsed_html.body.find('div', attrs={'class':'toc'})

这将输出 Wikipedia 文章中称为“toc”（目录）的div中的 HTML 代码。如果只希望使用原始文本，请使用：

print parsed_html.body.find('div', attrs={'class':'toc'}).text

如果要获取页面标题，则需要从标题部分获取：

print parsed_html.head.find('title').text

要从网站获取所有图像 URL，可以使用以下代码：

from BeautifulSoup import BeautifulSoup
import urllib2
url = 'https://www.arstechnica.com/'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
links = soup.findAll('img', src=True)
for link in links:
    print(link["src"])

要从网页中获取所有 URL，请使用以下命令：

from BeautifulSoup import BeautifulSoup
import urllib2
url = 'https://www.arstechnica.com/'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
links = soup.findAll('a')
for link in links:
    print(link["href"])

PyQuery：一个类似 jquery 的 Python 库

要从标记中提取数据，我们可以使用 PyQuery。它可以根据您的需要获取实际的文本内容和 html 内容。要获取标签，请使用调用pq('tag')。

from pyquery import PyQuery    
import urllib2
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')
# print the text of the div
print tag.text()
# print the html of the div
print tag.html()

要获得标题，只需使用：

tag = pq('title')

`HTMLParser`：简单的 HTML 和 XHTML 解析器

该库的用法非常不同。使用此库，您必须将所有逻辑都放在WebParser类中。用法的基本示例如下：

from HTMLParser import HTMLParser
import urllib2
# create parse
class WebParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Tag: " + tag
# get the contents
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
# instantiate the parser and fed it some HTML
parser = WebParser()
parser.feed(html)

Beautiful Soup：用于解析 HTML 和 XML 的 python 包

PyQuery：一个类似 jquery 的 Python 库

HTMLParser：简单的 HTML 和 XHTML 解析器

`HTMLParser`：简单的 HTML 和 XHTML 解析器