Introduction to Beautiful Soup

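Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads roughly as follows:
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only need to state the original encoding. Beautiful Soup sits on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.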
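
The encoding behaviour described above can be sketched as follows. This is a minimal illustration, assuming the input is a byte string in GB2312 with no charset declaration; the sample markup is made up for the example. The from_encoding argument states the original encoding, the tree holds Unicode strings, and encode() serialises to UTF-8 by default.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as GB2312 with no charset declaration.
raw = "<p>你好</p>".encode("gb2312")

# from_encoding states which encoding the incoming bytes use.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb2312")

text = soup.p.string          # a Unicode str inside the parse tree
utf8_bytes = soup.p.encode()  # serialised back out; UTF-8 is the default

print(text)
print(utf8_bytes.decode("utf-8"))
```

If the bytes already declare their charset (for example in a meta tag), from_encoding can usually be left out.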

Installing Beautiful Soup: pip install beautifulsoup4

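Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses the document the way a browser does; produces HTML5-style documents	Slow; does not depend on external extensions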

Installing lxml: pip install lxml
Installing html5lib: pip install html5lib
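
Since lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that; make_soup and its preferred ordering are hypothetical names chosen for this example, while FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not part of Beautiful Soup.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser library is not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print(used, soup.p.string)
```

Different parsers can build slightly different trees from the same broken markup, so pinning one parser is the safer choice when results must be reproducible.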

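Some HTML attribute names, most notably class, are reserved words in Python and cannot be used directly as keyword arguments. When searching with find(), find_all(), and similar methods, append an underscore to such a name to tell them apart, for example class_="sister".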
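
A short sketch of the trailing-underscore convention, using made-up markup. It also shows the attrs dictionary, which avoids the keyword clash altogether, and an ordinary attribute such as id, which needs no underscore.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
doc = '<div class="box" id="a">one</div><div class="other">two</div>'
soup = BeautifulSoup(doc, "html.parser")

by_keyword = soup.find("div", class_="box")          # class_ because class is reserved
by_attrs = soup.find("div", attrs={"class": "box"})  # equivalent dictionary form
by_id = soup.find("div", id="a")                     # id is not reserved, no underscore

print(by_keyword.get_text())
print(by_keyword is by_attrs, by_keyword is by_id)
```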

1) find_all(name, attrs, recursive, text, **kwargs)

The find_all() method searches all descendant tags of the current tag and returns those that match the filter conditions.
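
The filters that find_all() accepts can be sketched like this. The markup is a trimmed-down, hypothetical variant of the "three sisters" sample; the calls show a tag name, a list of names, a regular expression on an attribute, the limit argument, and recursive=False.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<b>bold</b>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

links = soup.find_all("a")                         # filter by tag name
a_and_b = soup.find_all(["a", "b"])                # filter by a list of names
by_regex = soup.find_all(id=re.compile(r"^link"))  # filter an attribute by regex
first_only = soup.find_all("a", limit=1)           # stop after the first match
direct = soup.find_all("a", recursive=False)       # direct children of soup only

print(len(links), len(a_and_b), len(by_regex), len(first_only), len(direct))
```

The last call returns an empty list because the a tags sit inside p and are not direct children of the top-level object.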

2) find(name=None, attrs={}, recursive=True, text=None, **kwargs)

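The only difference from find_all() is the return value: find_all() always returns a list, even when it holds a single element, whereas find() returns the result directly.
.find('p') and .findAll('p') (findAll is the older name for find_all): find() returns the first matching Tag object found from the top of the document, not a string, or None if nothing matches. If that first tag sits high in the tree and wraps a lot of content, everything nested inside it, including tags of the same name, comes along as part of it. findAll() returns every match as a list. When a tag contains other tags of the same name, those inner tags appear inside their parent and are also listed as separate elements of their own.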
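
The return values of the two methods can be checked with a small nested example. The markup below is hypothetical; it nests one div inside another so that both the "first match" and the "all matches" behaviour are visible.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a div nested inside another div.
doc = "<div id='outer'><div id='inner'>text</div></div>"
soup = BeautifulSoup(doc, "html.parser")

first = soup.find("div")             # first match, as a Tag object
every = soup.find_all("div")         # all matches, outer and inner
missing = soup.find("table")         # no match: None
none_found = soup.find_all("table")  # no match: empty list

ids = [d["id"] for d in every]
print(isinstance(first, Tag), first["id"], ids, missing, none_found)
```

Because find() can return None, it is worth checking the result before accessing attributes on it.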

I. Usage

1. Creating a BeautifulSoup object

From a string or a local file:

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
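soup = BeautifulSoup(html, "html.parser")  # create the object from a string
with open("index.html", encoding="utf-8") as f:  # or from a local file (UTF-8 is assumed here)
    soup = BeautifulSoup(f, "html.parser")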

Fetching the document from a URL:

import requests
from bs4 import BeautifulSoup

url = "https://www.dydytt.net/html/gndy/dyzz/index.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
rsp = requests.get(url, headers=headers)
rsp.encoding = "gb2312"  # character encoding used to decode the response
file = rsp.text
html = BeautifulSoup(file, "html.parser")  # specify the HTML parser explicitly
# Match the div with class="co_content2". Beautiful Soup uses class_ (with a
# trailing underscore) because class is a reserved word in Python.
content = html.find("div", class_="co_content2")
print(content)
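
Once the div has been located, a common next step is to pull the links out of it. The sketch below runs offline on a hypothetical stand-in for the downloaded page; the real markup of that site may well differ, and the link texts and paths here are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
page = """
<div class="co_content2">
  <a href="/html/a.html">Movie A</a>
  <a href="/html/b.html">Movie B</a>
</div>
<div class="footer"><a href="/about.html">About</a></div>
"""
soup = BeautifulSoup(page, "html.parser")
content = soup.find("div", class_="co_content2")

links = []
if content is not None:  # find() returns None when nothing matches
    # Search only inside the matched div, so the footer link is skipped.
    links = [(a.get_text(strip=True), a.get("href")) for a in content.find_all("a")]

print(links)
```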