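I. Course Outline

Course contents
- Introduction to bs4
- Using bs4
- Traversing the tree: child nodes
- Traversing the tree: parent nodes
- Traversing the tree: sibling nodes
- Searching the tree
- find_all() and find()
- Modifying the document tree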

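II. Class Notes

1. Introduction to bs4

1.1 Basic concepts

Beautiful Soup is a Python library for extracting data from HTML and XML files.

1.2 Source code analysis

Download the source code from GitHub.
Installation
- pip install lxml
- pip install bs4 (this installs beautifulsoup4, which is the official package name)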

2. Using bs4

2.1 Quick start

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

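# Create the BeautifulSoup object, using lxml as the parser
bs = BeautifulSoup(html_doc, 'lxml')
# Print the document with normalized, indented markup
print(bs.prettify())
print(bs.title)  # the title tag: <title>The Dormouse's story</title>
print(bs.title.name)  # the name of the title tag: title
print(bs.title.string)  # the text inside the title tag: The Dormouse's story
print(bs.p)  # the first p tag in the document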

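2.2 Object types in bs4

Tag : an HTML or XML tag
NavigableString : the text inside a tag, as a navigable string
BeautifulSoup : the object that represents the whole document
Comment : a comment (a special kind of NavigableString)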
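
The four object types can be seen side by side in a small sketch. The markup below is made up for illustration, and the built-in html.parser is used instead of lxml only so that the snippet runs without extra dependencies; both choices are assumptions, not part of the lesson.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup containing a tag, a string, and a comment
soup = BeautifulSoup('<p class="title"><b>Hi</b><!--note--></p>', "html.parser")

tag = soup.p                   # Tag
text = soup.b.string           # NavigableString
comment = soup.p.contents[1]   # Comment (second child of <p>)

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Because Comment is a subclass of NavigableString, code that must tell the two apart should check for Comment first.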

3. Traversing the tree: child nodes

bs4 supports three kinds of operations on the document tree: traversing, searching, and modifying.

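3.1 contents, children, descendants

contents: returns a tag's direct children as a list
children: returns an iterator over a tag's direct children
descendants: returns a generator that walks all descendants recursively (children, grandchildren, and so on)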
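
A minimal sketch of the difference between the three attributes, assuming a small made-up document and the built-in html.parser (chosen here only to avoid extra dependencies):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# contents: a list of the direct children
direct = div.contents

# children: an iterator over the same direct children
child_names = [child.name for child in div.children]

# descendants: walks every nested node, strings included, so keep only tags
descendant_names = [node.name for node in div.descendants if isinstance(node, Tag)]

print(len(direct), child_names, descendant_names)
```

Note that descendants also yields the text nodes, which is why the sketch filters with isinstance before reading the tag name.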

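3.2 .string, .strings, .stripped_strings

string: returns the text inside a tag when the tag has a single string child
strings: returns a generator over all the strings inside a tag, for tags that contain more than one string
stripped_strings: works like strings, but strips surrounding whitespace and skips whitespace-only strings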
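
The three attributes can be compared on one tag. This is a sketch on made-up markup, parsed with the built-in html.parser as an assumption for portability:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b>world</b>\n</p>", "html.parser")

single = soup.b.string        # <b> has exactly one string child
ambiguous = soup.p.string     # <p> has several children, so this is None

all_strings = list(soup.p.strings)             # whitespace is preserved
clean_strings = list(soup.p.stripped_strings)  # whitespace is removed

print(repr(single), ambiguous, all_strings, clean_strings)
```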

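4. Traversing the tree: parent nodes

parent and parents

parent: returns the immediate parent of a node
parents: returns a generator over all ancestors of a node, starting with the nearest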
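
As a short illustration, assuming a made-up nested document and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")
b = soup.b

# parent: the element that directly contains <b>
parent_name = b.parent.name

# parents: every ancestor, nearest first, ending with the BeautifulSoup object
ancestor_names = [ancestor.name for ancestor in b.parents]

print(parent_name, ancestor_names)
```

The last entry is the BeautifulSoup object itself, whose name is reported as "[document]".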

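5. Traversing the tree: sibling nodes

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: a generator over all following sibling nodes
previous_siblings: a generator over all preceding sibling nodes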
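
A sketch of sibling navigation on made-up markup (parsed with the built-in html.parser as an assumption). The tags are written without whitespace between them on purpose, because whitespace between tags is itself a sibling text node:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>1</a><b>2</b><i>3</i></p>", "html.parser")
b = soup.b

next_name = b.next_sibling.name          # the node right after <b>
previous_name = b.previous_sibling.name  # the node right before <b>

# The plural forms yield every sibling in that direction, nearest first
following = [node.name for node in soup.a.next_siblings]
preceding = [node.name for node in soup.i.previous_siblings]

print(next_name, previous_name, following, preceding)
```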

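6. Searching the tree

String filter
Regular expression filter

Compile a pattern with re.compile() and pass it to find() or find_all(); the search then matches tag names against that regular expression.

List filter
True filter
Function filter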
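
The five filter types can be sketched with find_all() as follows. The markup and the helper function has_id are hypothetical, and the built-in html.parser is an assumption made so the snippet has no extra dependencies:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><a id="link1">x</a><b>y</b><p>z</p></body>', "html.parser")

# String filter: exact tag name
by_string = [t.name for t in soup.find_all("b")]

# Regular expression filter: tag names that start with "b"
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

# List filter: any tag whose name is in the list
by_list = [t.name for t in soup.find_all(["a", "b"])]

# True filter: every tag, but no text nodes
by_true = [t.name for t in soup.find_all(True)]

# Function filter: the function receives a tag and returns True to keep it
def has_id(tag):
    return tag.has_attr("id")

by_function = [t.name for t in soup.find_all(has_id)]

print(by_string, by_regex, by_list, by_true, by_function)
```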

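7. find_all() and find()

7.1 find_all()

find_all() returns all matching tags as a list
find() returns only the first match
Parameters of find_all()

def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs):

name : tag name
attrs : tag attributes
recursive : whether to search all descendants (True) or only direct children (False)
text : text content to match (Beautiful Soup 4.4.0 and later name this parameter string)
limit : maximum number of results to return
kwargs : keyword arguments, used as attribute filters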
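
A sketch of how each parameter narrows a search. The markup is invented for illustration, html.parser is an assumption, and the text filter is written with the newer parameter name string, which assumes Beautiful Soup 4.4.0 or later:

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<p><a id="link3">Tillie</a></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                              # name
first_link = soup.find("a")                                 # first match only
sisters = soup.find_all("a", attrs={"class": "sister"})     # attrs
limited = soup.find_all("a", limit=1)                       # limit
direct_only = soup.div.find_all("a", recursive=False)       # direct children of <div>
by_keyword = soup.find_all(id="link3")                      # kwargs
by_text = soup.find_all("a", string="Lacie")                # text content

print(len(all_links), first_link["id"], len(sisters), len(direct_only))
```

When nothing matches, find_all() returns an empty list while find() returns None, so the result of find() is worth checking before use.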

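7.2 find_parents(), find_parent(), find_next_siblings(), find_next_sibling()

find_parents(): searches all ancestors
find_parent(): searches for a single ancestor (the nearest match)
find_next_siblings(): searches all following siblings
find_next_sibling(): searches for a single following sibling (the nearest match)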
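
These four methods accept the same filters as find_all(). A minimal sketch, assuming made-up markup with hypothetical ids and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = '<div><p><a id="l1">1</a><a id="l2">2</a><a id="l3">3</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="l1")

# Ancestors, nearest first
nearest_div = first.find_parent("div").name
matching_parents = [t.name for t in first.find_parents(["p", "div"])]

# Siblings that come after the starting tag
later_ids = [t["id"] for t in first.find_next_siblings("a")]
next_id = first.find_next_sibling("a")["id"]

print(nearest_div, matching_parents, later_ids, next_id)
```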

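7.3 find_previous_siblings(), find_previous_sibling(), find_all_next(), find_next()

find_previous_siblings(): searches all preceding siblings
find_previous_sibling(): searches for a single preceding sibling (the nearest match)
find_all_next(): searches all elements that come later in the document
find_next(): searches for the first matching element that comes later in the document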
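
A sketch of the backward sibling searches and the forward document searches, again on invented markup with hypothetical ids and html.parser as an assumption:

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<p><a id="l1">1</a><a id="l2">2</a><a id="l3">3</a></p>'
    '<p id="end">tail</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="l1")
last = soup.find("a", id="l3")

# Siblings before the starting tag, nearest first
earlier_ids = [t["id"] for t in last.find_previous_siblings("a")]
previous_id = last.find_previous_sibling("a")["id"]

# Everything later in the document, not just siblings
later_ids = [t["id"] for t in first.find_all_next(True)]
next_paragraph_id = first.find_next("p")["id"]

print(earlier_ids, previous_id, later_ids, next_paragraph_id)
```

Unlike the sibling methods, find_all_next() and find_next() follow document order, so they can leave the starting tag's parent, as the second paragraph in the result shows.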

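8. Modifying the document tree

Change a tag's name and attributes
Change .string: assigning to it replaces the tag's existing contents with the new string
append(): adds content to a tag, just like .append() on a Python list
decompose(): removes a tag from the tree; use it to delete paragraphs the output does not need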
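
The four modifications can be tried together in one sketch. The markup, class names, and replacement text are made up, and html.parser is an assumption chosen to avoid extra dependencies:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><b class="old">text</b><p>unwanted</p></div>', "html.parser")
tag = soup.b

# Change the tag's name and one of its attributes
tag.name = "strong"
tag["class"] = "new"

# Assigning to .string replaces the existing contents
tag.string = "replaced"

# append() adds content at the end of the tag
tag.append(" more")

# decompose() removes the <p> tag and everything inside it
soup.p.decompose()

result = str(soup)
print(result)
```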

Python Web Scraping

05 - BeautifulSoup4
