2. Introduction to requests - 2.1 Introduction to requests - Python Web Scraping

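1. Advantages of requests over urllib
2. What requests does
3. res.encoding
4. Common attributes
5. Downloading an image
6. Sending a request with headers
7. Sending a GET request with parameters
- 7.1 URL encoding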

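1. Advantages of requests over urllib

requests is built on top of urllib3
requests exposes the same API on Python 2 and Python 3 (recent releases support Python 3 only)
requests is simple and easy to use
requests automatically decompresses gzip-compressed page content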

2. What requests does

It simulates a client: it sends a request and returns the response data;

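3. res.encoding

requests reads the character set from the Content-Type header of the server's response. If Content-Type contains a charset field, requests decodes the body with that charset. If the charset is missing and the content type is a text type, requests falls back to encoding="ISO-8859-1", the default that RFC 2616 specifies for text types. The Baidu home page is encoded in UTF-8, so when the fallback applies, the charset used for decoding does not match the one used for encoding and the text comes out garbled.
The docstring of encoding reads "Encoding to decode with when accessing r.text": whenever r.text is accessed, the value of encoding is used to decode the body;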

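import requests
res = requests.get("https://www.baidu.com")
# Encoding that requests inferred from the response headers
print(res.encoding)
# Override the encoding before reading res.text
res.encoding = "utf-8"
# Content-Type header the encoding is inferred from
print(res.headers["Content-Type"])
# All response headers
print(res.headers)
# Body decoded with the encoding set above
print(res.text)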
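
The mismatch described above can be reproduced without any network access. The following is a minimal sketch that uses only the standard library; the sample string is an assumption chosen for illustration and stands in for the bytes a server would send.

```python
# Bytes as a server might send them: UTF-8 encoded text.
raw = "百度一下".encode("utf-8")

# Decoding with the wrong charset does not raise an error, because every
# byte is a valid ISO-8859-1 character. It silently yields garbled text.
wrong = raw.decode("ISO-8859-1")
right = raw.decode("utf-8")

print(repr(wrong))
print(right)

# The damage is reversible as long as the original bytes can be rebuilt.
recovered = wrong.encode("ISO-8859-1").decode("utf-8")
print(recovered == right)
```

Setting res.encoding = "utf-8" before reading res.text has the same effect as choosing the second decode call here.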

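4. Common attributes

res.headers: the response headers;
res.request.headers: the request headers;
res.request.url: the URL that was requested;
res.encoding: the encoding used to decode res.text;
res.status_code: the response status code;
res.text: the response body as text, returned as str;
res.content: the response body as raw bytes (images, audio, video, and so on), returned as bytes;
res.json(): the response body parsed as JSON, decoded the way json.loads() would, so a JSON object comes back as a dict and a JSON array as a list;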
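
The relationship between content, text, and json() can be sketched with plain bytes. This is only a conceptual model that uses the standard library: the variables content and encoding are stand-ins for res.content and res.encoding of a hypothetical JSON response, not the library's actual implementation.

```python
import json

# Stand-ins for res.content and res.encoding of a hypothetical response.
content = b'{"name": "requests", "tags": ["http", "python"]}'
encoding = "utf-8"

# Conceptually what res.text returns: the bytes decoded to str.
text = content.decode(encoding)

# Conceptually what res.json() returns: the text parsed as JSON.
data = json.loads(text)

print(type(content).__name__, type(text).__name__, type(data).__name__)

# A JSON array at the top level decodes to a list rather than a dict.
items = json.loads("[1, 2, 3]")
print(type(items).__name__)
```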

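5. Downloading an image

import os
import requests
res = requests.get("https://www.baidu.com/img/bd_logo1.png?where=super")
# Create the target directory first; open() does not create it
os.makedirs("img", exist_ok=True)
# res.content holds the raw bytes, so the file is opened in binary mode
with open("img/baidulogo.png", "wb") as f:
    f.write(res.content)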

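6. Sending a request with headers

Custom headers let the request imitate a browser, so the server returns the same content a browser would get;
The fields of the request header are written into a dict as key-value pairs;

Form of headers: dict
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
Usage: requests.get(url, headers=headers)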
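
To see why a custom User-Agent matters, it can help to inspect what would be sent without one. The sketch below prepares requests without sending them; the URL is a hypothetical placeholder, and going through Session.prepare_request is just one way to look at the merged headers.

```python
import requests

url = "https://example.com/"  # hypothetical URL; nothing is sent to it
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}

session = requests.Session()

# prepare_request merges the session defaults with the per-request headers
# and returns the request that would go over the wire, without sending it.
plain = session.prepare_request(requests.Request("GET", url))
custom = session.prepare_request(requests.Request("GET", url, headers=headers))

# Without custom headers, requests identifies itself as python-requests/<version>.
print(plain.headers["User-Agent"])
# With custom headers, the browser-like value replaces the default.
print(custom.headers["User-Agent"])
```

A server that filters on User-Agent can easily tell the first request from a browser, which is why the second form is used for scraping.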

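7. Sending a GET request with parameters

Parameters: the query string of a GET request starts with ?, each parameter is a key=value pair, and pairs are joined with &;
Form of the parameters: dict
params = {'wd': '居然'}
Usage: requests.get(url, params=params)
Notes:
(1) The ? is added automatically, so the original url does not need to include one
(2) The benefit of params is that values containing Chinese or other non-ASCII characters are URL-encoded automatically; otherwise they must be encoded by hand before being put into the URL;

params is not mandatory, though; sometimes writing the query directly in the URL is more convenient;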

import requests
url = "https://www.cn.bing.com/search"
# url = "https://www.cn.bing.com/search?q=Python"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
params = {
    "q": "Python"
}
res = requests.get(url, headers=headers, params=params)
print(res.request.url)
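
The two notes about params can be checked offline by preparing a request instead of sending it. This is a small sketch; the endpoint is a hypothetical placeholder and the ie=utf-8 query is only there to show how existing query strings are handled.

```python
import requests

base = "https://example.com/search"  # hypothetical endpoint; no request is sent

# prepare() builds the final URL without touching the network.
prepared = requests.Request("GET", base, params={"wd": "居然"}).prepare()
print(prepared.url)  # https://example.com/search?wd=%E5%B1%85%E7%84%B6

# When the URL already carries a query string, params are appended with &.
merged = requests.Request("GET", base + "?ie=utf-8", params={"wd": "居然"}).prepare()
print(merged.url)  # https://example.com/search?ie=utf-8&wd=%E5%B1%85%E7%84%B6
```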

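7.1 URL encoding

By the standard, a URL may contain only a subset of ASCII characters; other characters, such as Chinese characters, are not allowed as-is and must be encoded first.
The URLs built here contain Chinese text, so it has to be encoded;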
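
As a rough illustration of which characters get encoded, the sketch below uses the standard library's urllib.parse. The sample strings are arbitrary choices for demonstration.

```python
from urllib.parse import quote, quote_plus, unquote

sample = "a b/c?"

# quote() leaves "/" alone by default and percent-encodes the rest.
default_quoted = quote(sample)
print(default_quoted)  # a%20b/c%3F

# Passing safe="" encodes "/" as well, which suits single query values.
fully_quoted = quote(sample, safe="")
print(fully_quoted)  # a%20b%2Fc%3F

# quote_plus() is the form-style variant: spaces become "+".
plus_quoted = quote_plus(sample)
print(plus_quoted)  # a+b%2Fc%3F

# Non-ASCII text is encoded as its UTF-8 bytes, one %XX per byte.
chinese_quoted = quote("中文")
print(chinese_quoted)  # %E4%B8%AD%E6%96%87

# unquote() reverses the encoding.
restored = unquote(fully_quoted)
print(restored == sample)  # True
```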

Use the utils.quote and utils.unquote functions in requests
Use the quote and unquote functions of the parse module in urllib

```python
from urllib import parse

name = "贴吧"
print(parse.quote(name))  # %E8%B4%B4%E5%90%A7
print(parse.unquote("%E8%B4%B4%E5%90%A7"))  # 贴吧
```

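### 8. Tieba exercise
Note: when saving a file, the file name must not contain any of these special characters: '*', '|', ':', '?', '/', '<', '>', '"', '\\'. They can be stripped with a regular-expression substitution, for example re.sub(r'[*|:?/<>"\\]', "", filename).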
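
The substitution mentioned in the note could look like the following minimal sketch. The helper name sanitize_filename and its replacement parameter are hypothetical, introduced only for illustration.

```python
import re

# Character class built from the forbidden characters listed in the note.
FORBIDDEN = r'[*|:?/<>"\\]'


def sanitize_filename(name, replacement=""):
    """Return name with every forbidden character replaced (hypothetical helper)."""
    return re.sub(FORBIDDEN, replacement, name)


cleaned = sanitize_filename('a*b|c:d?e/f<g>h"i\\j.html')
print(cleaned)  # abcdefghij.html

# A visible replacement keeps word boundaries readable.
dashed = sanitize_filename("page 1/10: intro?", replacement="_")
print(dashed)  # page 1_10_ intro_
```
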
```python
import os
import requests
from urllib import parse


class TiebaSpider(object):
    def __init__(self, tieba_name):
        # tieba_name is expected to be URL-encoded already (see __main__)
        self.tieba_name = tieba_name
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
        }
        # kw is the forum name; pn advances by 50 for each page
        self.url = "https://tieba.baidu.com/f?kw={}&ie=utf-8&pn={}"

    def get_url_list(self):
        """Build the URLs of the first 10 pages."""
        return [self.url.format(self.tieba_name, i * 50) for i in range(10)]

    def tieba_parsed(self, tieba_url):
        """Send the request and return the response."""
        res = requests.get(tieba_url, headers=self.headers)
        print(res.status_code)
        return res

    def save_html(self, page, res):
        """Write the raw response body to an HTML file."""
        # File names must not contain: '*', '|', ':', '?', '/', '<', '>', '"', '\\'
        os.makedirs("tieba", exist_ok=True)
        filename = "tieba/tieba {}-page {}.html".format(parse.unquote(self.tieba_name), page)
        with open(filename, "wb") as f:
            f.write(res.content)

    def run(self):
        """Main workflow."""
        # 1. Build the list of URLs
        url_list = self.get_url_list()
        # 2. Send a request for each URL and get the response
        for page, tieba_url in enumerate(url_list, start=1):
            res = self.tieba_parsed(tieba_url)
            # 3. Save the HTML page
            self.save_html(page, res)


if __name__ == '__main__':
    # URL-encode the forum name (the Plants vs. Zombies 2 forum) before building URLs
    tieba_name = parse.quote("植物大战僵尸2")
    tieba_spider = TiebaSpider(tieba_name)
    tieba_spider.run()