layout: posttitle: python爬虫笔记
subtitle: 基础知识、requests、asyncio/aiohttp异步爬虫
date: 2019-08-13
author: NSX
header-img: img/post-bg-ios9-web.jpg
catalog: true
tags:
- Python
- 教程
- 爬虫
python爬虫笔记
- 01. 爬虫预备知识
02. 网络请求模块requests">02. 网络请求模块requests
03. JSON数据提取">03. JSON数据提取
04. re 正则表达式提取数据 - 正则表达式提取数据.md)">04. re 正则表达式提取数据 - 正则表达式提取数据.md)
- 正则表达式就是记录文本规则的代码
05. aiohttp异步爬虫
参考

layout: posttitle: python爬虫笔记
subtitle: 基础知识、requests、asyncio/aiohttp异步爬虫
date: 2019-08-13
author: NSX
header-img: img/post-bg-ios9-web.jpg
catalog: true
tags:
- Python
- 教程
- 爬虫

python爬虫笔记

爬虫预备知识
网络请求模块 requests
JSON数据提取
re正则取数据
aiohttp异步爬虫
01. 爬虫预备知识

爬虫定义

网络爬虫（又被称为网页蜘蛛，网络机器人）就是模拟浏览器发送网络请求，接收请求响应，一种按照一定的规则，自动地抓取互联网信息的程序。

爬虫流程

向起始url发送请求，并获取响应
对响应进行提取
如果提取url，则继续发送请求获取响应
如果提取数据，则将数据进行保存

网络模型对应关系

HTTP、RTSP、FTP ———-> 应用层
TCP、UDP ———-> 传输层
IP ———-> 网络层
数据链路 ———-> 数据链路层
物理介质 ———-> 物理层

url 地址格式

scheme：协议（例如：http, https, ftp）
host：服务器的 IP 地址或者域名
port：服务器的端口（如果是走协议默认端口，缺省端口80）
path：访问资源的路径
query-string：参数，发送给 http 服务器的数据
anchor：锚（跳转到网页的指定锚点位置）

HTTP 请求

请求方式 | 请求方式 | 描述 | | —- | —- | | GET | 请求指定的页面信息，并返回实体主体。 | | HEAD | 类似于 get 请求，只不过返回的响应中没有具体的内容，用于获取报头 | | POST | 向指定资源提交数据进行处理请求（例如提交表单或者上传文件）。数据被包含在请求体中。POST 请求可能会导致新的资源的建立和/或已有资源的修改。 | | PUT | 从客户端向服务器传送的数据取代指定的文档的内容 | | DELETE | 请求服务器删除指定的页面。 | | CONNECT | HTTP/1.1 协议中预留给能够将连接改为管道方式的代理服务器。 | | OPTIONS | 允许客户端查看服务器的性能。 | | TRACE | 回显服务器收到的请求，主要用于测试或诊断。 |

常见请求头 | 请求头 | 作用 | | —- | —- | | Cookie | Cookie | | User-Agent | 浏览器名称 | | Referer | 页面跳转处 | | Host | 主机和端口号 | | Connection | 链接类型 | | Upgrade-Insecure-Requests | 升级为 HTTPS 请求 | | Accept | 传输文件类型 | | Accept-Encoding | 文件编解码格式 | | x-requested-with : XMLHttpRequest | ajax 请求 |

模拟请求时，先在headers中加入User-Agent,如果还不可以请求再尝试加入Referer,还无法访问，在尝试加入Cookie,在最后可以尝试加入Host.

HTTP 响应

HTTP 状态码

当浏览者访问一个网页时，浏览者的浏览器会向网页所在服务器发出请求。当浏览器接收并显示网页前，此网页所在的服务器会返回一个包含 HTTP 状态码的信息头（server header）用以响应浏览器的请求。

HTTP 状态码的英文为 HTTP Status Code。HTTP 状态码共分为 5 种类型

分类	分类描述
1**	信息，服务器收到请求，需要请求者继续执行操作
2**	成功，操作被成功接收并处理
3**	重定向，需要进一步的操作以完成请求
4**	客户端错误，请求包含语法错误或无法完成请求
5**	服务器错误，服务器在处理请求的过程中发生了错误

常见的 HTTP 状态码：

200 - 请求成功
301 - 资源（网页等）被永久转移到其它 URL
404 - 请求的资源（网页等）不存在
500 - 内部服务器错误

02. 网络请求模块requests

requests 模块是可以模仿浏览器发送请求获取响应，能够自动帮助我们解压网页内容。

requests模块能够自动帮助我们解压网页内容

基本使用

import requests
def fetch(url):
    '''
    url: 要请求的网站url
    '''
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}  # 请求时携带的header
    resp = requests.get(url, headers=headers, verify=False)  # verify忽略https的ssl验证
    resp.encoding = resp.apparent_encoding  # 猜测编码，防乱码
    return resp.text
fetch('https://www.baidu.com')

response 常用属性：

response.text 返回响应内容，响应内容为 str 类型
respones.content 返回响应内容,响应内容为 bytes 类型
response.status_code 返回响应状态码
response.headers 返回响应头
response.cookies 返回响应的 RequestsCookieJar 对象

发送 GET 请求

# 导入模块
import requests
# 定义请求地址
url = 'http://www.baidu.com/s'
# 定义自定义请求头
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# 定义 GET 请求参数
params = {
  "kw":"hello"
}
# 使用 GET 请求参数发送请求
response = requests.get(url,headers=headers,params=params)
# 获取响应的 html 内容
html = response.text

发送请求时 params 参数作为 GET 请求参数

发送 POST 请求

# 导入模块
import requests
# 定义请求地址
url = 'http://www.baidu.com'
# 定义自定义请求头
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# 定义post请求参数
data = {
  "kw":"hello"
}
# 使用 POST 请求参数发送请求
response = requests.post(url,headers=headers,data=data)
# 获取响应的 html 内容
html = response.text

发送请求时 data 参数作为 POST 请求参数

使用代理服务器

作用：

让服务器以为不是同一个客户端在请求
防止我们的真实地址被泄露，防止被追究

代理分类：

透明代理(Transparent Proxy)：透明代理虽然可以直接“隐藏”你的IP地址，但是还是可以查到你是谁。
匿名代理(Anonymous Proxy)：匿名代理比透明代理进步了一点：别人只能知道你用了代理，无法知道你是谁。
混淆代理(Distorting Proxies)：与匿名代理相同，如果使用了混淆代理，别人还是能知道你在用代理，但是会得到一个假的IP地址，伪装的更逼真
高匿代理(Elite proxy或High Anonymity Proxy)：可以看出来，高匿代理让别人根本无法发现你是在用代理，所以是最好的选择。

在使用的使用，毫无疑问使用高匿代理效果最好

使用方式

# 导入模块
import requests
# 定义请求地址
url = 'http://www.baidu.com'
# 定义自定义请求头
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# 定义 代理服务器
proxies = {
  "http":"http://IP地址:端口号",
  "https":"https://IP地址:端口号"
}
# 使用 POST 请求参数发送请求
response = requests.get(url,headers=headers,proxies=proxies)
# 获取响应的 html 内容
html = response.text

发送请求携带 Cookies

# 导入模块
import requests
# 定义请求地址
url = 'http://www.baidu.com'
# 定义自定义请求头
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
  # 方式一：直接在请求头中携带Cookie内容
  "Cookie": "Cookie值"
}
# 方式二：定义 cookies 值
cookies = {
  "xx":"yy"
}
# 使用 POST 请求参数发送请求
response = requests.get(url,headers=headers,cookies=cookies)
# 获取响应的 html 内容
html = response.text

03. JSON数据提取

JSON (JavaScript Object Notation) 是一种轻量级的数据交换格式，它使得人们很容易的进行阅读和编写。同时也方便了机器进行解析和生成。适用于进行数据交互的场景，比如网站前台与后台之间的数据交互。

JSON相关的方法：

json.loads json字符串转 Python数据类型
json.dumps Python数据类型转 json字符串
json.load json文件转 Python数据类型
json.dump Python数据类型转 json文件

JSON数据提取代码示例：

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# 导入模块
import json
# json.loads json字符串 转 Python数据类型
json_string = '''
{
    "name": "crise",
    "age": 18,
    "parents": {
        "monther": "妈妈",
        "father": "爸爸"
    }
}
'''
print("json_string数据类型：",type(json_string))
data = json.loads(json_string)
print("data数据类型：",type(data))
# json.dumps Python数据类型 转 json字符串
data = {
    "name": "crise",
    "age": 18,
    "parents": {
        "monther": "妈妈",
        "father": "爸爸"
    }
}
print("data数据类型：",type(data))
json_string = json.dumps(data)
print("json_string数据类型：",type(json_string))

04. re 正则表达式提取数据 - 正则表达式提取数据.md)

正则表达式就是记录文本规则的代码

re 模块常用方法：

match 开头匹配，只匹配一次
search 全局匹配，只匹配一次
findall 匹配所有符号条件的数据，返回是结果列表
优点：使用简单，缺点：必须把所有数据搜索返回再返回（如果数据量小）
finditer 迭代对象，迭代 Match 对象
优点：找到就返回，可以边找边返回（如果数据量大）

05. aiohttp异步爬虫

异步爬虫不同于多进程爬虫，它使用单线程(即仅创建一个事件循环，然后把所有任务添加到事件循环中)就能并发处理多任务。在轮询到某个任务后，当遇到耗时操作(如请求URL)时，挂起该任务并进行下一个任务，当之前被挂起的任务更新了状态(如获得了网页响应)，则被唤醒，程序继续从上次挂起的地方运行下去。极大的减少了中间不必要的等待时间。

参考1 参考2

aiohttp 基本介绍

aiohttp 可以用来实现异步网页请求以及作为异步服务器。aiohttp 是一个利用asyncio的库，性能会比同步的模块提高不少。

aiohttp改写 requests 爬虫代码：

import asyncio
import aiohttp
async def fetch(url):
    headers = {'content-type': 'application/json'}
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers, verify_ssl=False) as resp:
            return await resp.text(errors='ignore')
if __name__ == "__main__":
    start_time = time.time()
    loop = asyncio.get_event_loop()
    get_future = asyncio.ensure_future(fetch('https://www.baidu.com')) # ?
    loop.run_until_complete(get_future)
    print(get_future.result())
    loop.close()

批量url爬取

有大量url时，每个都开一个会话显然是不好的，于是会写成这样：

import asyncio
import aiohttp
async def fetch(session, id_, url):
    headers = {'content-type': 'application/json'}
    data = json.dumps({"XX": "YY"})
    try:
        async with session.post(url, data=data, headers=headers, verify_ssl=False) as resp:
            assert resp.status == 200
            print(resp.charset)
            print(await resp.text())
            print(await resp.read())
            # print(await resp.json())    # aiohttp.client_exceptions.ContentTypeError: 0, message='Attempt to decode JSON with unexpected mimetype: text/plain;charset=utf-8'
            # The correct way of use is as follows:
            res = await resp.json(content_type=None)
            return res
    except Exception:
        print(f"{id_}, url: {url} error happened:")
        print(traceback.format_exc())
async def fetch_all(urls):
    '''
    urls: list[(id_, url)]
    '''
    async with aiohttp.ClientSession() as session:
        datas = await asyncio.gather(*[fetch(session, id_, url) for id_, url in urls], return_exceptions=True)
        for ind, data in enumerate(urls):
            id_, url = data
            if isinstance(datas[ind], Exception):
                print(f"{id_}, {url}: ERROR")
        return datas
urls = [(i, f'https://www.baidu.com/?tn={i}') for i in range(100)]
loop = asyncio.new_event_loop()  
# loop = asyncio.get_event_loop()
task = loop.create_task(fetch_all(urls))
loop.run_until_complete(task)
print(task.result())    # save data
loop.close()

错误处理

如果url太多，可能会报错ValueError: too many file descriptors in select()，根据stackoverflow所述，aiohttp默认设置中一次可以打开100个连接，而Windows一次最多只能打开64个socket，所以可以在fetch_all中添加一行：

connector = aiohttp.TCPConnector(limit=60)  # 60小于64。也可以改成其他数
async with aiohttp.ClientSession(connector=connector) as session:
    ...

此外，为防止爬的过客，可以使用信号量来控制：

import time
from lxml import etree
import aiohttp
import asyncio
sem = asyncio.Semaphore(10) # 信号量，控制协程数，防止爬的过快
async def get_title(url):
    with(await sem):
        async with aiohttp.ClientSession() as session:
            async with session.request('GET', url) as resp:  # 提出请求
                html = await resp.read()

参考

从零到一 Python 爬虫

Python实战异步爬虫(协程)+分布式爬虫(多进程)

4.2 加速爬虫: 异步加载 Asyncio

用aiohttp写爬虫

技术笔记

2019-08-13-python爬虫笔记

layout: posttitle: python爬虫笔记
subtitle: 基础知识、requests、asyncio/aiohttp异步爬虫
date: 2019-08-13
author: NSX
header-img: img/post-bg-ios9-web.jpg
catalog: true
tags:
- Python
- 教程
- 爬虫

python爬虫笔记

01. 爬虫预备知识

爬虫定义

爬虫流程

网络模型对应关系

url 地址格式

HTTP 请求

HTTP 响应

02. 网络请求模块requests

基本使用

发送 GET 请求

发送 POST 请求

使用代理服务器

发送请求携带 Cookies

03. JSON数据提取

04. re 正则表达式提取数据 - 正则表达式提取数据.md)

正则表达式就是记录文本规则的代码

05. aiohttp异步爬虫

aiohttp 基本介绍

批量url爬取

错误处理

参考

2019-08-13-python爬虫笔记

layout: posttitle: python爬虫笔记subtitle: 基础知识、requests、asyncio/aiohttp异步爬虫date: 2019-08-13author: NSXheader-img: img/post-bg-ios9-web.jpgcatalog: truetags:- Python- 教程- 爬虫

python爬虫笔记

01. 爬虫预备知识

爬虫定义

爬虫流程

网络模型对应关系

url 地址格式

HTTP 请求

HTTP 响应

02. 网络请求模块requests

基本使用

发送 GET 请求

发送 POST 请求

使用代理服务器

发送请求携带 Cookies

03. JSON数据提取

04. re 正则表达式提取数据 - 正则表达式 提取数据.md)

正则表达式就是记录文本规则的代码

05. aiohttp异步爬虫

aiohttp 基本介绍

批量url爬取

错误处理

参考

layout: posttitle: python爬虫笔记
subtitle: 基础知识、requests、asyncio/aiohttp异步爬虫
date: 2019-08-13
author: NSX
header-img: img/post-bg-ios9-web.jpg
catalog: true
tags:
- Python
- 教程
- 爬虫

04. re 正则表达式提取数据 - 正则表达式提取数据.md)