Web Scraping - "Deep Learning"

Basics
The versatile Requests
Cookie
Session
Downloading files
- Using urlretrieve
- Using requests
The asyncio library
aiohttp
Selenium
- Controlling the browser with Python
Scrapy

Basics

from bs4 import BeautifulSoup
from urllib.request import urlopen
# the page contains Chinese text, so decode the raw bytes as UTF-8
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)
soup = BeautifulSoup(html, features='lxml')
print(soup.h1)
"""
<h1>爬虫测试1</h1>
"""
all_href = soup.find_all('a')
all_href = [l['href'] for l in all_href]
print('\n', all_href)
# ['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/scraping']
month = soup.find_all('li', {"class": "month"})
for m in month:
    print(m.get_text())
"""
一月
二月
三月
四月
五月
"""
jan = soup.find('ul', {"class": 'jan'})
d_jan = jan.find_all('li')              # use jan as a parent
for d in d_jan:
    print(d.get_text())
"""
一月一号
一月二号
一月三号
"""

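The same kind of extraction can be tried offline. The sketch below is a stand-in, not BeautifulSoup itself: it uses the standard library's html.parser on a small inline page. SAMPLE_HTML and LinkAndClassParser are hypothetical names made up for this illustration, and the page content is an assumption modelled on the tutorial page above.

```python
from html.parser import HTMLParser

# Hypothetical sample page, shaped like the tutorial page used above.
SAMPLE_HTML = """
<html><body>
<h1>爬虫测试1</h1>
<a href="https://morvanzhou.github.io/">home</a>
<a href="https://morvanzhou.github.io/tutorials/scraping">scraping</a>
<ul>
  <li class="month">一月</li>
  <li class="month">二月</li>
  <li>not a month</li>
</ul>
</body></html>
"""


class LinkAndClassParser(HTMLParser):
    """Collect href values of <a> tags and the text of <li class="month"> items."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.months = []
        self._in_month = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])
        elif tag == "li" and attrs.get("class") == "month":
            self._in_month = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_month = False

    def handle_data(self, data):
        if self._in_month and data.strip():
            self.months.append(data.strip())


# Pages arrive as bytes; decoding them as UTF-8 keeps the Chinese text intact.
raw_bytes = SAMPLE_HTML.encode("utf-8")
parser = LinkAndClassParser()
parser.feed(raw_bytes.decode("utf-8"))
print(parser.hrefs)
print(parser.months)
```

BeautifulSoup's find_all hides this bookkeeping, which is why the notes use it for real pages.
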
The versatile Requests

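post
- Logging in to an account
- Searching for content
- Uploading images
- Uploading files
- Sending data to the server, and so on
get
- Opening a web page normally
- Not submitting data to the server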

import requests
import webbrowser
param = {"wd": "莫烦Python"}  # the search term
r = requests.get('http://www.baidu.com/s', params=param)
print(r.url)
webbrowser.open(r.url)
# http://www.baidu.com/s?wd=%E8%8E%AB%E7%83%A6Python
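
The query string in the printed URL can be reproduced with the standard library alone, which may help when checking what params= is expected to send. This is a minimal sketch and makes no request; it only builds and decodes the URL shown above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"wd": "莫烦Python"}
query = urlencode(params)            # non-ASCII characters are percent-encoded as UTF-8
url = "http://www.baidu.com/s?" + query
print(url)

# Decoding goes the other way and recovers the original search term.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```
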

Cookie

payload = {'username': 'Morvan', 'password': 'password'}
r = requests.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
# the page you land on after logging in with POST
print(r.cookies.get_dict())
# {'username': 'Morvan', 'loggedin': '1'}
r = requests.get('http://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)
# a page that requires the login cookies, fetched with GET
print(r.text)
# Hey Morvan! Looks like you're still logged into the site!

Session

session = requests.Session()
payload = {'username': 'Morvan', 'password': 'password'}
r = session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
print(r.cookies.get_dict())
# {'username': 'Morvan', 'loggedin': '1'}
r = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(r.text)
# Hey Morvan! Looks like you're still logged into the site!
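
Conceptually, a session remembers the cookies a server sets and sends them back on later requests, which is why the second call above needs no cookies= argument. The sketch below is not how requests implements Session. It is a standard-library illustration only, and TinyCookieStore and the sample Set-Cookie header strings are assumptions invented for the example.

```python
from http.cookies import SimpleCookie


class TinyCookieStore:
    """Hypothetical helper: keep cookie values seen in Set-Cookie headers."""

    def __init__(self):
        self.cookies = {}

    def remember(self, set_cookie_headers):
        # Parse each Set-Cookie header and keep only the name/value pair.
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def cookie_header(self):
        # Build the Cookie header a client would send on the next request.
        return "; ".join(f"{name}={value}" for name, value in self.cookies.items())


store = TinyCookieStore()
store.remember(["username=Morvan; Path=/", "loggedin=1; Path=/"])
print(store.cookies)
print(store.cookie_header())
```
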

Downloading files

import os
os.makedirs('./img/', exist_ok=True)
IMAGE_URL = "https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png"

Using urlretrieve

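The urllib module provides a download function, urlretrieve. It is simple to use: pass the download address IMAGE_URL and the location to save to, and the image is downloaded there automatically.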

from urllib.request import urlretrieve
urlretrieve(IMAGE_URL, './img/image1.png')
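
urlretrieve also accepts an optional reporthook callback, which is called with the block number, block size, and total size. The sketch below shows one way to record progress with it. make_progress_hook and download_with_progress are hypothetical helper names, not part of urllib.

```python
from urllib.request import urlretrieve


def make_progress_hook(log):
    """Return a reporthook that appends the bytes-so-far estimate to `log`."""
    def hook(block_num, block_size, total_size):
        downloaded = block_num * block_size
        if total_size > 0:
            # The last block is usually partial, so cap at the real size.
            downloaded = min(downloaded, total_size)
        log.append(downloaded)
    return hook


def download_with_progress(url, path):
    """Download `url` to `path` and return the list of progress values."""
    progress = []
    urlretrieve(url, path, reporthook=make_progress_hook(progress))
    return progress
```

Calling download_with_progress(IMAGE_URL, './img/image1.png') would save the same file as above while returning the recorded progress values.
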

Using requests

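The requests module can also be used for downloads. The code below does the same thing as the code above, only slightly longer. So why mention downloading with requests at all? Because another way of using it lets us download large files more efficiently.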

import requests
r = requests.get(IMAGE_URL)
with open('./img/image2.png', 'wb') as f:
    f.write(r.content)

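So if you are downloading a large file, such as a video, requests lets you download a bit and save a bit, instead of holding the whole download in memory before saving it. This is chunk-by-chunk downloading. Use r.iter_content(chunk_size) to control the size of each chunk, then write each chunk of data to the file.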

r = requests.get(IMAGE_URL, stream=True)    # stream loading
with open('./img/image3.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)
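
The write loop can be separated from the network call so that it is easy to try offline. The sketch below makes that split under stated assumptions: write_chunks and split_bytes are hypothetical helpers, and split_bytes only imitates what r.iter_content(chunk_size) yields.

```python
def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the number of bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:                 # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


def split_bytes(data, chunk_size):
    """Stand-in for r.iter_content(chunk_size): yield `data` in fixed-size pieces."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```

With a real response, write_chunks(r.iter_content(chunk_size=8192), './img/image3.png') would do the same job as the loop above. A chunk size in the kilobyte range is a common choice, since 32 bytes means a very large number of tiny writes for a big file.
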

The asyncio library

import asyncio
import time
async def job(t):                   # an async function (coroutine)
    print('Start job ', t)
    await asyncio.sleep(t)          # wait "t" seconds; other tasks run in the meantime
    print('Job ', t, ' takes ', t, ' s')
async def main(loop):                       # an async function (coroutine)
    tasks = [
        loop.create_task(job(t)) for t in range(1, 3)
    ]                                       # create the tasks; none has finished yet
    await asyncio.wait(tasks)               # run and wait until all tasks are done
t1 = time.time()
loop = asyncio.new_event_loop()             # create the loop
loop.run_until_complete(main(loop))         # run the loop
loop.close()                                # close the loop
print("Async total time : ", time.time() - t1)
"""
Start job  1
Start job  2
Job  1  takes  1  s
Job  2  takes  2  s
Async total time :  2.001495838165283
"""

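On Python 3.7 and later, the same idea can be written without managing the loop by hand. The sketch below uses asyncio.run and asyncio.gather. The short delays and the run_jobs name are assumptions chosen to keep the example quick, and job here returns its delay instead of printing.

```python
import asyncio
import time


async def job(t):
    await asyncio.sleep(t)      # yield control while "working" for t seconds
    return t


async def run_jobs(delays):
    # gather schedules all coroutines concurrently and returns results in input order
    return await asyncio.gather(*(job(t) for t in delays))


start = time.perf_counter()
results = asyncio.run(run_jobs([0.1, 0.2, 0.3]))
elapsed = time.perf_counter() - start
print(results, "finished in about", round(elapsed, 2), "s")
```

Because the jobs overlap, the total time is close to the longest delay rather than the sum of all delays.
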
aiohttp

import asyncio
import time
import aiohttp

URL = 'https://morvanzhou.github.io/'       # the page that appears in the output below


async def job(session):
    response = await session.get(URL)       # wait here and switch to other tasks
    return str(response.url)


async def main(loop):
    async with aiohttp.ClientSession() as session:      # the official docs recommend creating a Session
        tasks = [loop.create_task(job(session)) for _ in range(2)]
        finished, unfinished = await asyncio.wait(tasks)
        all_results = [r.result() for r in finished]    # collect all results
        print(all_results)

t1 = time.time()
loop = asyncio.new_event_loop()
loop.run_until_complete(main(loop))
loop.close()
print("Async total time:", time.time() - t1)

"""
['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
Async total time: 0.11447715759277344
"""

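We just created a Session, which is the approach the official docs recommend, but I think you could also use the request form directly; see the official documentation for details. To get what a page returns, we can return a result from job(), then collect the completed results from finished, unfinished = await asyncio.wait(tasks). This returns both the finished and the unfinished tasks. We only care about the finished ones, and await does indeed return only after all of them are done. The actual results are stored in result().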
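
One detail worth knowing when collecting results: asyncio.wait returns sets, so iterating over finished gives no guaranteed order. The sketch below illustrates this without any network access. fake_fetch is a hypothetical stand-in for a real aiohttp request, and the example.com URLs are placeholders.

```python
import asyncio


async def fake_fetch(url, delay):
    """Hypothetical stand-in for an HTTP request: sleep, then return the URL."""
    await asyncio.sleep(delay)
    return url


async def collect():
    urls = ["https://example.com/a", "https://example.com/b"]
    delays = [0.02, 0.01]       # the second "request" finishes first
    tasks = [asyncio.create_task(fake_fetch(u, d)) for u, d in zip(urls, delays)]
    finished, unfinished = await asyncio.wait(tasks)
    from_wait = {t.result() for t in finished}      # a set: no ordering promised
    in_order = [t.result() for t in tasks]          # original task list keeps request order
    return from_wait, unfinished, in_order


from_wait, unfinished, in_order = asyncio.run(collect())
print(from_wait, unfinished, in_order)
```

Reading results from the original tasks list is one simple way to match each result to the request that produced it.
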

Selenium

If you run into any problems during installation, look for a solution on their official websites.

Here is a lazy trick that uses the Firefox browser, because for now this plugin is only available for Firefox. The plugin is called Katalon Recorder.

Controlling the browser with Python

With the recorded code in hand, we can go back to Python and start writing Python code. It is very simple! I bind selenium to Chrome with webdriver.Chrome(). You can bind it to other browsers as well.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()     # open the Chrome browser

# paste the code you just copied here
# (Selenium 4 removed find_element_by_*; use find_element(By..., ...) instead)
driver.get("https://morvanzhou.github.io/")
driver.find_element(By.XPATH, "//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element(By.LINK_TEXT, "About").click()
driver.find_element(By.LINK_TEXT, "赞助").click()
driver.find_element(By.LINK_TEXT, "教程 ▾").click()
driver.find_element(By.LINK_TEXT, "数据处理 ▾").click()
driver.find_element(By.LINK_TEXT, "网页爬虫").click()

# get the page html; you can also take a screenshot
html = driver.page_source       # get html
driver.get_screenshot_as_file("./img/screenshot1.png")
driver.close()

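However, having to watch the browser perform these actions every time can be inconvenient. We can tell selenium not to open a browser window and to do its work "quietly". Defining a few options before creating the driver is enough to get rid of the visible browser.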

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")       # define headless

driver = webdriver.Chrome(options=chrome_options)   # the old chrome_options= keyword is no longer accepted in Selenium 4
...

Scrapy

Scrapy official website
Its own website
Official tutorial: English, Chinese
JasonDing's "Getting Started with Scrapy"
young-hz's "Scrapy Research and Exploration" series

A typical scrapy project also contains a number of other files (see below).

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py