Simple Crawler Basics - Training a Crawler? - 《python全栈教学知识库》 (Python Full-Stack Teaching Knowledge Base)

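Training a Crawler?

This really means training your own crawling skills: start with simple web pages, pick up a range of skills (programming languages, HTML, CSS, JS, storage, concurrency, frameworks, and so on), build experience through practice, and gradually work up to crawling any type of website.

Here is a copy of the Crawler Training Handbook (《爬虫训练手册》) for you.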


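- Covers crawler concepts and techniques in order: preparation before crawling, the crawl itself, and analysis after crawling
- Version: v0.1
- Author: De8ug
- Made with: https://mubu.com/inv/457189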

**Preparation**

- Legal awareness for crawling
  - Criminal Law, Cybersecurity Law
  - Related cases
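- Crawler types (★ marks the skill level required later in this outline):
  - Small crawlers: assorted libraries ★
  - Medium crawlers: frameworks ★★★
  - Large crawlers: search engines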
- Purpose
  - Solve the data-source problem
  - Do industry analysis
  - Automate operations
  - Build a search engine
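- Target types
  - News / blogs / Weibo
    - Images
    - News
    - Comments
  - Movies and video
    - Video
    - Comments: video loaded by the player, dynamically loaded comments, offset values in the API
  - Music
    - Audio
    - Comments
  - E-commerce
    - Short product links, redirects resolved by JS
    - Prices
    - Comments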
- Common crawler toolkit
  - Front end ★
    - http://www.w3school.com.cn/h.asp
    - html: http://devdocs.io/html/
    - dom: http://www.w3school.com.cn/htmldom/dom_nodes.asp
    - css: http://devdocs.io/css/
    - js: http://devdocs.io/javascript/
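  - Request flow ★
    - Data transfer: client - server
    - request: the browser sends data to the server
    - response: the browser receives the data returned by the server
    - Content types and status codes: http://tool.oschina.net/commons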
  - Languages ★
    - python
    - go
    - php
    - nodejs
  - xpath ★:
    - http://www.w3school.com.cn/xpath/xpath_syntax.asp
    - XPath: //*[@id="maincontent"]/div[5]/p[2]
    - CSS selector: #maincontent > div:nth-child(6) > p:nth-child(3)
  - lxml:
    - http://lxml.de/
    - pip install lxml
  - requests ★:
    - http://docs.python-requests.org/zh_CN/latest/
    - pip install requests
    - get
    - post
    - response
      - text
      - json
  - re ★
    - findall
    - Notes on the pattern symbols
  - bs4 ★
    - https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
    - pip install bs4
    - The soup object
    - find_all
    - get
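
The `re` and `findall` entries in the toolkit above can be illustrated with a minimal sketch. The HTML snippet, the pattern, and the helper name `extract_links` are assumptions made up for this example, not part of the handbook, and a regular expression like this only copes with simple, regular markup; for real pages the bs4 or lxml entries are the more robust route.

```python
import re

# Hypothetical HTML snippet standing in for a downloaded page.
HTML = """
<div id="maincontent">
  <a href="/news/1.html">First story</a>
  <a href="/news/2.html">Second story</a>
  <a href="https://example.com/about">About</a>
</div>
"""

# Capture the href value and the link text of every <a> tag.
# re.S lets "." match newlines in case the link text spans lines.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.S)


def extract_links(html):
    """Return a list of (href, text) tuples found in the HTML string."""
    return LINK_RE.findall(html)


for href, text in extract_links(HTML):
    print(href, text)
```

Because the pattern has two capture groups, `findall` returns one tuple per match rather than the whole matched string.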

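The xpath entry in the toolkit above selects a node by `id` and then walks down by position. As a rough sketch of that idea, the block below uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup; it is a stand-in chosen so the example has no third-party dependency, not the lxml API the outline points to. The sample document and the helper name `select_text` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this example.
PAGE = """
<html><body>
  <div id="maincontent">
    <div><p>intro</p></div>
    <div><p>first</p><p>second</p></div>
  </div>
</body></html>
"""


def select_text(xml_text, path):
    """Return the text of every element matched by a limited-XPath path."""
    root = ET.fromstring(xml_text.strip())
    return [element.text for element in root.findall(path)]


# Positions in XPath are 1-based: div[2] is the second <div> child,
# p[2] the second <p> inside it.
print(select_text(PAGE, ".//*[@id='maincontent']/div[2]/p[2]"))
```

Real-world HTML is rarely well-formed XML, which is one reason the outline lists lxml and bs4 for actual pages.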
**Crawling**

- Static page crawling workflow ★
  - Analyze the page structure
  - Extract data with the HTML DOM and CSS selectors
- Dynamic page scraping
  - Simulate clicks and browse the page like a real user ★
  - Use a browser driver to scroll down and load more content ★
  - Analyze and use the API
  - selenium ★
    - firefox
      - https://github.com/mozilla/geckodriver/releases/tag/v0.20.1
    - chrome
      - https://chromedriver.storage.googleapis.com/index.html?path=2.38/
  - splash + scrapy-splash
    - http://scrapy-cookbook.readthedocs.io/zh_CN/latest/scrapy-01.html
  - phantomjs
    - http://phantomjs.org/download.html
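- Simulated login
  - CAPTCHAs: try to keep them from appearing
  - session & cookies
- Distributed crawling ★★★
  - Message queue
  - Multiple workers
  - celery
  - Task allocation: how much memory and how many processes to assign
  - RPC service: a single place to put crawl results
  - HTTP service: receives results of different data types
  - Batch inserts into the database to reduce load
  - scrapy + redis + mongodb for distributed crawler deployment
  - docker
  - Version control: bump the version number automatically
  - Configuration management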
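- Common problems
  - Storing large volumes of data
  - Scraping mobile clients
  - Download rate limiting
  - Infinite loops (crawler traps): limit the crawl depth
  - Network latency and reconnecting after a drop: resume by id or by timestamp
  - Removing duplicate links
  - Anti-crawler measures
  - Dirty data: a decorator on the server detects crawlers and returns fake data
  - Data completeness: capture results that may be hidden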
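- Anti-crawler measures (do not crawl too frequently; know both offense and defense)
  - Server (defense)
    - Confirm real users with multiple parameters dedicated to identifying crawlers
    - Use hidden links to lure crawlers, then return fake data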

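    - Control the access frequency per IP
      - Free / paid proxies
      - Port forwarding
      - IP proxies
    - UA: request headers that never vary give the crawler away
    - Referer: change the URL the request claims to come from
    - CAPTCHA-recognition platforms for the really nasty CAPTCHAs
    - Rotate requests across multiple accounts, or
    - Libraries and frameworks
      - scrapy ★★★
      - requests
      - bs4
      - lxml
      - xpath
      - urllib/urllib2 (the approach when avoiding third-party dependencies)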
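
The UA, Referer, and "do not crawl too frequently" points above can be combined in a small sketch that uses only the standard library's `urllib.request` (the Python 3 successor to urllib/urllib2). The User-Agent strings, the one-second delay, and the helper names `build_request` and `fetch` are assumptions for illustration; appropriate values depend on the target site and its terms of use.

```python
import itertools
import time
import urllib.request

# Hypothetical User-Agent strings; replace with values seen from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def build_request(url, referer=None):
    """Build a Request whose User-Agent rotates on every call."""
    headers = {"User-Agent": next(_ua_cycle)}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def fetch(url, referer=None, delay=1.0, timeout=10):
    """Wait `delay` seconds, then download the page body as bytes."""
    time.sleep(delay)  # crude throttle so requests are not sent too frequently
    request = build_request(url, referer)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

A call such as `fetch("https://example.com/list", referer="https://example.com/")` would then send one throttled request with a rotated User-Agent; proxies and account rotation from the list above are left out of this sketch.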

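Two of the common problems listed in this part, crawler traps (limit the depth) and duplicate links, can be sketched together as a breadth-first crawl that records visited URLs and stops expanding past a maximum depth. To keep the example runnable offline, the "site" is a hypothetical in-memory dictionary and `get_links` is a made-up stand-in for downloading and parsing a page.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

# Hypothetical in-memory "site": page URL -> links found on that page.
FAKE_SITE = {
    "http://example.com/": ["/a", "/b", "/a#top"],
    "http://example.com/a": ["/b", "/"],
    "http://example.com/b": ["/c"],
    "http://example.com/c": ["/d"],
    "http://example.com/d": [],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return FAKE_SITE.get(url, [])


def crawl(start_url, max_depth=2):
    """Breadth-first crawl that skips seen URLs and respects max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited_order = []
    while queue:
        url, depth = queue.popleft()
        visited_order.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper (guards against traps)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" so duplicates match.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited_order


print(crawl("http://example.com/", max_depth=2))
```

An in-memory `set` is enough for a small crawl; the distributed setup in the outline would more plausibly keep the seen-set in a shared store such as redis.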
**Post-crawl analysis**

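- Common problems
  - Storing large volumes of data
  - Analyzing dirty data
- Libraries and frameworks
  - csv ★
  - mongodb ★★★: no fixed data schema required
  - redis ★★★
  - MySQL ★★★
- Optimization
  - Choose a suitable database and optimize indexes: mongodb or mysql
  - Number of requests: batch whenever you can
  - Adjust concurrency to fit the need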
- Data analysis
  - jieba
    - https://github.com/fxsjy/jieba
  - numpy
    - https://docs.scipy.org/doc/numpy/user/quickstart.html
  - pandas
    - http://pandas.pydata.org/pandas-docs/stable/install.html
  - matplotlib
    - https://matplotlib.org/users/index.html
  - plotnine
    - http://plotnine.readthedocs.io/en/stable/tutorials.html
  - seaborn
    - https://seaborn.pydata.org/examples/index.html
  - mizani
    - http://mizani.readthedocs.io/en/stable/breaks.html
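
For the csv entry under libraries and frameworks, a minimal sketch of saving scraped records and reading them back might look like the following. The sample rows, field names, and helper names are assumptions, and an in-memory buffer is used instead of a file so the example runs anywhere.

```python
import csv
import io

# Hypothetical scraped records.
ROWS = [
    {"title": "First story", "url": "http://example.com/news/1.html", "comments": 12},
    {"title": "Second story", "url": "http://example.com/news/2.html", "comments": 3},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text):
    """Parse CSV text back into a list of dicts (all values become strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


csv_text = rows_to_csv(ROWS, ["title", "url", "comments"])
print(csv_text)
```

When writing to a real file, the same `DictWriter` works with `open(path, "w", newline="", encoding="utf-8")`; note that numbers come back as strings and need converting before analysis.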

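The "batch whenever you can" advice under optimization can be sketched with the standard library's `sqlite3`, used here purely as a stand-in for MySQL or mongodb so the example needs no server. The table layout and the helper names `open_store` and `save_batch` are assumptions; the idea is one `executemany` call and one commit per batch instead of one round trip per row.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database and create a hypothetical `pages` table."""
    conn = sqlite3.connect(path)
    # The primary key gives an index on url and rejects duplicate rows.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    return conn


def save_batch(conn, rows):
    """Insert many (url, title) rows in a single transaction."""
    with conn:  # commits once at the end, or rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO pages (url, title) VALUES (?, ?)", rows
        )


conn = open_store()
save_batch(conn, [
    ("http://example.com/news/1.html", "First story"),
    ("http://example.com/news/2.html", "Second story"),
    ("http://example.com/news/1.html", "Duplicate of the first story"),
])
print(conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0])
```

`INSERT OR IGNORE` is SQLite syntax; other databases spell the same idea differently, so treat the SQL as illustrative.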