- 1. Tutorial Guide
- 2. Project Practice
- (1) Installing Scrapy
- (2) Project Architecture
- (3) Creating a Project
- (4) Spider Logic
- (5) Running the Project
- (6) What Actually Happened?
- (7) Recursive Crawling
- (8) Debugging Scraping Code
- 3. Scrapy Shell Tutorial
- 4. Shell Practice
1. Tutorial Guide
Note that these docs are not necessarily for the latest version.
Chinese version (somewhat behind the English docs):
https://www.osgeo.cn/scrapy/index.html#
English version:
https://docs.scrapy.org/en/latest/
2. Project Practice
(1) Installing Scrapy
1. Install Python, version 3.4 or above (3.7 works well).
2. Add Python to your PATH:
python\
python\Scripts
3. Install Scrapy:
pip install Scrapy
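A quick sanity check after installation (assuming Scrapy was installed into the active Python environment):
```python
# Confirm Scrapy is importable and print its version
import scrapy
print(scrapy.__version__)
```
Running `scrapy version` on the command line prints the same information.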
(2) Project Architecture
Above is Scrapy's architecture diagram; each component is briefly introduced below (a small code sketch showing how they fit together follows the list).
- Scrapy Engine: the engine drives the data flow through the whole system and triggers events; it is the core of the framework.
- Scheduler: the scheduler accepts Requests sent over by the engine, pushes them onto a queue, and hands them back when the engine asks for the next request.
- Downloader: the downloader fetches the page content for each Request sent by the engine and returns the resulting Response to the Spider.
- Spiders: spiders process Responses, extracting the required fields (Items) from them; they can also extract further links from the Responses for Scrapy to keep crawling.
- Item Pipeline: the pipeline handles the items produced by the Spider, cleaning the data and persisting what is needed.
- Downloader Middlewares: downloader middlewares mainly process the requests and responses passing between the Scrapy engine and the downloader.
- Spider Middlewares: spider middlewares mainly process the Spider's Responses and Requests.
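As a rough, illustrative sketch (DemoSpider and DemoPipeline are invented for this example, not part of a generated project), the following shows where each component touches a request, response, or item:
```python
import scrapy

class DemoSpider(scrapy.Spider):          # Spider: produces Requests and Items
    name = "demo"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # The Engine delivered this Response here after the Scheduler queued
        # the Request and the Downloader fetched the page.
        for quote in response.css("div.quote"):
            # Items go back to the Engine and on to the Item Pipeline.
            yield {"text": quote.css("span.text::text").get()}
        for a in response.css("li.next a"):
            # New Requests go back through the Engine to the Scheduler.
            yield response.follow(a, callback=self.parse)

class DemoPipeline:                       # Item Pipeline: cleans / stores Items
    def process_item(self, item, spider):
        item["text"] = (item["text"] or "").strip()
        return item
```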
(3) Creating a Project
scrapy.exe lives under python\Scripts, so the scrapy command can be run from anywhere.
scrapy startproject tutorial
(4) Spider Logic
File structure:
projectname/
- scrapy.cfg — deployment file
- projectname/
  - __init__.py
  - settings.py — settings
  - items.py — models for the scraped data
  - pipelines.py — pipelines
  - middlewares.py — middlewares
  - spiders/
    - __init__.py
    - projectname.py — spider logic
### scrapy.cfg
The deployment file: it points Scrapy at the project's settings module and holds the [deploy] targets used by scrapyd. The emphasis of this file is deployment (deploy).
```ini
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = projectX.settings

[deploy]
url = http://localhost:6800/
project = projectX
```
### settings.py
The settings file. There are many settings; the links in the file header point to the complete documentation.
```python
# -*- coding: utf-8 -*-
# Scrapy settings for projectX project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random
BOT_NAME = 'projectX'
SPIDER_MODULES = ['projectX.spiders']
NEWSPIDER_MODULE = 'projectX.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'projectX (+http://www.yourdomain.com)'
user_agent_list = [
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
]
USER_AGENT = random.choice(user_agent_list)
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'projectX.middlewares.ProjectxSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'projectX.middlewares.ProjectxDownloaderMiddleware': 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'projectX.pipelines.ProjectxPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
### pipelines.py
Used when the data cannot be stored as the spider scrapes it, i.e. more complex data: for example, when the spider only extracts a download link and the file still needs to be downloaded.
Remember to register the pipeline in ITEM_PIPELINES in settings.py.
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class ProjectxPipeline:
def process_item(self, item, spider):
return item
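```
As a hedged illustration of the kind of cleanup a pipeline can do, here is a hypothetical TextCleanupPipeline (DropItem is real Scrapy API; the class itself is made up for this sketch and would still need to be registered in ITEM_PIPELINES):
```python
from scrapy.exceptions import DropItem

class TextCleanupPipeline:
    """Hypothetical pipeline: trims whitespace and drops items without text."""

    def process_item(self, item, spider):
        text = item.get('text')
        if not text:
            # DropItem removes the item from any further pipeline processing
            raise DropItem('missing text in item')
        item['text'] = text.strip()
        return item
```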
### middlewares.py
Middlewares. Look back at the architecture diagram above to understand where they sit in the request/response flow.
```python
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
class ProjectxSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, dict or Item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request, dict
# or Item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class ProjectxDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
```
### items.py
Defines the structure of the data you scrape.
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class ProjectxItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    # # movie title
    # title = scrapy.Field()
    # # Douban rating
    # star = scrapy.Field()
    # # cast information
    # Staring = scrapy.Field()
    # # Douban rank
    # rank = scrapy.Field()
    # # description
    # quote = scrapy.Field()
    # # Douban detail page URL
    # url = scrapy.Field()
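```
As a usage sketch (not generated by Scrapy, and assuming the projectX package from this tutorial), a spider can instantiate the item and yield it instead of a plain dict:
```python
from projectX.items import ProjectxItem

def parse(self, response):
    for quote in response.css('div.quote'):
        item = ProjectxItem()
        # Items behave like dicts, but only declared fields may be set
        item['name'] = quote.css('span.text::text').get()
        yield item
```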
### spiders/projectname.py
The file name is whatever you chose when generating the spider; it does not have to be this.
This is the spider's crawling logic: how the start URLs are produced, how pages are parsed, and so on.
```python
# -*- coding: utf-8 -*-
import scrapy
class ExampleSpider(scrapy.Spider):  # a scrapy.Spider subclass
    # Identifies the spider. It must be unique within a project:
    # you cannot give different spiders the same name.
    name = 'example'
    allowed_domains = ['example.com']

    # Shortcut for start_requests, as a list attribute:
    # start_urls = ['http://example.com/']

    # As a method, start_requests must return an iterable of Requests
    # (you can return a list of requests or write a generator function)
    # which the spider will begin to crawl from. Subsequent requests are
    # generated successively from these initial ones.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
        # Reading command-line arguments passed at run time:
        # url = 'http://quotes.toscrape.com/'
        # tag = getattr(self, 'tag', None)
        # if tag is not None:
        #     url = url + 'tag/' + tag
        # yield scrapy.Request(url, self.parse)

    # Called with the Response generated for each start URL once the download
    # finishes; the Response object is passed as the only argument.
    # This method parses the response data, extracts items, and generates
    # Request objects for URLs that need further crawling.
    def parse(self, response):
        for quote in response.css('div.quote'):
            # Emitted items appear in the log
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }
        # Following links, option 1: build a Request yourself (most verbose)
        for href in response.css('li.next a::attr(href)').getall():
            # The href may be relative, so join it against the current URL
            next_page = response.urljoin(href)
            # Call back into parse to crawl recursively
            yield scrapy.Request(next_page, callback=self.parse)
        # Option 2: response.follow(href_string)
        # for href in response.css('li.next a::attr(href)'):
        #     yield response.follow(href, callback=self.parse)
        # Option 3: response.follow(selector) - the most concise
        # for a in response.css('li.next a'):
        #     yield response.follow(a, callback=self.parse)
```
(5) Running the Project
#Run the quotes spider
scrapy crawl quotes
#Store the scraped data as JSON.
#This creates a quotes.json file in the current directory. For historical reasons, Scrapy appends to the given file instead of overwriting it, so if you run the command a second time without deleting the file first you end up with a corrupted JSON file.
scrapy crawl quotes -o quotes.json
#Passing command-line arguments: -a tag=humor
#The argument is passed to the spider's __init__ method and becomes an attribute of the spider.
scrapy crawl quotes -o quotes-humor.json -a tag=humor
#Store the scraped data as JSON Lines.
#JSON Lines is very handy because it is stream-like: the output file can simply be appended to without breaking the format, and each record sits on its own line, which makes large amounts of data easy to process.
scrapy crawl quotes -o quotes.jl
#The commands above are enough for data the spider stores directly.
#Storing more complex data such as images or videos requires an Item Pipeline (see pipelines.py).
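For images and other files specifically, Scrapy ships with built-in FilesPipeline and ImagesPipeline. A minimal sketch of enabling the files pipeline (the './downloads' path and the DownloadItem class are placeholders for this example):
```python
# settings.py -- enable the built-in files pipeline and pick a storage directory
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = './downloads'

# items.py -- the built-in pipeline reads 'file_urls' and fills in 'files'
import scrapy

class DownloadItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
```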
(6) What Actually Happened?
Scrapy schedules the scrapy.Request objects returned by the spider's start_requests method. As the response for each one arrives, it instantiates a Response object and calls the callback associated with the request (in this case, the parse method), passing the response as the argument.
(7) Recursive Crawling
You usually cannot stop at the handful of links in start_urls or start_requests; most of the time you want to crawl the whole site, which means extracting further links from the pages you have already crawled.
```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Extract the links in the page and follow them
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)

# response.follow accepts either a string href or a selector element.
# Note that the following is WRONG, because css() returns a list of selectors:
# response.follow(response.css('li.next a'), callback=self.parse)
# But this works:
# response.follow(response.css('li.next a')[0], callback=self.parse)
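```
On Scrapy 2.0 and later there is also response.follow_all, which accepts a list of selectors or a CSS/XPath query directly and yields one Request per matched link; a small sketch of the same parse method using it (assuming a Scrapy ≥ 2.0 install):
```python
# inside the spider class from above
def parse(self, response):
    # ... yield the quote items as above ...
    # follow_all builds one Request per matched "next" link
    yield from response.follow_all(css='li.next a', callback=self.parse)
```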
(8) Debugging Scraping Code
Once a spider is running it does not stop, so ordinary step-through debugging is impractical. The Scrapy shell lets you perform a single fetch from the command line, which makes debugging the extraction code very easy.
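Another option is dropping into the shell from inside a running spider with scrapy.shell.inspect_response; a small sketch of using it inside a parse callback (the surrounding spider is assumed to be the quotes spider from above):
```python
from scrapy.shell import inspect_response

def parse(self, response):
    if not response.css('div.quote'):
        # Opens an interactive shell preloaded with this exact response,
        # so you can try selectors against the page the spider actually saw.
        inspect_response(response, self)
```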
3. Scrapy Shell Tutorial
An interactive shell in which you can type code and test your extraction logic without running a spider; extremely useful for debugging scraping code.
(1) Configuring the Shell
1. Option 1
Set the SCRAPY_PYTHON_SHELL environment variable. Per the Scrapy docs its value is the name of the preferred shell (e.g. bpython or ipython) rather than a filesystem path; for reference, the shell implementation lives under:
Python\Lib\site-packages\scrapy
Python\Lib\site-packages\scrapy\commands
#both directories contain a shell.py
2. Option 2
Define it in scrapy.cfg:
[settings]
shell = bpython
(2) Fetching a Page Once
#Fetch a web page by URL.
#Remember to quote the URL: on Windows use double quotes, on Unix-like systems use single quotes.
scrapy shell 'http://quotes.toscrape.com/page/1/'
#Fetch a local file. Note that the shell tries to interpret the argument as a URL before falling back to a file path.
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html
#File URI
scrapy shell file:///absolute/path/to/file.html
scrapy shell index.html
#The last one does not work as intended, because index.html is interpreted as a URL (with "html" treated like a top-level domain such as "com").
The result looks like this:
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
[s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>>
(3) Extracting Data with CSS
#returns a list ([]); there may be more than one match
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
#extract the title text; note that a list is returned
>>> response.css('title::text').getall()
['Quotes to Scrape']
#this extracts the whole title element; again a list
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
#extract the text of the first matching title
>>> response.css('title::text').get()
'Quotes to Scrape'
#same as above, but [0] raises an error when there is no match, so plain .get() is preferred
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
#extract with a regular expression
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
#extract with a regular expression
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
#extract with a regular expression
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
(4) SelectorGadget for Finding CSS Selectors
A great tool that quickly finds the CSS selector for a visually selected element; it works in many browsers.
(5) XPath as an Alternative to CSS
Under the hood, CSS selectors are converted to XPath, but CSS is clearly the more popular choice.
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
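One thing XPath can do that CSS cannot is match on text content. As an illustrative query against this page (the exact output shown is an assumption, not taken from the original transcript):
```python
>>> response.xpath('//a[contains(text(), "Next")]/@href').get()
'/page/2/'
```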
4. Shell Practice
The HTML of http://quotes.toscrape.com is structured like this:
<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
scrapy shell 'http://quotes.toscrape.com'
>>> response.css("div.quote")
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
>>> for quote in response.css("div.quote"):
... text = quote.css("span.text::text").get()
... author = quote.css("small.author::text").get()
... tags = quote.css("div.tags a.tag::text").getall()
... print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
... a few more of these, omitted for brevity
>>>