1. Tutorial Guide

Note that these may not always be the latest version.
Chinese version (lags behind the English docs):
https://www.osgeo.cn/scrapy/index.html#
English version:
https://docs.scrapy.org/en/latest/

2. Project Practice

(1) Install Scrapy

1. Install Python. Version 3.4 or later is required; 3.7 works well.
2. Add the Python directories to your PATH:
   python\
   python\Scripts
3. Install Scrapy:

```
pip install Scrapy
```
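If the install succeeded, the scrapy command should now be on the PATH; a quick sanity check:

```
scrapy version
scrapy -h    # lists the available subcommands
```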

(2) Project Architecture

[Figure 1: Scrapy architecture diagram]
The figure above is Scrapy's architecture diagram; the components are briefly introduced below.

  • Scrapy Engine: the engine handles the data flow of the whole system and triggers events; it is the core of the framework.
  • Scheduler: the scheduler accepts Requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again.
  • Downloader: the downloader fetches the page content for the Requests sent by the engine and returns the resulting Responses to the Spider.
  • Spiders: spiders process the Responses, extracting the fields they need (the Items) as well as further links for Scrapy to continue crawling.
  • Item Pipeline: the pipeline handles the Items produced by the Spider, cleaning the data and saving what is needed.
  • Downloader Middlewares: downloader middleware mainly handles the requests and responses that pass between the Scrapy engine and the downloader.
  • Spider Middlewares: spider middleware mainly handles the Responses going into the Spider and the Requests coming out of it.

(3) Create a Project

scrapy.exe lives under python\Scripts, so the scrapy command can be run from anywhere:

```
scrapy startproject tutorial
```

This creates a crawler project named tutorial in the current directory.

(4) Crawler Logic

File structure

  • <project name>/ (the project root, containing scrapy.cfg and the project's Python module)

scrapy.cfg

The project's entry-point configuration (which settings module to use, plus deploy targets):

```
[settings]
default = projectX.settings

[deploy]
url = http://localhost:6800/
project = projectX
```
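For reference, a project generated by scrapy startproject typically has the layout sketched below (shown with the projectX name used throughout this section; the nested directory is the project's Python module):

```
projectX/
├── scrapy.cfg          # deploy / settings entry point (shown above)
└── projectX/           # the project's Python module
    ├── __init__.py
    ├── items.py        # item definitions
    ├── middlewares.py  # spider / downloader middleware
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/        # your spiders live here
        └── __init__.py
```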

settings.py

The project configuration file. There are many available settings; the links in the file header point to the full documentation. (A per-spider override example follows the file below.)
```python
# -*- coding: utf-8 -*-

# Scrapy settings for projectX project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random

BOT_NAME = 'projectX'

SPIDER_MODULES = ['projectX.spiders']
NEWSPIDER_MODULE = 'projectX.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'projectX (+http://www.yourdomain.com)'
user_agent_list = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
]
USER_AGENT = random.choice(user_agent_list)

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'projectX.middlewares.ProjectxSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'projectX.middlewares.ProjectxDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'projectX.pipelines.ProjectxPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
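Settings can also be overridden for a single spider through the custom_settings class attribute, which takes precedence over settings.py for that spider. A minimal sketch (the spider name and values here are only examples):

```python
import scrapy

class PoliteSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate per-spider overrides
    name = 'polite'

    # these values override settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
    }

    def parse(self, response):
        pass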

pipelines.py

Used when the data you want is not what the spider scrapes directly, i.e. more complex data — for example, when the spider only obtains a download link and the file still has to be downloaded.
Remember to register the pipeline in ITEM_PIPELINES in settings.py. (A more concrete example follows the generated template below.)

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class ProjectxPipeline:
    def process_item(self, item, spider):
        return item
```
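As noted above, the generated pipeline simply passes items through. A minimal sketch of a pipeline that actually stores data — appending every item to a JSON Lines file — could look like this (the JsonLinesWriterPipeline class name and the items.jl file name are made up for illustration; register the class in ITEM_PIPELINES to activate it):

```python
import json

class JsonLinesWriterPipeline:
    """Hypothetical example pipeline: append each item to items.jl."""

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # serialize the item and keep passing it down the pipeline chain
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```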

middlewares.py

Middleware — refer back to the architecture diagram above to understand where these hooks sit. (A small working example follows the generated template below.)

```python
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals


class ProjectxSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ProjectxDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
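To connect this template with the user_agent_list idea from settings.py, here is a minimal sketch of a downloader middleware that picks a fresh random User-Agent for every request. It assumes you define the list as an uppercase USER_AGENT_LIST setting (Scrapy only loads uppercase names from settings.py into crawler.settings) and register the class in DOWNLOADER_MIDDLEWARES; the class name is illustrative:

```python
import random

class RandomUserAgentMiddleware:
    """Hypothetical example: rotate the User-Agent header per request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # read the (assumed) USER_AGENT_LIST setting; default to an empty list
        return cls(crawler.settings.getlist('USER_AGENT_LIST', []))

    def process_request(self, request, spider):
        # returning None lets the request continue through the middleware chain
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```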

items.py

Defines the structure of the data you scrape. (An example of filling the item from a spider follows the template below.)

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ProjectxItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    # # movie title
    # title = scrapy.Field()
    # # Douban rating
    # star = scrapy.Field()
    # # starring cast
    # Staring = scrapy.Field()
    # # Douban rank
    # rank = scrapy.Field()
    # # description / quote
    # quote = scrapy.Field()
    # # Douban detail page URL
    # url = scrapy.Field()
```
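Once the fields exist, a spider yields populated item instances instead of plain dicts, and the pipeline receives them in process_item. A short sketch using the default ProjectxItem with its single name field (the spider itself is hypothetical):

```python
import scrapy
from projectX.items import ProjectxItem

class ItemDemoSpider(scrapy.Spider):
    # hypothetical spider showing how items are filled and yielded
    name = 'item_demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = ProjectxItem()
            # items behave like dicts with a fixed set of fields
            item['name'] = quote.css('small.author::text').get()
            yield item   # handed to ITEM_PIPELINES
```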

spiders/<project name>.py

The file name is whatever you chose when creating the spider; it does not have to match the project name.
This is the spider's crawling logic: how the start URLs are generated, how pages are parsed, and so on.

```python
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):  # a scrapy.Spider-based crawler
    # Identifies the spider. It must be unique within a project:
    # you cannot give different spiders the same name.
    name = 'example'
    allowed_domains = ['example.com']

    # Shortcut for the start_requests() method, list form:
    # start_urls = ['http://example.com/']

    # Method form: must return an iterable of Requests
    # (you can return a list of requests or write a generator function)
    # from which the spider starts crawling. Subsequent requests are
    # generated successively from these initial requests.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
        # Reading command-line arguments passed at run time:
        # url = 'http://quotes.toscrape.com/'
        # tag = getattr(self, 'tag', None)
        # if tag is not None:
        #     url = url + 'tag/' + tag
        # yield scrapy.Request(url, self.parse)

    # Called with the Response generated for each start URL once its download
    # finishes, passed as the only argument. This method is responsible for
    # parsing the response data, extracting data (producing items), and
    # generating Request objects for further URLs to process.
    def parse(self, response):
        for quote in response.css('div.quote'):
            # scraped items are shown in the log
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }
        # Option 1 for following URLs: build and yield a Request yourself
        # (the most cumbersome way)
        for next_page in response.css('li.next a::attr(href)').getall():
            # the href may be relative, so resolve it against the response URL
            next_page = response.urljoin(next_page)
            # call back into parse for a recursive crawl
            yield scrapy.Request(next_page, callback=self.parse)
        # Option 2: response.follow(href_string)
        # for href in response.css('li.next a::attr(href)'):
        #     yield response.follow(href, callback=self.parse)
        # Option 3: response.follow(element)
        # (the most concise)
        # for a in response.css('li.next a'):
        #     yield response.follow(a, callback=self.parse)
```

(5) Run the Project

```
# Run the quotes spider
scrapy crawl quotes

# Store the scraped data as JSON.
# This generates a quotes.json file in the current directory. For historical
# reasons Scrapy appends to the given file instead of overwriting it, so if you
# run the command a second time without deleting the file first you end up with
# a broken JSON file.
scrapy crawl quotes -o quotes.json

# Passing command-line arguments: -a tag=humor
# They are passed to the spider's __init__ method and become spider attributes.
scrapy crawl quotes -o quotes-humor.json -a tag=humor

# Store the scraped data as JSON Lines.
# JSON Lines is very useful and stream-like: the output file can simply be
# appended to without breaking parsing, and each record is an independent line,
# which makes processing large amounts of data easier.
scrapy crawl quotes -o quotes.jl

# The above covers data the spider scrapes directly.
# Storing more complex data such as images or videos requires an Item Pipeline,
# in pipelines.py.
```
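Besides the scrapy crawl command, a crawl can also be started from an ordinary Python script with CrawlerProcess, which is convenient when embedding Scrapy in other tooling. A minimal sketch, assuming it is run from the project root so the project settings can be found (the file name run_quotes.py is arbitrary):

```python
# run_quotes.py — assumed helper script placed in the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl('quotes')   # spider name, same as `scrapy crawl quotes`
process.start()           # blocks until the crawl finishes
```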

(6) What Actually Happens?

Scrapy schedules the scrapy.Request objects returned by the spider's start_requests method. As it receives the response for each of them, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.
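Because every Request carries its own callback, data can be threaded from one callback to the next. A small sketch using cb_kwargs (available in recent Scrapy versions) to pass the author name scraped on the listing page into the callback that parses the author page; the spider name is illustrative:

```python
import scrapy

class AuthorLinkSpider(scrapy.Spider):
    # hypothetical spider illustrating the request -> callback flow
    name = 'author_links'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            author = quote.css('small.author::text').get()
            href = quote.css('a::attr(href)').get()   # the "(about)" link
            # schedule the author page; pass the name along to the callback
            yield response.follow(href, callback=self.parse_author,
                                  cb_kwargs={'author': author})

    def parse_author(self, response, author):
        yield {
            'author': author,
            'born': response.css('span.author-born-date::text').get(),
        }
```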

(7) Recursive Crawling

You usually don't want to crawl only the handful of links in start_urls or start_requests; more likely you want everything under the site. That means extracting further links from the pages you crawl. (A rule-based CrawlSpider alternative is sketched after the example below.)

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # extract links from the page and follow them
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)
        # response.follow accepts either a string or an element (Selector).
        # Note: the call below is wrong, because css() returns a SelectorList:
        # response.follow(response.css('li.next a'), callback=self.parse)
        # But this, indexing out a single element, works:
        # response.follow(response.css('li.next a')[0], callback=self.parse)
```
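For whole-site crawls Scrapy also offers the rule-based CrawlSpider, which extracts and follows links for you instead of the manual response.follow loop. A minimal sketch of the same crawl written that way (the rule and callback name are chosen for illustration; note that CrawlSpider reserves parse for itself, so the callback must use a different name):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    # follow every /page/N/ link and hand each response to parse_page
    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
```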

(8) Debugging Your Extraction Code

Once a spider starts running it doesn't stop, which makes normal debugging impractical. The Scrapy shell lets you perform a single fetch from the command line, which makes debugging extraction code much simpler.
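If the problem only shows up in the middle of a real crawl, you can also drop into the shell from inside a callback with scrapy.shell.inspect_response; a short sketch (the spider here is hypothetical):

```python
import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    # hypothetical spider: opens an interactive shell for suspicious pages
    name = 'debug_demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        if not response.css('div.quote'):
            # no quotes found? pause the crawl and inspect this response
            inspect_response(response, self)
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
```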

3. Scrapy Shell Tutorial

An interactive shell where you can type code to debug your extraction logic without running a spider. Very useful.

(1) Configuring the Shell

1. Method one

Set the SCRAPY_PYTHON_SHELL environment variable. Despite what the two shell.py files under Python\Lib\site-packages\scrapy and Python\Lib\site-packages\scrapy\commands might suggest, its value is not a path but the name of the Python shell to use, e.g.:

```
SCRAPY_PYTHON_SHELL=bpython
```

2. Method two

Define it in scrapy.cfg:

```
[settings]
shell = bpython
```

(2) Fetching Data Once

```
# Fetch a web page.
# Keep the quotes around the URL: double quotes on Windows, single quotes on Unix-like systems.
scrapy shell 'http://quotes.toscrape.com/page/1/'

# Fetching a local file. Note that the shell checks for URLs before file paths.
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html
# File URI
scrapy shell file:///absolute/path/to/file.html

# This does not work as intended: index.html is treated as a URL,
# with "html" interpreted as if it were a TLD like "com".
scrapy shell index.html
```

The result looks like this:

```
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()            Shell help (print this help)
[s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
[s]   view(response)     View response in a browser
>>>
```

(3) Extracting Data with CSS

```python
# Returns a list ([...]); there may be more than one match.
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

# Extract the title's text; note that the result is a list.
>>> response.css('title::text').getall()
['Quotes to Scrape']

# Extract the whole title element; again a list.
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

# Extract the text of the first title in the list.
>>> response.css('title::text').get()
'Quotes to Scrape'

# Same as above, but [0] raises an error when there is no match, so not recommended.
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').get()
'Quotes to Scrape'

# Regular-expression extraction
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
```

(4) Getting CSS Selectors with SelectorGadget

SelectorGadget is a great tool for quickly finding the CSS selector of a visually selected element, and it works in many browsers.

(5) XPath Works Too, Not Just CSS

Under the hood CSS selectors are converted to XPath, but CSS is clearly the more popular of the two.

```python
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
```
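css() and xpath() selectors can also be chained, which helps when one part of an expression is easier to write in the other syntax. A small shell sketch against the same quotes page (the outputs shown are what the first page would be expected to return):

```python
# CSS to narrow down to the quote blocks, then XPath for the text nodes
>>> response.css('div.quote').xpath('.//span[@class="text"]/text()').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

# attribute extraction works in either syntax
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
>>> response.xpath('//li[@class="next"]/a/@href').get()
'/page/2/'
```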

4. Shell Practice

The HTML structure of http://quotes.toscrape.com looks like this:

```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```
Extracting it in the shell:

```
scrapy shell 'http://quotes.toscrape.com'
>>> response.css("div.quote")
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
... a few more of these, omitted for brevity
>>>
```