Scrapy: Distributed Crawling

What is scrapy_redis

scrapy_redis: Redis-based components for Scrapy

GitHub repository: https://github.com/rmax/scrapy-redis

Review of the Scrapy workflow

[Figure: Scrapy workflow diagram]

scrapy_redis workflow

    # Clone the scrapy_redis source from GitHub
    git clone https://github.com/rolando/scrapy-redis.git

The settings file in the scrapy_redis example project

    # Scrapy settings for example project
    #
    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    # http://doc.scrapy.org/topics/settings.html
    #
    SPIDER_MODULES = ['example.spiders']
    NEWSPIDER_MODULE = 'example.spiders'
    USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
    # Class used to deduplicate Request objects
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Scheduler that keeps the request queue in Redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether to keep the queue contents in Redis. When False, the request
    # queue and the dupefilter are cleared when the spider closes.
    SCHEDULER_PERSIST = True
    #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
    #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
    #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
    ITEM_PIPELINES = {
        'example.pipelines.ExamplePipeline': 300,
        # Pipeline provided by scrapy_redis that saves items to Redis
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }
    LOG_LEVEL = 'DEBUG'
    # Introduce an artificial delay to make use of parallelism and speed up
    # the crawl.
    DOWNLOAD_DELAY = 1
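The three commented-out SCHEDULER_QUEUE_CLASS options differ in the order in which pending requests are handed back for crawling. The sketch below models only that ordering with in-memory Python structures. It does not use scrapy_redis itself, and the function names (`pop_fifo`, `pop_lifo`, `pop_priority`) are hypothetical stand-ins for SpiderQueue, SpiderStack, and SpiderPriorityQueue.

```python
import heapq
from collections import deque


def pop_fifo(urls):
    """Queue order (like SpiderQueue): first request in, first out."""
    queue = deque(urls)
    return [queue.popleft() for _ in range(len(queue))]


def pop_lifo(urls):
    """Stack order (like SpiderStack): last request in, first out."""
    stack = list(urls)
    return [stack.pop() for _ in range(len(stack))]


def pop_priority(scored_urls):
    """Priority order (like SpiderPriorityQueue): highest priority first.

    scored_urls is a list of (priority, url) pairs.
    """
    # heapq is a min-heap, so the priority is negated. The insertion index
    # breaks ties in this sketch so that tuples never compare URLs.
    heap = [(-prio, i, url) for i, (prio, url) in enumerate(scored_urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


urls = ["a", "b", "c"]
fifo_order = pop_fifo(urls)
lifo_order = pop_lifo(urls)
priority_order = pop_priority([(0, "a"), (10, "b"), (5, "c")])
```

With a FIFO queue the crawl tends toward breadth-first order, with a stack toward depth-first order, and with a priority queue the order follows each request's priority.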

Running scrapy_redis

In the dmoz example spider:

    allowed_domains = ['dmoztools.net']
    start_urls = ['http://www.dmoztools.net/']

Then start the crawl:

    scrapy crawl dmoz

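After the run finishes, Redis contains three new keys

  1. dmoz:requests holds the Request objects waiting to be crawled
  2. dmoz:items holds the scraped items
  3. dmoz:dupefilter holds the fingerprints of the requests that have been seen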
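To make the role of dmoz:dupefilter concrete, here is a minimal sketch of fingerprint-based deduplication. A Python set stands in for the Redis set, and the `fingerprint` helper is a simplified assumption (SHA-1 over the method and URL), not the exact fingerprint function that Scrapy uses. The class name `InMemoryDupeFilter` is also hypothetical.

```python
import hashlib


def fingerprint(method, url):
    """Simplified request fingerprint: SHA-1 of the method and URL."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    return digest.hexdigest()


class InMemoryDupeFilter:
    """Stand-in for the dupefilter; a Python set replaces the Redis set."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url):
        """Return True if this request was already recorded."""
        fp = fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


dupefilter = InMemoryDupeFilter()
first = dupefilter.request_seen("GET", "http://www.dmoztools.net/")
second = dupefilter.request_seen("GET", "http://www.dmoztools.net/")
```

Because the set of fingerprints lives in Redis rather than in one process, every spider instance that shares the same Redis server skips requests that any other instance has already recorded.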


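Dangdang spider

  • Requirement: scrape book information from Dangdang
  • Goal: for each book on the Dangdang book site, scrape its top-level category, mid-level category, and subcategory, plus the subcategory URL, the image name, and the image URL
  • URL: http://book.dangdang.com/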
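The fields listed in the goal can be sketched as a simple record type. The class and field names below (`DangdangBook`, `big_category`, and so on) are hypothetical, and a plain dataclass is used instead of a scrapy.Item so that the example runs without Scrapy installed. The values are placeholders, not scraped data.

```python
from dataclasses import dataclass, asdict


@dataclass
class DangdangBook:
    """One scraped record; field names are illustrative only."""
    big_category: str        # top-level category
    middle_category: str     # mid-level category
    small_category: str      # subcategory
    small_category_url: str  # URL of the subcategory page
    image_name: str          # name of the image
    image_url: str           # URL of the image


# Placeholder values, not real scraped data.
book = DangdangBook(
    big_category="<top-level category>",
    middle_category="<mid-level category>",
    small_category="<subcategory>",
    small_category_url="http://example.com/subcategory",
    image_name="<image name>",
    image_url="http://example.com/image.jpg",
)

# A dict form like this is what an item pipeline would typically serialize.
record = asdict(book)
```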