Downloader Middleware

Middleware information

  1. "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
  2. "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
  3. "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
  4. "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
  5. "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
  6. "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
  7. "scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware": 560,
  8. "scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": 580,
  9. "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
  10. "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600,
  11. "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
  12. "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
  13. "scrapy.downloadermiddlewares.stats.DownloaderStats": 850,
  14. "scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware": 900

1 Inside the project, run a command to inspect the downloader middleware information

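One way to print these defaults from Python, assuming only that Scrapy is installed:

  # Print the built-in downloader middlewares, sorted by priority.
  from scrapy.settings import default_settings

  for mw, priority in sorted(
      default_settings.DOWNLOADER_MIDDLEWARES_BASE.items(), key=lambda kv: kv[1]
  ):
      print(priority, mw)

Running `scrapy settings --get=DOWNLOADER_MIDDLEWARES_BASE` inside the project should print the same dictionary.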

2 Find the downloader middleware information and keep a copy of it in the settings file


Interpreting the information

1 The smaller the number, the closer the middleware sits to the engine, and the earlier its process_request runs (process_response then runs in the reverse order)


2 How a downloader middleware is defined in middlewares.py


3 A brief look at what each middleware does

  1. """
  2. "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100, # 机器人协议中间件
  3. "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300, #http身份验证中间件
  4. "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350, # 下载超时中间件
  5. "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400, # 默认请求头
  6. "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500, # 用户代理中间件
  7. "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550, # 重新尝试中间件
  8. "scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware": 560, # ajax抓取中间件
  9. "scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": 580, # 始终使用字符串作为原因
  10. "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590, # 允许网站数据压缩发送
  11. "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600, # 重定向
  12. "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700, # cookie 中间件
  13. "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750, # 代理中间件
  14. "scrapy.downloadermiddlewares.stats.DownloaderStats": 850, # 通过此中间件存储通过他的所有请求 响应
  15. "scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware": 900 # 缓存中间件
  16. """

4 Commonly used built-in middlewares

  1. CookiesMiddleware: cookie support, switched on and off with the COOKIES_ENABLED setting
  2. HttpProxyMiddleware: HTTP proxy support, configured by setting request.meta['proxy'] (see the sketch below)
  3. UserAgentMiddleware: sets the User-Agent header on outgoing requests
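As a quick illustration of the second item, a minimal spider sketch that routes one request through a proxy (the proxy address is a placeholder, not a working endpoint):

  import scrapy

  class ProxyDemoSpider(scrapy.Spider):
      name = "proxy_demo"

      def start_requests(self):
          # HttpProxyMiddleware honors request.meta['proxy'];
          # the address below is a placeholder.
          yield scrapy.Request(
              "https://httpbin.org/ip",
              meta={"proxy": "http://127.0.0.1:8888"},
          )

      def parse(self, response):
          self.logger.info(response.text)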

5 The three key downloader middleware methods explained
First: process_request

  def process_request(self, request, spider):
      # Called for each request that goes through the downloader middleware.
      # Purpose: process a request. Arguments: the request and spider objects.
      # Must either (one of the following):
      # - return None: the request is passed on to the next middleware
      # - return a Response object: the remaining process_request methods and
      #   the download itself are skipped; the response enters the
      #   process_response chain and eventually reaches the engine
      # - return a Request object: the chain stops and the new request is
      #   handed back to the engine to be rescheduled
      # - raise IgnoreRequest: the process_exception() methods of the
      #   installed downloader middlewares are called
      return None

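A sketch of the Response short-circuit: answer matching requests straight from the middleware (the URL test is an arbitrary demo condition):

  from scrapy.http import HtmlResponse

  class LocalAnswerMiddleware:

      def process_request(self, request, spider):
          if "skip-download" in request.url:  # arbitrary demo condition
              # Returning a Response skips the downloader entirely; the
              # response goes straight into the process_response chain.
              return HtmlResponse(
                  url=request.url,
                  body=b"<html><body>served from middleware</body></html>",
                  encoding="utf-8",
                  request=request,
              )
          return None  # all other requests continue normally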
Second: process_response

  def process_response(self, request, response, spider):
      # Called with the response returned from the downloader.
      # Purpose: process a response. Arguments: request, response, spider.
      # Must either:
      # - return a Response object: it is passed on to the next middleware
      # - return a Request object: the chain stops and the request is handed
      #   back to the engine to be rescheduled
      # - raise IgnoreRequest: the errback of the request is called
      return response

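A sketch of a process_response that re-queues rate-limited responses (treating HTTP 429 as retryable is an assumption made for the demo):

  class RetryOn429Middleware:

      def process_response(self, request, response, spider):
          if response.status == 429:
              spider.logger.info("429 received, re-queuing %s", request.url)
              # replace() clones the request; dont_filter=True stops the
              # dupefilter from discarding the repeated URL.
              return request.replace(dont_filter=True)
          return response  # everything else continues down the chain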

Third: process_exception

  def process_exception(self, request, exception, spider):
      # Called when a download handler or a process_request()
      # (from other downloader middleware) raises an exception.
      # Purpose: handle the exception.
      # Must either:
      # - return None: the process_exception() methods of the remaining
      #   middlewares continue to be called
      # - return a Response object: stops the process_exception() chain and
      #   feeds the response into the process_response chain
      # - return a Request object: stops the chain and hands the request
      #   back to the engine to be rescheduled
      pass

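A sketch of using process_exception to retry timeouts (TimeoutError here is the Twisted exception Scrapy's downloader raises on connection timeouts):

  from twisted.internet.error import TimeoutError

  class TimeoutRetryMiddleware:

      def process_exception(self, request, exception, spider):
          if isinstance(exception, TimeoutError):
              spider.logger.warning("timeout on %s, rescheduling", request.url)
              # Returning a Request stops the exception chain; dont_filter
              # keeps the dupefilter from dropping the repeated URL.
              return request.replace(dont_filter=True)
          return None  # let the other middlewares inspect the exception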
Comparing the three methods side by side
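Side by side, the return-value semantics of the three methods:

  - return None: allowed in process_request and process_exception; processing simply continues with the next middleware.
  - return a Response: from process_request it skips the download; from any of the three methods the response flows into the process_response chain.
  - return a Request: from any of the three methods the current chain stops and the request goes back to the engine for rescheduling.
  - raise IgnoreRequest: from process_request the process_exception chain runs; from process_response the request's errback runs.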

Custom User-Agent middleware

A pool of different user agents is used to make requests look like real browser visits.


Define the custom user-agent middleware in middlewares.py

  class User_AgentDownloaderMiddleware(object):

      def process_request(self, request, spider):
          # Called for each request that goes through the downloader
          # middleware.
          # Must either:
          # - return None: continue processing this request
          # - or return a Response object
          # - or return a Request object
          # - or raise IgnoreRequest: process_exception() methods of
          #   installed downloader middleware will be called
          return None

      def process_response(self, request, response, spider):
          # Called with the response returned from the downloader.
          # Must either:
          # - return a Response object
          # - return a Request object
          # - or raise IgnoreRequest
          return response

      def process_exception(self, request, exception, spider):
          # Called when a download handler or a process_request()
          # (from other downloader middleware) raises an exception.
          # Must either:
          # - return None: continue processing this exception
          # - return a Response object: stops process_exception() chain
          # - return a Request object: stops process_exception() chain
          pass


Copy the browser user-agent information into the settings file, then import it into middlewares through the package.

Pay special attention to how the import path is written: to import a file from the same directory, the module name alone is enough.
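A minimal sketch of the two pieces, assuming a USER_AGENT_LIST constant in settings.py (the constant name and the strings are illustrative):

  # settings.py -- a pool of browser user-agent strings
  USER_AGENT_LIST = [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
  ]

  # middlewares.py -- same package, so the project-level import works
  import random

  from github.settings import USER_AGENT_LIST  # adjust 'github' to your project name

  class User_AgentDownloaderMiddleware(object):

      def process_request(self, request, spider):
          # Stamp every outgoing request with a randomly chosen user agent.
          request.headers["User-Agent"] = random.choice(USER_AGENT_LIST)
          return None  # let the request continue down the middleware chain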

The same procedure as above

Copy the proxy IP list into settings, create another custom proxy downloader middleware in middlewares.py, and import the list from settings into that file.

Detail: double-check the proxy IP addresses.

  class ProxyDownloaderMiddleware(object):  # illustrative class name

      def process_request(self, request, spider):
          # Called for each request that goes through the downloader
          # middleware.
          # Must either:
          # - return None: continue processing this request
          # - or return a Response object
          # - or return a Request object
          # - or raise IgnoreRequest: process_exception() methods of
          #   installed downloader middleware will be called
          return None

      def process_response(self, request, response, spider):
          # Called with the response returned from the downloader.
          # Must either:
          # - return a Response object
          # - return a Request object
          # - or raise IgnoreRequest
          return response

      def process_exception(self, request, exception, spider):
          # Called when a download handler or a process_request()
          # (from other downloader middleware) raises an exception.
          # Must either:
          # - return None: continue processing this exception
          # - return a Response object: stops process_exception() chain
          # - return a Request object: stops process_exception() chain
          pass

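A sketch of the proxy logic under the same assumptions (PROXY_LIST is an illustrative name and the addresses are placeholders, not working proxies):

  # settings.py -- placeholder proxy addresses
  PROXY_LIST = [
      "http://127.0.0.1:8888",
      "http://127.0.0.1:8889",
  ]

  # middlewares.py
  import random

  from github.settings import PROXY_LIST  # adjust 'github' to your project name

  class ProxyDownloaderMiddleware(object):

      def process_request(self, request, spider):
          # The downloader reads request.meta['proxy'] when opening the connection.
          request.meta["proxy"] = random.choice(PROXY_LIST)
          return None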

Enabling the downloader middleware

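Enabling the two custom middlewares sketched above in settings.py would look roughly like this (the priorities are arbitrary mid-range values):

  DOWNLOADER_MIDDLEWARES = {
      'github.middlewares.User_AgentDownloaderMiddleware': 543,
      'github.middlewares.ProxyDownloaderMiddleware': 544,
  }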

Disabling a specific downloader middleware

  DOWNLOADER_MIDDLEWARES = {
      'github.middlewares.GithubDownloaderMiddleware': 543,
      'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # None disables the middleware
  }


Setting the priority of a specific downloader middleware

  DOWNLOADER_MIDDLEWARES = {
      'github.middlewares.GithubDownloaderMiddleware': 543,
      'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 1024,
      'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
  }


Commonly used Scrapy settings

Part 1

  BOT_NAME = 'baidu'  # name of the Scrapy project
  SPIDER_MODULES = ['baidu.spiders']  # modules where spiders are looked up
  NEWSPIDER_MODULE = 'baidu.spiders'  # module where the genspider command creates new spiders


Part 2

  # Obey robots.txt rules
  ROBOTSTXT_OBEY = False
  # Configure maximum concurrent requests performed by Scrapy (default: 16)
  CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests
  # Configure a delay for requests for the same website (default: 0)
  # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
  # See also autothrottle settings and docs
  DOWNLOAD_DELAY = 3


Part 3

  # The download delay setting will honor only one of:
  # CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests allowed per domain
  CONCURRENT_REQUESTS_PER_IP = 16  # concurrent requests allowed per IP
  # Disable cookies (enabled by default)
  COOKIES_ENABLED = False
  # Disable Telnet Console (enabled by default)
  # TELNETCONSOLE_ENABLED = False  # toggles the Telnet monitoring console


Part 4

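In the stock settings.py template, the block between the previous and the next part covers the default request headers and the two middleware dicts; presumably this is what was shown here:

  # Override the default request headers:
  # DEFAULT_REQUEST_HEADERS = {
  #     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  #     'Accept-Language': 'en',
  # }

  # Enable or disable spider middlewares
  # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
  # SPIDER_MIDDLEWARES = {
  #     'baidu.middlewares.BaiduSpiderMiddleware': 543,
  # }

  # Enable or disable downloader middlewares
  # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
  # DOWNLOADER_MIDDLEWARES = {
  #     'baidu.middlewares.BaiduDownloaderMiddleware': 543,
  # }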

Part 5

  # Enable or disable extensions
  # See https://docs.scrapy.org/en/latest/topics/extensions.html
  # Turn the Telnet monitoring console on or off
  EXTENSIONS = {
      'scrapy.extensions.telnet.TelnetConsole': None,
  }

  # Configure item pipelines
  # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
  ITEM_PIPELINES = {
      'baidu.pipelines.BaiduPipeline': 300,  # the number is the priority; lower numbers run earlier
  }

  # Enable and configure the AutoThrottle extension (disabled by default)
  # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
  # AUTOTHROTTLE_ENABLED = True
  # The initial download delay
  AUTOTHROTTLE_START_DELAY = 5
  # The maximum download delay to be set in case of high latencies
  AUTOTHROTTLE_MAX_DELAY = 60
  # The average number of requests Scrapy should be sending in parallel to
  # each remote server
  # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
  # Enable showing throttling stats for every response received:
  AUTOTHROTTLE_DEBUG = False


Project summary

This project did not actually build a data-collection pipeline; the point was to study the three main downloader middleware methods, then configure custom downloader middlewares, and finally walk through the most important settings.

This section supplements the earlier projects. Until now, my use of settings stopped at adding a download delay, turning off robots.txt compliance, setting request headers, and enabling logging. In this chapter we not only wrote custom downloader middlewares and set up an IP-proxy pool and a user-agent pool for the crawler, but also ran into a feature called the Telnet console; I don't yet know exactly what it does, but it is a start.