Scrapy's project configuration file is scrapy.cfg. Its basic structure is as follows:

[settings]
default = ImageCrawler.settings1

[deploy]
;url = http://localhost:6800/
project = ImageCrawler
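
Since scrapy.cfg uses INI syntax, its structure can be inspected with Python's standard configparser module. The following is a minimal sketch, not part of Scrapy itself; the helper name read_cfg and the inline sample text are assumptions made for illustration.

```python
import configparser

# Sample text mirroring the scrapy.cfg shown above.
SAMPLE_CFG = """
[settings]
default = ImageCrawler.settings1

[deploy]
;url = http://localhost:6800/
project = ImageCrawler
"""


def read_cfg(text):
    """Parse scrapy.cfg-style INI text into a dict of sections (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


cfg = read_cfg(SAMPLE_CFG)
print(cfg["settings"]["default"])  # dotted path of the settings module
print(cfg["deploy"])               # the ';url' line is a comment, so only 'project' remains
```

The [settings] section maps a name to the dotted path of a settings module, and [deploy] holds deployment options; a line starting with ; is treated as a comment.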

Switching between multiple settings

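If the project contains several settings modules, for example with the following directory structure:
[Figure: project directory structure (image.png)]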

you can register each of them under its own name in scrapy.cfg:

[settings]
default = ImageCrawler.settings.settings1
project1 = ImageCrawler.settings.settings1
project2 = ImageCrawler.settings.settings2
project3 = ImageCrawler.settings.settings3

[deploy]
project = ImageCrawler
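
The lookup implied by this layout can be sketched as follows: the value of SCRAPY_PROJECT is used as a key in the [settings] section, falling back to default when the variable is unset. This is an illustrative approximation, not Scrapy's actual code; resolve_settings_module is a hypothetical helper name.

```python
import configparser
import os

# The [settings] section from the scrapy.cfg above.
CFG_TEXT = """
[settings]
default = ImageCrawler.settings.settings1
project1 = ImageCrawler.settings.settings1
project2 = ImageCrawler.settings.settings2
project3 = ImageCrawler.settings.settings3
"""


def resolve_settings_module(cfg_text, environ=None):
    """Return the settings module path selected by SCRAPY_PROJECT (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the 'default' entry when the variable is not set.
    project = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", project)


print(resolve_settings_module(CFG_TEXT, {"SCRAPY_PROJECT": "project2"}))
print(resolve_settings_module(CFG_TEXT, {}))
```

In this sketch, a name that has no entry in [settings] raises configparser.NoOptionError, which makes a mistyped project name easy to spot.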

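To switch settings, set the SCRAPY_PROJECT environment variable. On Windows, the following command sets it temporarily, for the current Command Prompt session only: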

set SCRAPY_PROJECT=project1

Once it is set, print the BOT_NAME of the active settings to check that the switch took effect:

scrapy settings --get BOT_NAME

Then run the crawl as usual:

scrapy crawl images1
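
The same steps can also be driven from a script, which avoids shell-specific syntax for setting the variable. The sketch below only builds the command and environment; build_crawl_invocation is a hypothetical helper, and the project and spider names are the ones used above.

```python
import os


def build_crawl_invocation(project, spider, base_env=None):
    """Return (command, environment) for crawling under a given SCRAPY_PROJECT."""
    # Copy the environment so the caller's process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["SCRAPY_PROJECT"] = project
    return ["scrapy", "crawl", spider], env


# An empty base_env is used here only to keep the demonstration deterministic.
cmd, env = build_crawl_invocation("project1", "images1", base_env={})
print(cmd, env)

# To actually launch the crawl (requires Scrapy on PATH and the real environment):
#   import subprocess
#   cmd, env = build_crawl_invocation("project1", "images1")
#   subprocess.run(cmd, env=env, check=True)
```

Because the variable is set only in the child process's environment, different projects can be crawled one after another without changing the shell session.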

The settings file in detail

The most common settings are shown below:

# -*- coding: utf-8 -*-

# Bot (project) name
BOT_NAME = 'ImageCrawler1'

# Modules where spiders are looked up and where new spiders are created
SPIDER_MODULES = ['ImageCrawler.spiders']
NEWSPIDER_MODULE = 'ImageCrawler.spiders'

# User agent
# USER_AGENT = 'ImageCrawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Default request headers
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Spider middlewares
# SPIDER_MIDDLEWARES = {
#    'ImageCrawler.middlewares.ImageCrawlerSpiderMiddleware': 543,
# }

# Downloader middlewares
# DOWNLOADER_MIDDLEWARES = {
#    'ImageCrawler.middlewares.MyCustomDownloaderMiddleware': 543,
# }

# Extensions
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Item pipelines
ITEM_PIPELINES = {
    'ImageCrawler.pipelines.ImageCrawlerPipeline': 300,
}
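
The integer attached to each entry in ITEM_PIPELINES determines the order in which pipelines run: lower values run first. The sketch below illustrates that ordering together with a pass-through pipeline class. The second pipeline path, the helper pipeline_order, and the body of process_item are assumptions made for illustration; the real ImageCrawlerPipeline is not shown in this article.

```python
class ImageCrawlerPipeline:
    """Placeholder pipeline: passes every item through unchanged."""

    def process_item(self, item, spider):
        # A real pipeline would validate, transform, or store the item here.
        return item


ITEM_PIPELINES = {
    "ImageCrawler.pipelines.HypotheticalExportPipeline": 800,  # hypothetical entry
    "ImageCrawler.pipelines.ImageCrawlerPipeline": 300,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority value (hypothetical helper)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(pipeline_order(ITEM_PIPELINES))
print(ImageCrawlerPipeline().process_item({"url": "http://example.com/a.png"}, spider=None))
```

The same numeric-priority convention applies to the middleware dictionaries shown above.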

To read the value of a single setting, use the command:

scrapy settings --get SETTING_NAME
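
Conceptually, a settings module is just a Python module whose uppercase names are treated as settings. The sketch below imitates that lookup on an in-memory module so it runs without a Scrapy project; settings_from_module and the fake module are assumptions for illustration, not Scrapy's API.

```python
import types


def settings_from_module(module):
    """Collect the uppercase attributes of a module as a settings dict (hypothetical helper)."""
    return {name: getattr(module, name) for name in dir(module) if name.isupper()}


# Stand-in for a settings module such as ImageCrawler.settings.settings1.
fake_settings = types.ModuleType("settings1")
fake_settings.BOT_NAME = "ImageCrawler1"
fake_settings.ROBOTSTXT_OBEY = True
fake_settings.helper_value = 42  # lowercase, so it is not treated as a setting

settings = settings_from_module(fake_settings)
print(settings.get("BOT_NAME"))
print(sorted(settings))
```

Inside a real project, the equivalent lookup is what the scrapy settings --get command performs against the active settings module.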

References