The configuration file of a Scrapy project is scrapy.cfg. Its basic structure is as follows:
[settings]
default = ImageCrawler.settings1
[deploy]
;url = http://localhost:6800/
project = ImageCrawler
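Since scrapy.cfg is an INI-style file, its contents can be inspected with Python's standard configparser module. The following is a minimal sketch for illustration only: the helper name load_cfg is hypothetical, and the file contents are embedded as a string rather than read from disk.

```python
import configparser

# Contents mirror the scrapy.cfg shown above.
SCRAPY_CFG = """
[settings]
default = ImageCrawler.settings1

[deploy]
;url = http://localhost:6800/
project = ImageCrawler
"""


def load_cfg(text):
    """Parse scrapy.cfg text into a ConfigParser object (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


cfg = load_cfg(SCRAPY_CFG)

# The [settings] section maps a project name to a settings module path.
default_settings = cfg.get("settings", "default")

# Lines starting with ';' are comments, so the url option is not present.
deploy_options = dict(cfg.items("deploy"))

print(default_settings)
print(deploy_options)
```

With the url line commented out, only the project option is visible in the [deploy] section.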
Switching between multiple configurations
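If the project contains several settings modules, for example settings1, settings2, and settings3 inside a settings package, they can be registered in scrapy.cfg: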
[settings]
default = ImageCrawler.settings.settings1
project1 = ImageCrawler.settings.settings1
project2 = ImageCrawler.settings.settings2
project3 = ImageCrawler.settings.settings3
[deploy]
project = ImageCrawler
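Switching configurations requires setting the SCRAPY_PROJECT environment variable. On Windows, the command to set it temporarily (for the current command prompt session only) is: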
set SCRAPY_PROJECT=project1
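Once it is set, you can query BOT_NAME to check whether the switch worked (this assumes each settings module defines a distinct BOT_NAME):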
scrapy settings --get BOT_NAME
Then run the crawl as usual:
scrapy crawl images1
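The switch works because the value of SCRAPY_PROJECT is used as a key in the [settings] section of scrapy.cfg, with "default" used when the variable is not set. The sketch below imitates that lookup with the standard library only. It is a simplified illustration, not Scrapy's own code; the function name resolve_settings_module is hypothetical, and returning None for an unknown key is a choice made for this sketch.

```python
import configparser

# The [settings] section from the multi-configuration scrapy.cfg above.
SCRAPY_CFG = """
[settings]
default = ImageCrawler.settings.settings1
project1 = ImageCrawler.settings.settings1
project2 = ImageCrawler.settings.settings2
project3 = ImageCrawler.settings.settings3
"""


def resolve_settings_module(cfg_text, environ):
    """Return the settings module selected by SCRAPY_PROJECT, or None.

    Hypothetical helper: `environ` is any mapping, e.g. os.environ.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the "default" entry when the variable is not set.
    project = environ.get("SCRAPY_PROJECT", "default")
    if parser.has_option("settings", project):
        return parser.get("settings", project)
    return None


print(resolve_settings_module(SCRAPY_CFG, {"SCRAPY_PROJECT": "project2"}))
print(resolve_settings_module(SCRAPY_CFG, {}))
```

Passing os.environ as the second argument would show which module the current shell session selects.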
The settings file in detail
Commonly used settings are:
# -*- coding: utf-8 -*-
# Bot name
BOT_NAME = 'ImageCrawler1'
# Spider modules
SPIDER_MODULES = ['ImageCrawler.spiders']
NEWSPIDER_MODULE = 'ImageCrawler.spiders'
# User agent
# USER_AGENT = 'ImageCrawler (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Default request headers
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }
# Spider middlewares
# SPIDER_MIDDLEWARES = {
#     'ImageCrawler.middlewares.ImageCrawlerSpiderMiddleware': 543,
# }
# Downloader middlewares
# DOWNLOADER_MIDDLEWARES = {
#     'ImageCrawler.middlewares.MyCustomDownloaderMiddleware': 543,
# }
# Extensions
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Item pipelines
ITEM_PIPELINES = {
    'ImageCrawler.pipelines.ImageCrawlerPipeline': 300,
}
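The integer assigned to each entry in ITEM_PIPELINES sets the order in which items pass through the pipelines, from lower values to higher, and a value of None (as in the EXTENSIONS example) disables a component. The sketch below illustrates that rule with plain Python. It is not Scrapy's implementation; ordered_components and the two extra pipeline paths are hypothetical names introduced only for this example.

```python
def ordered_components(components):
    """Return enabled component paths sorted by their order value.

    Hypothetical helper: entries whose value is None are treated as disabled.
    """
    enabled = {path: order for path, order in components.items() if order is not None}
    return sorted(enabled, key=enabled.get)


ITEM_PIPELINES = {
    # Hypothetical second pipeline, listed first to show that dict order is irrelevant.
    "ImageCrawler.pipelines.HypotheticalStoragePipeline": 800,
    "ImageCrawler.pipelines.ImageCrawlerPipeline": 300,
    # Hypothetical disabled pipeline.
    "ImageCrawler.pipelines.HypotheticalDisabledPipeline": None,
}

for path in ordered_components(ITEM_PIPELINES):
    print(path)
```

Here the pipeline with value 300 comes before the one with value 800, and the entry set to None is skipped.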
To read the value of a single setting, use the command:
scrapy settings --get SETTING_NAME
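The set command shown earlier is specific to the Windows command prompt. One platform-independent way to run the same check is to pass SCRAPY_PROJECT to a child process from Python. This is a rough sketch, assuming the scrapy executable is on PATH and the working directory is the project root (the directory containing scrapy.cfg); build_env and get_setting are hypothetical helper names.

```python
import os
import subprocess


def build_env(project, base_env=None):
    """Return a copy of the environment with SCRAPY_PROJECT set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["SCRAPY_PROJECT"] = project
    return env


def get_setting(name, project, cwd="."):
    """Run `scrapy settings --get NAME` under the given project and return its output.

    Assumes scrapy is installed and `cwd` is the directory containing scrapy.cfg.
    """
    result = subprocess.run(
        ["scrapy", "settings", "--get", name],
        env=build_env(project),
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A call such as get_setting("BOT_NAME", "project1") would then return the bot name defined by the settings module registered under project1, without changing the environment of the calling shell.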