The Spider Class
The Spider class defines how one site (or a group of sites) is crawled: the crawling actions to take (for example, whether to follow links) and how structured data (Items) is extracted from the pages.
In other words, a Spider is where you define the crawling behaviour and the page-parsing logic. `scrapy.Spider` is the most basic spider class, and every spider you write must inherit from it.
Main methods and their call order
- `__init__()`: initializes the spider name and the `start_urls` list.
- `start_requests()`: calls `make_requests_from_url()` to build a `Request` object for each start URL and hands it to the Scrapy downloader, which fetches the page and returns a `Response`.
- `parse()`: parses the `Response` and yields Items and/or new `Request` objects (each Request carries a callback, `parse()` by default). Items are passed to the Item Pipeline for persistence; Requests go back to the downloader and their responses are handled by the specified callback. This cycle repeats until all the data has been processed.

Source code (for reference):
```python
# Excerpt from scrapy/spiders/__init__.py (abridged)
import logging
import warnings
from typing import Optional

from scrapy import signals
from scrapy.http import Request
from scrapy.utils.misc import method_is_overridden
from scrapy.utils.trackref import object_ref


class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class."""

    name: Optional[str] = None
    custom_settings: Optional[dict] = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError(f"{type(self).__name__} must have a name")
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    @property
    def logger(self):
        logger = logging.getLogger(self.name)
        return logging.LoggerAdapter(logger, {'spider': self})

    def log(self, message, level=logging.DEBUG, **kw):
        """Log the given message at the given log level

        This helper wraps a log call to the logger within the spider, but you
        can use it directly (e.g. Spider.logger.info('msg')) or use any other
        Python logger too.
        """
        self.logger.log(level, message, **kw)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider

    def _set_crawler(self, crawler):
        self.crawler = crawler
        self.settings = crawler.settings
        crawler.signals.connect(self.close, signals.spider_closed)

    # Reads the URLs in start_urls and builds a Request object for each,
    # handing them to the Scrapy engine to be downloaded into Responses.
    # Called only once, when the crawl starts.
    def start_requests(self):
        cls = self.__class__
        if not self.start_urls and hasattr(self, 'start_url'):
            raise AttributeError(
                "Crawling could not start: 'start_urls' not found "
                "or empty (but found 'start_url' attribute instead, "
                "did you miss an 's'?)")
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead "
                f"(see {cls.__module__}.{cls.__name__}).",
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                # dont_filter=True tells the scheduler to skip de-duplication.
                yield Request(url, dont_filter=True)

    # Deprecated: override start_requests() instead.
    def make_requests_from_url(self, url):
        """This method is deprecated."""
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated: "
            "it will be removed and not be called by the default "
            "Spider.start_requests method in future Scrapy releases. "
            "Please override Spider.start_requests method instead."
        )
        return Request(url, dont_filter=True)

    def _parse(self, response, **kwargs):
        return self.parse(response, **kwargs)

    # Default callback for Requests; processes the returned response and
    # yields Items or further Requests. Subclasses must implement it.
    def parse(self, response, **kwargs):
        raise NotImplementedError(
            f'{self.__class__.__name__}.parse callback is not defined')
```
Main attributes and methods
- name
The attribute that defines the spider's name.
Naming convention: if the spider crawls the domain mywebsite.com, it is usually named mywebsite.
- allowed_domains
An optional list of the domains this spider is allowed to crawl.
- start_urls
The list of initial URLs. When no particular URLs are specified, the spider starts crawling from the URLs in this list.
- start_requests(self)
This method must return an iterable containing the first Request objects the spider will use to crawl.
- parse(self, response)
The default callback for Requests that do not specify their own callback. It processes the returned response and generates Items and/or further Request objects.
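The attributes above tie together in the constructor shown in the source excerpt. The sketch below reimplements only the name-checking slice of `Spider.__init__` for illustration, and shows a subclass setting `name`, `allowed_domains`, and `start_urls` and overriding `start_requests()` (the recommended replacement for the deprecated `make_requests_from_url()`). `Request` is a simplified stand-in for `scrapy.Request`, and the spider and site names are made up:

```python
class Request:
    """Simplified stand-in for scrapy.Request."""
    def __init__(self, url, callback=None, dont_filter=False):
        self.url, self.callback, self.dont_filter = url, callback, dont_filter

class MiniSpider:
    """Reimplements just the __init__ name check from the source above."""
    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError(f"{type(self).__name__} must have a name")
        self.__dict__.update(kwargs)       # e.g. -a category=fiction lands here
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

class BooksSpider(MiniSpider):
    name = "books"                             # conventionally the site's domain
    allowed_domains = ["books.example.com"]    # off-site links get filtered out
    start_urls = ["http://books.example.com/catalog"]

    def start_requests(self):
        # Override the default one-Request-per-start-URL behaviour,
        # e.g. to attach an explicit callback to each request.
        for url in self.start_urls:
            yield Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        ...

spider = BooksSpider(category="fiction")       # extra kwargs become attributes
requests = list(spider.start_requests())
print(spider.name, spider.category, requests[0].url)
```

Instantiating `MiniSpider()` directly, with no `name` set anywhere, raises `ValueError`, which mirrors the check in the real base class.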
