The Spider class

The Spider class defines how a site (or group of sites) is crawled, including the crawling actions (for example, whether to follow links) and how to extract structured data (Items) from pages.
In other words, a Spider is where you define your crawling behavior and the methods for parsing pages.
class scrapy.Spider is the base class; every spider you write must inherit from it.

Main methods and their call order

  • __init__(): initializes the spider's name and the start_urls list.
  • start_requests(): generates a Request object for each URL in start_urls and hands it to the Scrapy downloader, which downloads the page and returns a Response. (As the source below shows, it only routes through the deprecated make_requests_from_url if that method has been overridden.)
  • parse(): parses the response and returns Items or Requests (each Request with a callback specified). Items are passed to the Item Pipeline for persistence, while Requests go to the downloader; the specified callback (parse() by default) handles the returned response. This loop continues until all data has been processed.
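The loop described above can be sketched in plain Python. The Request, Response, crawl, and parse names below are hypothetical stand-ins for illustration, not Scrapy's real classes:

```python
from collections import deque

# Hypothetical stand-ins for Scrapy's Request/Response objects (sketch only).
class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

class Response:
    def __init__(self, url):
        self.url = url

def crawl(start_requests, default_callback):
    """Simplified engine loop: 'download' each Request, pass the Response
    to its callback (parse by default), send Items onward, and queue any
    follow-up Requests the callback yields."""
    items = []
    queue = deque(start_requests)
    while queue:
        req = queue.popleft()
        response = Response(req.url)           # stands in for the download step
        callback = req.callback or default_callback
        for result in callback(response):
            if isinstance(result, Request):
                queue.append(result)           # follow-up request, back to queue
            else:
                items.append(result)           # Item -> would go to the pipeline
    return items

def parse(response):
    yield {"url": response.url}                # an extracted "Item"
    if response.url == "http://example.com/page1":
        yield Request("http://example.com/page2")   # follow a link

items = crawl([Request("http://example.com/page1")], parse)
print(items)  # both pages were visited
```

Real Scrapy drives this loop through its engine and scheduler, but the shape is the same: callbacks yield a mix of Items and Requests until the queue is empty.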

    Source reference

    class Spider(object_ref):
        """Base class for scrapy spiders. All spiders must inherit from this
        class.
        """

        name: Optional[str] = None
        custom_settings: Optional[dict] = None

        def __init__(self, name=None, **kwargs):
            if name is not None:
                self.name = name
            elif not getattr(self, 'name', None):
                raise ValueError(f"{type(self).__name__} must have a name")
            self.__dict__.update(kwargs)
            if not hasattr(self, 'start_urls'):
                self.start_urls = []

        @property
        def logger(self):
            logger = logging.getLogger(self.name)
            return logging.LoggerAdapter(logger, {'spider': self})

        def log(self, message, level=logging.DEBUG, **kw):
            """Log the given message at the given log level
            This helper wraps a log call to the logger within the spider, but you
            can use it directly (e.g. Spider.logger.info('msg')) or use any other
            Python logger too.
            """
            self.logger.log(level, message, **kw)

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = cls(*args, **kwargs)
            spider._set_crawler(crawler)
            return spider

        def _set_crawler(self, crawler):
            self.crawler = crawler
            self.settings = crawler.settings
            crawler.signals.connect(self.close, signals.spider_closed)

        # Reads the URLs in start_urls, generates a Request object for each,
        # and hands them to the scrapy engine to download and return
        # Response objects.
        # Called only once, when the spider starts.
        def start_requests(self):
            cls = self.__class__
            if not self.start_urls and hasattr(self, 'start_url'):
                raise AttributeError(
                    "Crawling could not start: 'start_urls' not found "
                    "or empty (but found 'start_url' attribute instead, "
                    "did you miss an 's'?)")
            if method_is_overridden(cls, Spider, 'make_requests_from_url'):
                warnings.warn(
                    "Spider.make_requests_from_url method is deprecated; it "
                    "won't be called in future Scrapy releases. Please "
                    "override Spider.start_requests method instead "
                    f"(see {cls.__module__}.{cls.__name__}).",
                )
                for url in self.start_urls:
                    yield self.make_requests_from_url(url)
            else:
                for url in self.start_urls:
                    # dont_filter=True tells the scheduler not to
                    # deduplicate this request.
                    yield Request(url, dont_filter=True)

        # This method is deprecated.
        def make_requests_from_url(self, url):
            """ This method is deprecated. """
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated: "
                "it will be removed and not be called by the default "
                "Spider.start_requests method in future Scrapy releases. "
                "Please override Spider.start_requests method instead."
            )
            return Request(url, dont_filter=True)

        def _parse(self, response, **kwargs):
            return self.parse(response, **kwargs)

        # Default callback for Request objects; handles the returned response
        # and generates Items or Request objects. Users must implement this
        # method.
        def parse(self, response, **kwargs):
            raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')

    Main attributes and methods

  • name

The attribute defining the spider's name.
Naming convention: if the spider crawls the domain mywebsite.com, it is usually named mywebsite.

  • allowed_domains

Optional. A list of the domains the spider is allowed to crawl.

  • start_urls

The list of initial URLs. When no specific URLs are given, the spider starts crawling from the URLs in this list.

  • start_requests(self)

This method must return an iterable containing the first Request objects the spider will use to crawl.
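A simplified plain-Python version of the start_requests logic in the listing above, yielding bare URL strings instead of Request objects. The MiniSpider/GoodSpider/TypoSpider names are made up for illustration; note how the base class also guards against the common 'start_url' misspelling:

```python
class MiniSpider:
    """Simplified start_requests, following the source listing above.
    Yields bare URL strings; real Scrapy yields Request objects."""
    start_urls = []

    def start_requests(self):
        # Guard against defining 'start_url' instead of 'start_urls'.
        if not self.start_urls and hasattr(self, 'start_url'):
            raise AttributeError(
                "Crawling could not start: 'start_urls' not found or empty "
                "(but found 'start_url' attribute instead, "
                "did you miss an 's'?)")
        for url in self.start_urls:
            yield url  # real code: yield Request(url, dont_filter=True)

class GoodSpider(MiniSpider):
    start_urls = ['http://example.com/a', 'http://example.com/b']

class TypoSpider(MiniSpider):
    start_url = ['http://example.com/a']  # missing the trailing 's'

print(list(GoodSpider().start_requests()))  # one entry per start URL
try:
    list(TypoSpider().start_requests())
except AttributeError as e:
    print('caught:', e)
```

Because start_requests is a generator, it satisfies the "must return an iterable" requirement; returning a plain list of Requests would work just as well.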

  • parse(self, response)

The default callback for Request objects whose requests did not specify one. It handles the Response returned for the page and generates Items or further Request objects.
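Putting the attributes together, here is a minimal subclass sketch. The base class below is a cut-down stand-in for the listing above so the snippet runs on its own; in a real project you would `import scrapy`, subclass scrapy.Spider, and parse would receive a Scrapy Response (the MywebsiteSpider name and URL are illustrative):

```python
# Cut-down stand-in for the Spider base class shown in the source listing.
class Spider:
    name = None
    start_urls = []

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError(f"{type(self).__name__} must have a name")

    def parse(self, response, **kwargs):
        raise NotImplementedError(
            f'{type(self).__name__}.parse callback is not defined')

class MywebsiteSpider(Spider):
    name = 'mywebsite'                    # named after the target domain
    allowed_domains = ['mywebsite.com']
    start_urls = ['http://mywebsite.com/']

    def parse(self, response, **kwargs):
        # In real Scrapy you would extract data with response.css()/.xpath()
        # and yield Items and follow-up Requests here.
        yield {'url': response}

spider = MywebsiteSpider()
print(spider.name)
print(list(spider.parse('http://mywebsite.com/')))
```

Forgetting the name attribute (and not passing one to __init__) raises ValueError at construction time, exactly as in the real base class.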