7. The Scrapy Framework - 7.3 CrawlSpider in Scrapy - Python Crawler (《Python爬虫》)

1. LinkExtractors (link extractors)
2. The Rule class
Crawling the WeChat Mini Program community with CrawlSpider

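In the earlier code, much of our effort went into locating the URL of the next page or of each detail page. Can this step be made simpler?

The idea:
1. Extract the URLs of all a tags from the response.
2. Automatically build Request objects from those URLs and send them to the engine.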
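
To make step 1 concrete, here is a minimal sketch that uses only the Python standard library, not Scrapy itself. The names AnchorCollector and extract_anchor_urls and the sample HTML are assumptions made up for illustration; CrawlSpider performs the equivalent work internally and then turns each URL into a request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links must be made absolute before they can be requested.
            self.urls.append(urljoin(self.base_url, href))


def extract_anchor_urls(html, base_url):
    """Return the absolute URLs of all <a href=...> tags found in html."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    return collector.urls


# Hypothetical page fragment, used only for illustration.
SAMPLE_HTML = """
<a href="/political/politics/index?id=101">post</a>
<a href="politicsNewest?id=1&amp;page=2">next page</a>
<a>no href here</a>
"""
BASE_URL = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1"
urls = extract_anchor_urls(SAMPLE_HTML, BASE_URL)
```

Every URL collected this way would still have to be filtered and wrapped in a request by hand, which is the part that LinkExtractor and Rule take over.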

Goal: learn how to use CrawlSpider by writing a spider.

Command to generate a CrawlSpider: scrapy genspider -t crawl <spider name> <domain>

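1. LinkExtractors (link extractors)

With LinkExtractors, the programmer no longer has to extract the desired URLs by hand and then send the requests. That work can be delegated to LinkExtractors, which find the URLs matching the given rules on every crawled page, so crawling proceeds automatically.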

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = False,  # False by default in current Scrapy releases; older releases used True
    unique = True,
    process_value = None
)

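Main parameters:

allow: allowed URLs. A regular expression (or a list of them); every URL that matches is extracted. If left empty, all links match.
deny: denied URLs. URLs that match this regular expression are never extracted. It takes precedence over allow.
allow_domains: allowed domains. Only URLs under the listed domains are extracted.
deny_domains: denied domains. URLs under the listed domains are never extracted.
restrict_xpaths: restricting XPath expressions. Links are extracted only from the regions of the page that these XPaths select; the result is then filtered together with allow.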
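
The interplay of allow and deny can be approximated with plain regular expressions. The sketch below is a simplification under stated assumptions, not Scrapy's actual implementation: url_passes is a hypothetical helper, and the third URL (with print=1) is invented for the example. It assumes the patterns are searched anywhere in the URL rather than anchored at its start.

```python
import re


def url_passes(url, allow=(), deny=()):
    """Approximate the allow/deny check: deny wins, an empty allow accepts everything."""
    if any(re.search(pattern, url) for pattern in deny):
        return False
    if not allow:
        return True
    return any(re.search(pattern, url) for pattern in allow)


# Pattern for list pages; note the escaped "?" and the "\d+" for the page number.
LIST_PAGE = r"politicsNewest\?id=1&page=\d+"

candidates = [
    "http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2",
    "http://wz.sun0769.com/political/politics/index?id=101",
    # Hypothetical URL: matches allow but is rejected by deny.
    "http://wz.sun0769.com/political/index/politicsNewest?id=1&page=3&print=1",
]

kept = [u for u in candidates if url_passes(u, allow=[LIST_PAGE], deny=[r"print=1"])]
```

Only the first candidate survives: the second does not match allow, and the third is removed by deny even though allow matches it.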

2. The Rule class
Defines the crawling rules of the spider.

class scrapy.spiders.Rule(
    link_extractor, 
    callback = None, 
    cb_kwargs = None, 
    follow = None, 
    process_links = None, 
    process_request = None
)

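Main parameters:

link_extractor: a LinkExtractor object that defines which links are extracted.
callback: the callback to run on the responses of URLs matched by this rule. CrawlSpider uses parse internally to implement its own logic, so do not name your callback parse and do not override parse.
follow: whether the links extracted from the response by this rule should be followed further. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links: the links obtained from link_extractor are passed to this function, which can filter out links that should not be crawled.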
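
As an illustration of process_links, the following sketch filters a list of link objects by their page query parameter. The function keep_first_pages and the max_page cut-off are hypothetical, and SimpleNamespace objects stand in for Scrapy's link objects so that the block runs without Scrapy installed; the only assumption about a link is that it exposes a url attribute.

```python
from types import SimpleNamespace
from urllib.parse import parse_qs, urlsplit


def keep_first_pages(links, max_page=3):
    """Keep only links whose `page` query parameter is missing or <= max_page."""
    kept = []
    for link in links:
        values = parse_qs(urlsplit(link.url).query).get("page", [])
        page = values[0] if values else ""
        # Links without a numeric page parameter are passed through unchanged.
        if not page.isdigit() or int(page) <= max_page:
            kept.append(link)
    return kept


# Stand-ins for link objects; only the .url attribute is used here.
BASE = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page="
sample_links = [SimpleNamespace(url=BASE + str(n)) for n in (1, 2, 5)]
filtered = keep_first_pages(sample_links)
```

In a spider, such a function could be attached with Rule(LinkExtractor(allow=...), process_links=keep_first_pages, follow=True), so that list pages beyond the cut-off are never requested.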

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CygSpider(CrawlSpider):
    name = 'cyg'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
    rules = (
        # List pages: follow them to discover more links; no callback is needed.
        Rule(LinkExtractor(allow=r'http://wz\.sun0769\.com/political/index/politicsNewest\?id=1&page=\d+'), follow=True),
        # Detail pages: hand the response to parse_item.
        Rule(LinkExtractor(allow=r'http://wz\.sun0769\.com/political/politics/index\?id=\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        content_dict = {}
        # The username is the last text node of the header span; fall back to '' if it is absent.
        names = response.xpath("//span[@class='fl details-head']/text()").extract()
        content_dict['username'] = names[-1].strip() if names else ''
        content_dict['content'] = response.xpath("//div[@class='details-box']/pre/text()").extract_first(default='').strip()
        img = response.xpath("//div[@class='clear details-img-list Picture-img']/img/@src").extract()
        # Use a placeholder when the post has no images.
        content_dict['img'] = img if img else 'No image available'
        yield content_dict

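Notes
1. A CrawlSpider template can be created with the command scrapy genspider -t crawl <spider name> <domain>, or written by hand.
2. A CrawlSpider must not define a data-extraction method named parse; CrawlSpider uses that method itself to implement basic URL extraction and related functionality.
3. A Rule object accepts many parameters. The first is a LinkExtractor object holding the URL pattern; callback and follow are the other commonly used ones.

callback: handles the responses of the URLs extracted by the link extractor
follow: whether the responses of the URLs extracted by the link extractor are filtered through rules again

4. When no callback is specified and follow is True, URLs matching the rule are still requested.
5. If several Rules match the same URL, the first matching one in rules is used.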
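
Note 5 has a practical consequence: specific rules should be listed before broad ones. The sketch below models the "first match wins" behaviour with plain regular expressions; RULES and pick_rule are hypothetical names for illustration and are not part of Scrapy, and the broad second pattern is invented for the example.

```python
import re

# (pattern, callback name) pairs, in the same order as they would appear in `rules`.
RULES = [
    (r"/political/politics/index\?id=\d+", "parse_item"),  # specific: detail pages
    (r"/political/", None),                                # broad: follow only, no callback
]


def pick_rule(url, rules=RULES):
    """Return the index of the first rule whose pattern matches, or None."""
    for index, (pattern, _callback) in enumerate(rules):
        if re.search(pattern, url):
            return index
    return None


detail_rule = pick_rule("http://wz.sun0769.com/political/politics/index?id=101")
list_rule = pick_rule("http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2")
no_rule = pick_rule("http://wz.sun0769.com/about")
```

If the two entries were swapped, the broad pattern would claim the detail-page URL first and parse_item would never be chosen for it.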

Crawling the WeChat Mini Program community with CrawlSpider

http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wechatapp.items import WechatappItem


class WxSpider(CrawlSpider):
    name = 'wx'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']
    rules = (
        # List pages: keep following the pagination links.
        Rule(LinkExtractor(allow=r'http://www\.wxapp-union\.com/portal\.php\?mod=list&catid=2&page=\d+'), follow=True),
        # Article pages: \d+ (not \d) so that multi-digit article ids also match.
        Rule(LinkExtractor(allow=r'http://www\.wxapp-union\.com/article-\d+-1\.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        title = response.xpath("//h1[@class='ph']/text()").extract_first()
        author = response.xpath("//p[@class='authors']/a/text()").extract_first()
        # content is a list with one string per paragraph text node.
        content = response.xpath("//td[@id='article_content']/p/text()").extract()
        build_time = response.xpath("//span[@class='time']/text()").extract_first()
        item = WechatappItem(title=title, author=author, content=content, build_time=build_time)
        # Log the item instead of printing it.
        self.logger.debug(item)
        yield item
