We continue scraping Douban. I'll skip project creation and the like, and go straight to writing the spider.

The spider's basic structure is the same as before:

```python
# -*- coding: utf-8 -*-
import scrapy

from douban_images.items import DoubanImagesItem


class ImagesSpider(scrapy.Spider):
    name = 'images'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        items = response.xpath('//div[@class="item"]')
        for item in items:
            # Create a fresh item for each movie; reusing a single
            # instance across iterations risks every yielded item
            # ending up with the data of the last movie processed.
            douban_item = DoubanImagesItem()
            douban_item['name'] = item.xpath('.//span[@class="title"]/text()').extract_first()
            douban_item['pic_url'] = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()
            yield douban_item
        # Follow the "next page" link until we run out of pages.
        next_page = response.xpath('//div[@class="paginator"]//span[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```
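The pagination step depends on `response.urljoin` turning the relative `href` of the "next" link into an absolute URL. Outside Scrapy, the same resolution can be sketched with the standard library (the href value below is an assumed example, not captured from the live page):

```python
from urllib.parse import urljoin

# Hypothetical values: the page being parsed and a relative href of the
# kind the "next" XPath would return on the first page of the list.
current_page = 'https://movie.douban.com/top250/'
next_href = '?start=25&filter='

# urljoin resolves the relative reference against the current page URL,
# exactly what response.urljoin does inside the spider.
next_url = urljoin(current_page, next_href)
print(next_url)  # https://movie.douban.com/top250/?start=25&filter=
```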

items.py is also unchanged:

```python
# -*- coding: utf-8 -*-
import scrapy


class DoubanImagesItem(scrapy.Item):
    name = scrapy.Field()
    pic_url = scrapy.Field()
```

pipelines.py is where things get different. Code first, explanation after:

```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image URL carried by the item, passing the movie
        # name along in meta so file_path can use it below.
        yield Request(item['pic_url'], meta={'name': item['name']})

    def file_path(self, request, response=None, info=None):
        # Rename the downloaded file. Without overriding this method,
        # the image is saved under a hash of its URL, i.e. a jumble of
        # unreadable characters.
        ext = request.url.split('.')[-1]
        name = request.meta['name'].strip()
        return '{}.{}'.format(name, ext)
```

The pipeline inherits from Scrapy's ImagesPipeline class. In it we implement get_media_requests(self, item, info), which takes each image link the spider yields and turns it into a download request.
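The renaming in file_path is plain string work; here is a stand-alone sketch of the same logic, with a made-up poster URL and title so it runs without Scrapy:

```python
def build_filename(pic_url, movie_name):
    # Mimic the file_path override: '<movie name>.<url extension>'.
    ext = pic_url.split('.')[-1]
    return '{}.{}'.format(movie_name.strip(), ext)

# Hypothetical poster URL and title, for illustration only.
print(build_filename(
    'https://img1.doubanio.com/view/photo/p480747492.jpg',
    ' 肖申克的救赎 '))
# 肖申克的救赎.jpg
```

Note that splitting on '.' assumes the URL ends in a bare extension; a URL carrying a query string would need to be parsed with urllib.parse first.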

The main code is ready; all that remains is editing the configuration file, settings.py:

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
ROBOTSTXT_OBEY = False
# Enable the image download pipeline
ITEM_PIPELINES = {
    'douban_images.pipelines.DoubanImagesPipeline': 300,
}
# Directory where downloaded images are stored
IMAGES_STORE = 'F:/images'
```

Scrapy provides the IMAGES_STORE setting to define where downloaded images are stored. It accepts either an absolute path or a relative one, resolved against the project root.
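To make "relative to the project root" concrete, a small path sketch (the project location below is made up):

```python
from pathlib import PurePosixPath

# Hypothetical project location, for illustration only.
project_root = PurePosixPath('/home/user/douban_images')

# A relative IMAGES_STORE resolves under the project directory...
relative_store = 'images'
print(project_root / relative_store)  # /home/user/douban_images/images

# ...while an absolute IMAGES_STORE is used as-is.
absolute_store = PurePosixPath('/data/images')
print(absolute_store)  # /data/images
```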

Start the spider: scrapy crawl images

Once the crawl finishes, the target folder holds all 250 images:

📃 Batch image download - Figure 1