We will again use the Douban Movie Top 250 crawler from earlier as our example.

The spider is identical to the one used earlier to scrape Douban movies into a plain text file:

```python
# -*- coding: utf-8 -*-
import scrapy
from douban_mongodb.items import DoubanMongodbItem


class Top250Spider(scrapy.Spider):
    name = 'top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        items = response.xpath('//div[@class="item"]')
        for item in items:
            # create a fresh item for every movie instead of reusing one instance
            doubanItem = DoubanMongodbItem()
            title = item.xpath('.//span[@class="title"]/text()').extract_first()
            detail_page_url = item.xpath('./div[@class="pic"]/a/@href').extract_first()
            star = item.xpath('.//span[@class="rating_num"]/text()').extract_first()
            pic_url = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()
            doubanItem['title'] = title
            doubanItem['detail_page_url'] = detail_page_url
            doubanItem['star'] = star
            doubanItem['pic_url'] = pic_url
            yield doubanItem

        # follow the "next page" link until there is none
        next_page = response.xpath('//div[@class="paginator"]//span[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```
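The pagination step at the end of `parse` relies on `response.urljoin` to turn the relative href of the "next page" link into an absolute URL. Its behavior matches the standard library's `urllib.parse.urljoin`; a minimal sketch (the `?start=25&filter=` href is only illustrative of what the site returns):

```python
from urllib.parse import urljoin

# response.urljoin(next_page) resolves the relative "next page" href
# against the URL of the current response, just like urljoin does here
base_url = 'https://movie.douban.com/top250/'
next_href = '?start=25&filter='  # illustrative relative href for page 2

print(urljoin(base_url, next_href))
# → https://movie.douban.com/top250/?start=25&filter=
```

Because the spider always resolves against the current response URL, the same `parse` callback keeps working no matter which page it is on.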

The item definition is the same as well:

```python
# -*- coding: utf-8 -*-
import scrapy


class DoubanMongodbItem(scrapy.Item):
    title = scrapy.Field()
    detail_page_url = scrapy.Field()
    star = scrapy.Field()
    pic_url = scrapy.Field()
```

So is the configuration:

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
ROBOTSTXT_OBEY = False

# Enable the MongoDB storage pipeline
ITEM_PIPELINES = {
    'douban_mongodb.pipelines.DoubanMongodbPipeline': 300,
}
```

The only thing that needs to change is the pipeline.

To connect to the database we need the pymongo module, which can be installed with `pip install pymongo`.

The database connection is initialized in `__init__`, and `process_item` simply converts each item into a dictionary and stores it.

```python
# -*- coding: utf-8 -*-
from pymongo import MongoClient


class DoubanMongodbPipeline(object):
    def __init__(self, databaseIp='127.0.0.1', databasePort=27017, mongodbName='test'):
        client = MongoClient(databaseIp, databasePort)
        # My MongoDB has no password; if yours does, pass the credentials
        # to the client instead:
        # client = MongoClient(databaseIp, databasePort, username=user, password=password)
        self.db = client[mongodbName]

    def process_item(self, item, spider):
        postItem = dict(item)  # convert the item into a plain dictionary
        self.db.scrapy.insert_one(postItem)  # insert one record into the 'scrapy' collection
        return item  # returning the item echoes it to the console; this is optional
```
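If you would rather not hard-code the connection parameters, Scrapy's `from_crawler` hook can read them from `settings.py` instead. The sketch below is a minimal illustration, not the chapter's implementation: the `MONGO_HOST` / `MONGO_PORT` / `MONGO_DB` setting names are hypothetical, and the `SimpleNamespace` object only stands in for the real crawler that Scrapy passes in.

```python
from types import SimpleNamespace


class DoubanMongodbPipeline(object):
    def __init__(self, databaseIp='127.0.0.1', databasePort=27017, mongodbName='test'):
        # store the parameters; the real pipeline would build its
        # MongoClient here exactly as shown above
        self.databaseIp = databaseIp
        self.databasePort = databasePort
        self.mongodbName = mongodbName

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, whose .settings
        # exposes everything defined in settings.py
        s = crawler.settings
        return cls(
            databaseIp=s.get('MONGO_HOST', '127.0.0.1'),   # hypothetical setting name
            databasePort=int(s.get('MONGO_PORT', 27017)),  # hypothetical setting name
            mongodbName=s.get('MONGO_DB', 'test'),         # hypothetical setting name
        )


# stand-in for the crawler Scrapy would pass in; a dict mimics crawler.settings
crawler = SimpleNamespace(settings={'MONGO_HOST': 'localhost',
                                    'MONGO_PORT': 27018,
                                    'MONGO_DB': 'douban'})
pipeline = DoubanMongodbPipeline.from_crawler(crawler)
print(pipeline.databaseIp, pipeline.databasePort, pipeline.mongodbName)
# → localhost 27018 douban
```

With the real Scrapy settings object, `crawler.settings.get(...)` works the same way, so only `settings.py` needs to change when the database moves.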

OK, run the spider, and you can see that all of the Douban Top 250 movie data has been stored in the database:

📃 Fetching data and saving it to MongoDB - Figure 1