Goal: build a crawler that scrapes highly rated movies from Douban and saves them to a file.

URL: https://movie.douban.com/top250

📃 Figure 1

Analyzing the page structure

Analyzing the main structure

📃 Figure 2

Analyzing the "Next page" button:

📃 Figure 3

The nesting is obvious and the page structure is simple, so we can start writing the crawler.
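
Before writing any spider code, the XPath selectors used in the rest of this article can be sanity-checked interactively. This is an optional step not in the original walkthrough; a minimal scrapy shell session might look like the following (the user-agent override is explained in the settings section later):

    scrapy shell "https://movie.douban.com/top250" -s USER_AGENT="Mozilla/5.0"

    # inside the shell:
    >>> response.xpath('//div[@class="item"]')                                   # one selector per movie entry
    >>> response.xpath('//div[@class="item"]//span[@class="title"]/text()').extract_first()
    >>> response.xpath('//span[@class="next"]/a/@href').extract_first()          # relative link to the next page, None on the last page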

Creating the scrapy project

    scrapy startproject douban
    cd douban
    scrapy genspider top250 movie.douban.com/top250

The created project structure is as follows:

📃 Figure 4
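
For reference, the layout that scrapy startproject generates (plus the spider file added by scrapy genspider) roughly looks like this:

    douban/
    ├── scrapy.cfg                # deploy configuration
    └── douban/
        ├── __init__.py
        ├── items.py              # item definitions
        ├── middlewares.py
        ├── pipelines.py          # item pipelines
        ├── settings.py           # project settings
        └── spiders/
            ├── __init__.py
            └── top250.py         # the spider generated by genspider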

Opening the generated spider file top250.py, we can see the following:

    # -*- coding: utf-8 -*-
    import scrapy


    class Top250Spider(scrapy.Spider):
        name = 'top250'
        allowed_domains = ['movie.douban.com/top250']
        start_urls = ['http://movie.douban.com/top250/']

        def parse(self, response):
            pass

Scraping the data

Based on the analysis above, the code that extracts the main data is easy to write. Note that allowed_domains is corrected here to just 'movie.douban.com' (it must be a domain, not a URL path) and start_urls is switched to https:

    # -*- coding: utf-8 -*-
    import scrapy


    class Top250Spider(scrapy.Spider):
        name = 'top250'
        allowed_domains = ['movie.douban.com']
        start_urls = ['https://movie.douban.com/top250/']

        def parse(self, response):
            items = response.xpath('//div[@class="item"]')
            for item in items:
                title = item.xpath('.//span[@class="title"]/text()').extract_first()
                detail_page_url = item.xpath('./div[@class="pic"]/a/@href').extract_first()
                star = item.xpath('.//span[@class="rating_num"]/text()').extract_first()
                pic_url = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()

As analyzed above, the page has a "Next page" button. The logic is: if the button no longer exists, stop crawling; if it does, follow the link and keep parsing:

    # -*- coding: utf-8 -*-
    import scrapy


    class Top250Spider(scrapy.Spider):
        name = 'top250'
        allowed_domains = ['movie.douban.com']
        start_urls = ['https://movie.douban.com/top250/']

        def parse(self, response):
            items = response.xpath('//div[@class="item"]')
            for item in items:
                title = item.xpath('.//span[@class="title"]/text()').extract_first()
                detail_page_url = item.xpath('./div[@class="pic"]/a/@href').extract_first()
                star = item.xpath('.//span[@class="rating_num"]/text()').extract_first()
                pic_url = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()
            # follow the "Next page" link until it disappears on the last page
            next = response.xpath('//div[@class="paginator"]//span[@class="next"]/a/@href').extract_first()
            if next is not None:
                next = response.urljoin(next)
                yield scrapy.Request(next, callback=self.parse)
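
As a side note not in the original post, Scrapy 1.4 and later also provide response.follow, which resolves relative URLs by itself, so the pagination step above could be written a little more compactly:

    # equivalent pagination using response.follow (requires Scrapy >= 1.4)
    next = response.xpath('//span[@class="next"]/a/@href').extract_first()
    if next is not None:
        yield response.follow(next, callback=self.parse)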

Extracting into Items

With the spider body in place, we need to collect the extracted fields into an Item. Edit items.py as follows:

    # -*- coding: utf-8 -*-
    import scrapy


    class DoubanItem(scrapy.Item):
        title = scrapy.Field()
        detail_page_url = scrapy.Field()
        star = scrapy.Field()
        pic_url = scrapy.Field()
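
Items behave like dictionaries; a quick illustration with made-up sample values (not from the original post):

    >>> from douban.items import DoubanItem
    >>> item = DoubanItem(title='肖申克的救赎')   # hypothetical sample value
    >>> item['star'] = '9.7'                       # fields can be set like dict keys
    >>> dict(item)
    {'title': '肖申克的救赎', 'star': '9.7'}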

Modify the spider file top250.py:

    # -*- coding: utf-8 -*-
    import scrapy
    from douban.items import DoubanItem


    class Top250Spider(scrapy.Spider):
        name = 'top250'
        allowed_domains = ['movie.douban.com']
        start_urls = ['https://movie.douban.com/top250/']

        def parse(self, response):
            items = response.xpath('//div[@class="item"]')
            for item in items:
                # create a fresh item for every movie entry
                doubanItem = DoubanItem()
                title = item.xpath('.//span[@class="title"]/text()').extract_first()
                detail_page_url = item.xpath('./div[@class="pic"]/a/@href').extract_first()
                star = item.xpath('.//span[@class="rating_num"]/text()').extract_first()
                pic_url = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()
                doubanItem['title'] = title
                doubanItem['detail_page_url'] = detail_page_url
                doubanItem['star'] = star
                doubanItem['pic_url'] = pic_url
                yield doubanItem
            # follow the "Next page" link until it disappears on the last page
            next = response.xpath('//div[@class="paginator"]//span[@class="next"]/a/@href').extract_first()
            if next is not None:
                next = response.urljoin(next)
                yield scrapy.Request(next, callback=self.parse)

Writing the pipeline

Here we use the csv module to store the data as a table; the pipeline looks like this:

    # -*- coding: utf-8 -*-
    import csv


    class DoubanPipeline:
        def __init__(self):
            # write the header row once when the pipeline is instantiated
            with open("videos.csv", "w", newline='', encoding='utf-8') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow(["电影名", "详情页", "豆瓣评分", "封面图片"])

        def process_item(self, item, spider):
            title = item['title']
            detail_page_url = item['detail_page_url']
            star = item['star']
            pic_url = item['pic_url']
            # append one row per item
            with open("videos.csv", "a", newline='', encoding='utf-8') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([title, detail_page_url, star, pic_url])
            return item
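
Reopening the file for every item works, but a pipeline can also use Scrapy's open_spider/close_spider hooks to keep one file handle open for the whole crawl. A minimal sketch of that variant, assuming the same file name and columns as above:

    # -*- coding: utf-8 -*-
    import csv


    class DoubanPipeline:
        def open_spider(self, spider):
            # called once when the spider starts: open the file and write the header
            self.csvfile = open("videos.csv", "w", newline='', encoding='utf-8')
            self.writer = csv.writer(self.csvfile)
            self.writer.writerow(["电影名", "详情页", "豆瓣评分", "封面图片"])

        def close_spider(self, spider):
            # called once when the spider finishes: close the file
            self.csvfile.close()

        def process_item(self, item, spider):
            self.writer.writerow([item['title'], item['detail_page_url'],
                                  item['star'], item['pic_url']])
            return item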

Of course, saving to a txt file is just as easy with a small rewrite:

    # -*- coding: utf-8 -*-
    import codecs


    class DoubanPipeline:
        def __init__(self):
            with open("videos.txt", "w", newline='', encoding='utf-8') as f:
                f.write("{}\r\n".format("电影名, 详情页, 豆瓣评分, 封面图片"))

        def process_item(self, item, spider):
            title = item['title']
            detail_page_url = item['detail_page_url']
            star = item['star']
            pic_url = item['pic_url']
            txt = "{},{},{},{}".format(title, detail_page_url, star, pic_url)
            with codecs.open("videos.txt", 'a', encoding='utf-8') as f:
                f.write("{}\r\n".format(txt))
            return item
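
As an aside beyond the original write-up, Scrapy's built-in feed exports can dump the yielded items to CSV or JSON without any custom pipeline, e.g.:

    scrapy crawl top250 -o videos.csv
    scrapy crawl top250 -o videos.json

With this approach the column names come from the Item field names rather than the Chinese headers used above.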

Finally, don't forget to enable the pipeline in settings.py (the number sets the pipeline's order when several pipelines are active; any value from 0 to 1000 works):

    ...
    ITEM_PIPELINES = {
        'douban.pipelines.DoubanPipeline': 300,
    }

Running the crawler

Crawling as-is will produce 403 responses, because Douban rejects requests that identify themselves as a crawler. We need to set a browser user agent in settings.py and stop obeying robots.txt:

    ...
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
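
Optionally (not part of the original post), it is also considerate to slow the crawl down a little in settings.py; the values below are only illustrative:

    # throttle requests to reduce load on the site
    DOWNLOAD_DELAY = 1
    AUTOTHROTTLE_ENABLED = True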

OK, start crawling by running:

    scrapy crawl top250

The results of the crawl are as follows.

The csv file:
📃 Figure 5
Preview of the csv:
📃 Figure 6