Goal: build a crawler that scrapes highly rated movies from Douban and saves them to a file.
URL: https://movie.douban.com/top250
Analyzing the page structure
Main structure:
The "Next" button:
The nesting is clear and the page structure is simple, so let's start writing the spider.
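Before writing XPaths it helps to pin down that nesting. Below is a minimal, simplified sketch of the list markup (the class names match the ones the spider queries; the single entry and its values are illustrative), checked with the standard library's ElementTree so the selectors used later have something concrete to refer to:

```python
# Minimal, simplified sketch of the Top 250 list markup that the
# spider's XPaths rely on (real pages carry many more attributes).
import xml.etree.ElementTree as ET

html = """
<ol class="grid_view">
  <li>
    <div class="item">
      <div class="pic">
        <a href="https://movie.douban.com/subject/1292052/">
          <img src="https://img.example/poster.jpg" />
        </a>
      </div>
      <div class="info">
        <span class="title">肖申克的救赎</span>
        <span class="rating_num">9.7</span>
      </div>
    </div>
  </li>
</ol>
"""

root = ET.fromstring(html)
for item in root.iterfind('.//div[@class="item"]'):
    title = item.find('.//span[@class="title"]').text
    href = item.find('./div[@class="pic"]/a').attrib['href']
    print(title, href)
```

Each movie lives in a `div.item`, with the link and poster under `div.pic` and the title and score in sibling spans, which is exactly what the relative XPaths below exploit.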
Creating the Scrapy project
scrapy startproject douban
cd douban
scrapy genspider top250 movie.douban.com/top250
The generated project structure looks like this:
Open the spider file Top250Spider.py
and you can see the generated skeleton:
class Top250Spider(scrapy.Spider):
    name = 'top250'
    # genspider copied the path we passed it into allowed_domains;
    # this should be a bare domain, which is fixed in the next step
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['http://movie.douban.com/top250/']

    def parse(self, response):
        pass
Scraping the data
Based on the analysis above, the code to scrape the main fields is easy to write (note that allowed_domains is trimmed to just the domain, and the start URL is switched to https):
# -*- coding: utf-8 -*-
import scrapy


class Top250Spider(scrapy.Spider):
    name = 'top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        items = response.xpath('//div[@class="item"]')
        for item in items:
            title = item.xpath('.//span[@class="title"]/text()').extract_first()
            detail_page_url = item.xpath('./div[@class="pic"]/a/@href').extract_first()
            star = item.xpath('.//span[@class="rating_num"]/text()').extract_first()
            pic_url = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()
As analyzed above, the page has a "Next" button. The logic is: while the "Next" button exists, follow it and keep parsing; once it no longer exists (on the last page), stop crawling:
# -*- coding: utf-8 -*-
import scrapy


class Top250Spider(scrapy.Spider):
    name = 'top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        items = response.xpath('//div[@class="item"]')
        for item in items:
            title = item.xpath('.//span[@class="title"]/text()').extract_first()
            detail_page_url = item.xpath('./div[@class="pic"]/a/@href').extract_first()
            star = item.xpath('.//span[@class="rating_num"]/text()').extract_first()
            pic_url = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()

        # follow the "Next" link until it disappears on the last page
        next_page = response.xpath('//div[@class="paginator"]//span[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
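The "Next" link in the paginator is a relative URL (on Douban it looks like `?start=25&filter=`, though treat that exact value as an assumption), so `response.urljoin()` is needed to turn it into an absolute URL before requesting it. The resolution rule can be checked outside Scrapy with the standard library:

```python
from urllib.parse import urljoin  # Scrapy's response.urljoin() delegates to this

base = "https://movie.douban.com/top250/"
# A query-only reference keeps the base path and swaps in the new query string
next_page = urljoin(base, "?start=25&filter=")
print(next_page)  # https://movie.douban.com/top250/?start=25&filter=
```

When the last page is reached, the XPath for the "Next" link returns None and no further Request is yielded, which is what ends the crawl.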
Extracting into Items
With the spider body in place, we need to collect the useful fields into an Item. Edit items.py
as follows:
# -*- coding: utf-8 -*-
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    detail_page_url = scrapy.Field()
    star = scrapy.Field()
    pic_url = scrapy.Field()
Then modify the spider file Top250Spider.py:
# -*- coding: utf-8 -*-
import scrapy

from douban.items import DoubanItem


class Top250Spider(scrapy.Spider):
    name = 'top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        items = response.xpath('//div[@class="item"]')
        for item in items:
            # create a fresh Item for each movie instead of reusing one instance
            doubanItem = DoubanItem()
            title = item.xpath('.//span[@class="title"]/text()').extract_first()
            detail_page_url = item.xpath('./div[@class="pic"]/a/@href').extract_first()
            star = item.xpath('.//span[@class="rating_num"]/text()').extract_first()
            pic_url = item.xpath('./div[@class="pic"]/a/img/@src').extract_first()
            doubanItem['title'] = title
            doubanItem['detail_page_url'] = detail_page_url
            doubanItem['star'] = star
            doubanItem['pic_url'] = pic_url
            yield doubanItem

        next_page = response.xpath('//div[@class="paginator"]//span[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Writing the pipeline
Here we save the data as a CSV table. The pipeline looks like this:
# -*- coding: utf-8 -*-
import csv


class DoubanPipeline:
    def __init__(self):
        # create the file and write the header row
        with open("videos.csv", "w", newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(["Title", "Detail page", "Douban rating", "Cover image"])

    def process_item(self, item, spider):
        title = item['title']
        detail_page_url = item['detail_page_url']
        star = item['star']
        pic_url = item['pic_url']
        with open("videos.csv", "a", newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow([title, detail_page_url, star, pic_url])
        return item
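Reopening videos.csv for every single item works, but Scrapy pipelines also support `open_spider`/`close_spider` hooks, which let one file handle stay open for the whole crawl. A sketch of that variant (same columns and file name; the class name `DoubanCsvPipeline` is my own):

```python
import csv


class DoubanCsvPipeline:
    """Variant that opens videos.csv once per crawl instead of once per item."""

    def open_spider(self, spider):
        # called by Scrapy when the spider starts: open the file and write the header
        self.csvfile = open("videos.csv", "w", newline='', encoding='utf-8')
        self.writer = csv.writer(self.csvfile)
        self.writer.writerow(["Title", "Detail page", "Douban rating", "Cover image"])

    def close_spider(self, spider):
        # called by Scrapy when the spider finishes: release the file handle
        self.csvfile.close()

    def process_item(self, item, spider):
        self.writer.writerow([item['title'], item['detail_page_url'],
                              item['star'], item['pic_url']])
        return item
```

Either version plugs into ITEM_PIPELINES the same way; the hook-based one simply avoids 250 open/close cycles.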
Of course, saving as txt is also easy with a small change:
# -*- coding: utf-8 -*-


class DoubanPipeline:
    def __init__(self):
        # newline='' keeps the explicit \r\n line endings from being translated
        with open("videos.txt", "w", newline='', encoding='utf-8') as f:
            f.write("{}\r\n".format("Title, Detail page, Douban rating, Cover image"))

    def process_item(self, item, spider):
        title = item['title']
        detail_page_url = item['detail_page_url']
        star = item['star']
        pic_url = item['pic_url']
        txt = "{},{},{},{}".format(title, detail_page_url, star, pic_url)
        with open("videos.txt", "a", newline='', encoding='utf-8') as f:
            f.write("{}\r\n".format(txt))
        return item
Finally, don't forget to register the pipeline in settings.py
(the number, between 0 and 1000, sets the pipeline's order when several are enabled; lower values run first):
...
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
Starting the crawler
Crawling as-is will get HTTP 403 responses, because Douban blocks the default Scrapy user agent. We need to set a browser user agent in settings.py
and disable robots.txt compliance:
...
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
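While editing settings.py, it is also worth slowing the crawl down so it stays polite to the site. The following standard Scrapy settings do that (the values here are suggestions, not requirements):

```python
# settings.py — optional politeness settings (values are suggestions)
DOWNLOAD_DELAY = 1           # wait about 1 second between requests to the site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server responsiveness
```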
OK, start crawling by running:
scrapy crawl top250
The crawl results are as follows:
The csv file:
Preview of the csv: