0 Introduction
Scrapy is a fast, simple, and extensible web crawling framework.
Website: https://scrapy.org/
Documentation: https://scrapy.org/doc/
1 Installation
pip install scrapy
Note: on Windows, install VCForPython (the Microsoft Visual C++ Compiler for Python) first, then run pip install pywin32 scrapy
On Linux you may need to install python-devel, libffi-devel, and libxslt-devel first
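Once installed, the scrapy command should be on your PATH; printing the version is a quick sanity check:
scrapy version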
2 Basic usage
2.1 Run a spider script directly
cat > myspider.py <<EOF
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield the title of each post on the current page
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}
        # Follow the "next posts" link and parse it with this same method
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
EOF
scrapy runspider myspider.py
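By default, runspider only logs the scraped items. To write them to a file, Scrapy's -o option exports items directly; the filename here (posts.json) is just an example, and its extension determines the export format:
scrapy runspider myspider.py -o posts.json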
2.2 Create a project and write the spider inside it
scrapy startproject tutorial
cd tutorial
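startproject generates a skeleton roughly like the following (the exact file list varies slightly by Scrapy version). Note that spiders live in the inner tutorial/spiders/ directory, which is why the path below starts with tutorial/:
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/          # put your spiders here
            __init__.py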
cat > tutorial/spiders/quotes_spider.py <<EOF
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the raw HTML of each fetched page to a local file
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
EOF
# Run the spider
scrapy crawl quotes
Explanation: scrapy crawl searches the spiders package for the spider whose name is quotes, calls its start_requests method to issue the initial Requests, and runs the default callback parse on each response. If the initial Requests need no extra parameters, you can skip start_requests and list the URLs in a start_urls class attribute instead; Scrapy then calls parse on each of them automatically (see the sketch below).
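For reference, a minimal sketch of the same spider rewritten with the start_urls shortcut (equivalent behavior, assuming the initial requests need no extra parameters):
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Scrapy builds the initial Requests from start_urls
    # and calls parse() on each response automatically.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)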