Installation and Basic Usage

0 Introduction

Scrapy is a fast, simple, and extensible web crawling framework.

Official site: https://scrapy.org/
Documentation: https://scrapy.org/doc/

1 Installation

pip install scrapy

Note: on Windows, install Microsoft Visual C++ for Python (VCForPython) first, then pip install pywin32 scrapy
On Linux you may need to install python-devel, libffi-devel, and libxslt-devel first
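
To confirm the install worked, a quick sanity check (a minimal sketch; it just prints the installed version):

import scrapy

# If the import succeeds, Scrapy is installed in the current environment.
print(scrapy.__version__)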

2 Basic Usage

2.1 Run a spider script directly

cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
EOF
scrapy runspider myspider.py
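
The CSS extraction above is easiest to see on a small piece of HTML. Below is a standalone sketch using Scrapy's Selector; the markup here is made up for illustration:

from scrapy.selector import Selector

# Hypothetical markup standing in for one blog post header.
html = '<div class="post-header"><h2><a href="/post">Hello Scrapy</a></h2></div>'

sel = Selector(text=html)
for title in sel.css('.post-header>h2'):
    # 'a ::text' selects the text nodes inside the <a> element.
    print(title.css('a ::text').get())  # -> Hello Scrapy

To save the yielded items to a file instead of reading them from the log, runspider accepts an output option, e.g. scrapy runspider myspider.py -o posts.json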

2.2 Create a project and write the spider inside it

scrapy startproject tutorial
cd tutorial
cat > tutorial/spiders/quotes_spider.py <<EOF
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
EOF
# Run the spider
scrapy crawl quotes

Explanation: scrapy crawl searches the spiders package for the spider whose name is "quotes" and calls its start_requests method; each yielded Request is downloaded and its response is handed to the default callback, parse. If the requests need no extra arguments, you can drop start_requests entirely and list the URLs in a start_urls attribute, and Scrapy will call parse on every response, as in the sketch below.
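
A minimal sketch of that shorter form (equivalent to the spider above; same project layout assumed):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls, Scrapy builds the initial Requests itself and
    # routes every response to the default parse() callback.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)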
