Scrapy from Getting Started to Giving Up - "Python from Getting Started to Giving Up"

Creating a project
Creating a spider file
Editing the settings file
Parsing data with XPath
Data storage
Pipelines
Pipeline-based storage
Log level

Installation (may fail on Windows)

pip install scrapy

Creating a project

scrapy startproject MyScrapy

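Creating a spider file

Change into the project directory, then run:

scrapy genspider first www.baidu.com
-- This generates a file named `first.py` in the spiders directory.
-- The domain given as the last argument is arbitrary and can be changed later.

The contents of first.py: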

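# -*- coding: utf-8 -*-
import scrapy
class FirstSpider(scrapy.Spider):
    name = 'first'        # unique identifier of this spider
    # ↓↓ Allowed domains: requests to other domains are filtered out. Usually this line can simply be commented out
    allowed_domains = ['www.baidu.com']
    # ↓↓ Start URL list: Scrapy automatically sends a request to every URL in this list
    start_urls = ['http://www.baidu.com/']    # add URLs to this list
    # ↓↓ Parse the response data
    def parse(self, response):
        pass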

Editing the settings file

Do not obey robots.txt
Spoof the User-Agent (UA)

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'MyScrapy (+http://www.yourdomain.com)'
# Following the commented-out line above, paste in a browser UA string    ↑↑
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False    # we are scraping anyway, so leaving this True is out of the question

Parsing data with XPath

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://news.sina.com.cn/', 'https://www.sogou.com/']
    def parse(self, response):
        tag_list = response.xpath('//*[@id="syncad_1"]')
        for h1 in tag_list:
            # ↓↓ Note: unlike plain XPath (e.g. with lxml), xpath() here returns Selector objects; call extract() to get the text
            # Both forms below return the text; [0].extract() raises IndexError when nothing matches, while extract_first() returns None
            title = h1.xpath('./h1/a/text()')[0].extract()
            title2 = h1.xpath('./h1/a/text()').extract_first()
            print(title2)

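Data storage

Via a terminal command
Limitation: only what the parse method returns (or yields) can be written to a file on disk

scrapy crawl first -o file.csv        # the extension must be a supported feed format (json, jsonlines, jl, csv, xml, marshal, pickle); txt and the like are rejected

Via pipelines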

Pipelines

The initial class in the pipeline file MyScrapy/pipelines.py:

class MyscrapyPipeline:
    def process_item(self, item, spider):
        return item

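Each pipeline class in pipelines.py represents storing the data in one kind of destination.
The process_item method of a pipeline class must return item; only then is the item passed on to the next pipeline class.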
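
To make the role of `return item` concrete, here is a minimal sketch that imitates how an item is handed from one pipeline class to the next. The `run_pipelines` helper and all three pipeline classes are hypothetical stand-ins written for illustration; Scrapy's real engine handles the chain internally, but the contract shown is the one described above: whatever process_item returns is what the next pipeline receives.

```python
class UppercasePipeline:
    """Hypothetical first pipeline: normalizes the content and passes the item on."""
    def process_item(self, item, spider):
        item['content'] = item['content'].upper()
        return item  # hand the item to the next pipeline


class ForgetfulPipeline:
    """Hypothetical pipeline that forgets to return the item."""
    def process_item(self, item, spider):
        item['content'] = item['content'].strip()
        # no return statement, so the next pipeline receives None


class CollectPipeline:
    """Hypothetical last pipeline: records whatever it is given."""
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item


def run_pipelines(pipelines, item, spider=None):
    """Simplified stand-in for the pipeline chain (an assumption, not Scrapy's code)."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


good_sink = CollectPipeline()
run_pipelines([UppercasePipeline(), good_sink], {'content': 'hello'})
print(good_sink.seen)   # [{'content': 'HELLO'}]

bad_sink = CollectPipeline()
run_pipelines([ForgetfulPipeline(), bad_sink], {'content': ' hello '})
print(bad_sink.seen)    # [None]
```

In the second run the last pipeline never sees the data, which is why every process_item should end with return item.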

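Pipeline-based storage

Workflow

Parse the data

Wrap the parsed data in an item object

Edit items.py

import scrapy

class MyscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()    # Field is a catch-all type that can hold any value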

In the spider file:

from MyScrapy.items import MyscrapyItem
......
def parse(self, response):
    tag_list = response.xpath('//*[@id="syncad_1"]')
    for h1 in tag_list:
        title = h1.xpath('./h1/a/text()').extract_first()
        item = MyscrapyItem()        # create the item object
        item['content'] = title      # store the parsed data in the item

Note: fields must be accessed as item['xx']; attribute access (item.xx) does not work.

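Submit the item object to the pipeline

......
  item = MyscrapyItem()    # create the item object
  yield item  # submit the item to the pipeline

The process_item method of the pipeline class receives the item object and can then persist it in any form.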

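To store the data in a file, edit pipelines.py:

class MyscrapyPipeline:
    fp = None

    def open_spider(self, spider):
        print('called only once, when the spider starts')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = item['content']    # read the field from the received item
        if content is not None:      # extract_first() returns None when nothing matched
            self.fp.write(content + '\n')    # write to the file, one entry per line
        return item                  # must return item so the next pipeline class receives it

    def close_spider(self, spider):
        print('called only once, when the spider finishes')
        self.fp.close()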

Enable the pipeline in the settings file settings.py

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'MyScrapy.pipelines.MyscrapyPipeline': 300,  # 300 is the priority (customarily 0-1000); lower numbers run first
}

Define a new pipeline class to store a second copy of the data
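
A second pipeline class could look like the sketch below. The class name `JsonLinesPipeline`, the output path `./data.jl`, and the priority 301 are assumptions chosen for illustration; the class uses only the standard library and relies on the first pipeline returning the item so that it arrives here.

```python
import json


class JsonLinesPipeline:
    """Hypothetical second pipeline: stores each item as one JSON object per line."""

    def __init__(self, path='./data.jl'):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # called once when the spider starts
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # keep passing the item along

    def close_spider(self, spider):
        # called once when the spider finishes
        self.fp.close()


# In settings.py, both classes are registered (module path assumed to be MyScrapy.pipelines):
ITEM_PIPELINES = {
    'MyScrapy.pipelines.MyscrapyPipeline': 300,   # lower number, runs first
    'MyScrapy.pipelines.JsonLinesPipeline': 301,  # runs second
}
```

With both entries enabled, each item is written to the text file first and then to the JSON Lines file.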

Log level

Add the following to settings.py

LOG_LEVEL = 'ERROR'        # only output log messages at ERROR level or above
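
Scrapy's logging is built on Python's standard logging module, so the log level acts as a threshold. The sketch below uses only the standard library to show which messages survive an ERROR threshold; the logger name 'demo', the `ListHandler` class, and the sample messages are hypothetical and stand in for what a spider would log.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that collects emitted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f'{record.levelname}: {record.getMessage()}')


logger = logging.getLogger('demo')   # stand-in for a spider's logger
logger.propagate = False             # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.ERROR)       # same threshold as LOG_LEVEL = 'ERROR'

logger.debug('request scheduled')    # dropped
logger.info('spider opened')         # dropped
logger.warning('retrying request')   # dropped
logger.error('parse failed')         # kept
logger.critical('engine stopped')    # kept

print(handler.messages)   # ['ERROR: parse failed', 'CRITICAL: engine stopped']
```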