# First Example: Generating a Scrapy Spider

- Working with the Scrapy framework is mostly a matter of writing configuration-style code.

Target HTML page: http://python123.io/ws/demo.html

## Create the project directory

- Work from the command line:

```powershell
python -m scrapy startproject python123demo

New Scrapy project 'python123demo', using template directory 'E:\Environment\anaconda\lib\site-packages\scrapy\templates\project', created in:
    D:\Learning\CS\Py Learning\爬虫\week4-scrapy\python123demo

You can start your first spider with:
    cd python123demo
    scrapy genspider example example.com

cd python123demo
ls

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         2020/5/14     9:06                python123demo
-a----         2020/5/14     9:06            269 scrapy.cfg
```

### Project directory contents

![image.png](https://cdn.nlark.com/yuque/0/2020/png/805730/1589418501984-a77016cb-dc68-4cb2-966b-d7ccf47442ec.png#align=left&display=inline&height=334&margin=%5Bobject%20Object%5D&name=image.png&originHeight=562&originWidth=1256&size=188908&status=done&style=none&width=746)
![image.png](https://cdn.nlark.com/yuque/0/2020/png/805730/1589418535666-64ac35b4-0e5d-4941-824d-8064457e0599.png#align=left&display=inline&height=209&margin=%5Bobject%20Object%5D&name=image.png&originHeight=305&originWidth=1091&size=87315&status=done&style=none&width=746)

## Generate a spider inside the project

```powershell
>>> python -m scrapy genspider demo python123.io
```

What this command does:

- Generates a spider named demo
- Adds the code file demo.py under the spiders/ directory

The command only generates demo.py; the same file could also be written by hand.

Contents of the generated demo.py:

```python
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'                           # spider name used by `scrapy crawl`
    allowed_domains = ['python123.io']      # only URLs under these domains are crawled
    start_urls = ['http://python123.io/']   # initial URLs to fetch

    def parse(self, response):
        # called with each downloaded Response; extraction logic goes here
        pass
```
- Modify demo.py:

```python
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # save the downloaded page under the last segment of its URL (demo.html)
        file_path = response.url.split('/')[-1]
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % file_path)
```
- Run the spider:

```powershell
>>> python -m scrapy crawl demo
```

After the crawl finishes, the fetched HTML page demo.html is written to disk in the directory where the command was run.

- The complete version of the demo.py code (a fuller, equivalent form; see the sketch below)
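As a minimal sketch of this fuller form (assuming the same demo page as above), the start_urls shortcut is replaced by an explicit start_requests() generator that yields scrapy.Request objects:

```python
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # explicit generator replacing the start_urls attribute
        urls = ['http://python123.io/ws/demo.html']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        file_path = response.url.split('/')[-1]
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % file_path)
```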

- The difference between the two equivalent versions: the start_urls list is a convenient shortcut, while the explicit start_requests() generator gives finer control and scales better when there are very many initial URLs.

## The yield keyword

A function that contains yield is a generator: each time it is iterated it produces one value, then pauses until the next value is requested.

- yield is often used together with a for loop:

```python
>>> def gen(n):
...     for i in range(n):
...         yield i ** 2
...
>>> for i in gen(5):
...     print(i, " ", end='')
...
0 1 4 9 16
```

The same output can also be produced without a generator, by building the whole list first:

```python
>>> def gen(n):
...     lst = [i ** 2 for i in range(n)]
...     return lst
...
>>> for i in gen(5):
...     print(i, ' ', end='')
...
0 1 4 9 16
```

The generator form is preferable for large n, because values are produced one at a time instead of holding the whole list in memory.
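In Scrapy the same idea applies directly: parse() is normally written as a generator that yields items (and further Request objects) one at a time instead of returning them all at once. A minimal sketch, not taken from the notes; the spider name and the link-extraction selector are illustrative assumptions:

```python
import scrapy


class LinksSpider(scrapy.Spider):
    """Hypothetical spider: yields one item per link on the demo page."""
    name = 'links'
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # each yield hands one result back to the Scrapy engine
        for href in response.css('a::attr(href)').extract():
            yield {'link': response.urljoin(href)}
```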

# Basic Usage of Scrapy Spiders

## Usage steps

In outline: create a project and a spider template, write the spider's parsing code, write item pipelines as needed, and tune the configuration.

## Data types

A Scrapy crawl moves data around as three types: Request, Response, and Item.

### The Request class

A Request object represents one HTTP request; it is generated by the spider and submitted to the downloader. Its commonly used members include `.url`, `.method`, `.headers`, `.body`, `.meta`, and `.copy()`, as sketched below.
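A minimal standalone sketch (assuming only that Scrapy is installed; the URL is the demo page used above) that builds a Request by hand to show those members:

```python
import scrapy

# Build a Request object directly to inspect its commonly used attributes.
req = scrapy.Request(url='http://python123.io/ws/demo.html')

print(req.url)      # requested URL
print(req.method)   # HTTP method, 'GET' by default
print(req.headers)  # request headers
print(req.body)     # request body, b'' for a plain GET
print(req.meta)     # user-supplied dict carried along with the request
req2 = req.copy()   # copy() returns an independent copy of the request
```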

### The Response class

A Response object represents one HTTP response; it is produced by the downloader and passed to the spider's callback. Its commonly used members include `.url`, `.status`, `.headers`, `.body`, `.flags`, and `.copy()`, as sketched below.
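A minimal sketch that constructs a Response by hand (inside a spider, Scrapy builds an object like this and passes it to parse()); the tiny HTML body is an illustrative assumption:

```python
from scrapy.http import HtmlResponse

resp = HtmlResponse(url='http://python123.io/ws/demo.html',
                    status=200,
                    body=b'<html><body>demo</body></html>',
                    encoding='utf-8')

print(resp.url)      # URL of the response
print(resp.status)   # HTTP status code, e.g. 200
print(resp.headers)  # response headers
print(resp.body)     # response content as bytes
print(resp.flags)    # list of flags attached to the response
resp2 = resp.copy()  # copy() returns an independent copy of the response
```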

### The Item class

An Item represents one piece of extracted information; it is produced by the spider and consumed by item pipelines, and it behaves like a dictionary whose allowed keys are declared as fields, as in the sketch below.
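A minimal sketch, with the field names (title, link) chosen only for illustration:

```python
import scrapy


class DemoItem(scrapy.Item):
    # allowed keys are declared as Fields
    title = scrapy.Field()
    link = scrapy.Field()


item = DemoItem(title='demo page', link='http://python123.io/ws/demo.html')
print(item['title'])   # accessed like a dictionary
print(dict(item))      # convertible to a plain dict
```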

## Methods for extracting information

Information can be pulled out of a Response with several tools: Beautiful Soup, lxml, regular expressions, XPath selectors, and CSS selectors. XPath and CSS selectors (together with `.re()` on the selection) are available directly on the Response object, as sketched below.
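A short sketch of XPath extraction combined with a regular expression, using a hand-built Response whose one-link HTML body is an assumption:

```python
from scrapy.http import HtmlResponse

resp = HtmlResponse(url='http://python123.io/ws/demo.html',
                    body=b'<html><body><a href="/ws">Python123</a></body></html>',
                    encoding='utf-8')

print(resp.xpath('//a/@href').extract())    # XPath selector         -> ['/ws']
print(resp.xpath('//a/text()').re(r'\w+'))  # regex on the selection -> ['Python123']
```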

### CSS Selectors

A CSS selector locates elements by tag name and attribute, in the form `<HTML response>.css('a::attr(href)').extract()`; see the sketch below.
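A short sketch using the same hand-built Response as above (the HTML body is again an assumption):

```python
from scrapy.http import HtmlResponse

resp = HtmlResponse(url='http://python123.io/ws/demo.html',
                    body=b'<html><body><a href="/ws">Python123</a></body></html>',
                    encoding='utf-8')

# 'tag::attr(name)' selects an attribute value, 'tag::text' selects the text content
print(resp.css('a::attr(href)').extract())  # ['/ws']
print(resp.css('a::text').extract())        # ['Python123']
```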