//TODO: fix this so it actually runs

Program design steps


Create the project and the Spider template

  1. `scrapy startproject stock_spider`
  2. `cd stock_spider`
  3. `scrapy genspider stocks baidu.com`

Then edit the generated spiders/stocks.py file.

Writing the Spider


  - Before:

```python
# -*- coding: utf-8 -*-
import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass
```

  - After (transcription errors fixed: the class wrapper and `import re` were missing, a quote in the CSS selector was unclosed, the detail-page host is `gupiao.baidu.com`, `key_list` must select `dt` rather than `dd`, the callback is `self.parse_stock`, and `range(len(key_list))` was missing a parenthesis):

```python
# -*- coding: utf-8 -*-
import re

import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # Follow every link whose href contains a stock code like sh600000 / sz000001
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except IndexError:
                continue

    def parse_stock(self, response):
        # Extract the <dt>/<dd> key-value pairs from the stock detail page
        info_dict = {}
        stock_info = response.css('.stock-bets')
        name = stock_info.css('.bets-name').extract()[0]
        key_list = stock_info.css('dt').extract()
        value_list = stock_info.css('dd').extract()
        for i in range(len(key_list)):
            key = re.findall(r'>.*</dt>', key_list[i])[0][1:-5]
            try:
                value = re.findall(r'\d+\.?.*</dd>', value_list[i])[0][0:-5]
            except IndexError:
                value = '--'
            info_dict[key] = value
        info_dict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0] +
                         re.findall(r'\>.*\<', name)[0][1:-1]})
        yield info_dict
```
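To see what the stock-code regex in `parse` actually matches, here is a quick standalone check; the sample hrefs are made up for illustration and are not from the crawled site:

```python
import re

# Hypothetical hrefs similar to those a listing page might contain
hrefs = [
    'http://quote.example.com/sh600000.html',  # Shanghai code -> matches
    'http://quote.example.com/sz000001.html',  # Shenzhen code -> matches
    'http://example.com/about',                # no stock code -> skipped
]

codes = []
for href in hrefs:
    # Same pattern as in parse(): 'sh' or 'sz' followed by six digits
    found = re.findall(r"[s][hz]\d{6}", href)
    if found:
        codes.append(found[0])

print(codes)  # ['sh600000', 'sz000001']
```

Links without a code raise `IndexError` on `[0]` in the spider, which is why the loop wraps the lookup in a `try`/`except` and simply continues.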

Writing the Pipelines

The file (pipelines.py) is under the stock_spider directory.

  - Before:

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class StockSpiderPipeline:
    def process_item(self, item, spider):
        return item
```

  - After: here we add a new class, then register it in the configuration file so Scrapy picks it up.

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class StockSpiderPipeline:
    def process_item(self, item, spider):
        return item


class StockInfoPipeline:
    def open_spider(self, spider):
        self.f = open('StockInfo.txt', 'w')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item
```
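The pipeline can be exercised without running a crawl by calling its hooks by hand, the way Scrapy would. A minimal sketch; the configurable path and the sample item are ours, the tutorial's class hardcodes 'StockInfo.txt':

```python
import os
import tempfile


class StockInfoPipeline:
    """Same logic as the tutorial's pipeline; path made a parameter for the demo."""

    def __init__(self, path='StockInfo.txt'):
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, 'w')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            self.f.write(str(dict(item)) + '\n')
        except Exception:
            pass
        return item


# Drive the open/process/close lifecycle manually
path = os.path.join(tempfile.gettempdir(), 'StockInfo.txt')
pipeline = StockInfoPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({'股票名称': '示例股票', '今开': '10.00'}, spider=None)
pipeline.close_spider(spider=None)

with open(path) as f:
    print(f.read().strip())  # {'股票名称': '示例股票', '今开': '10.00'}
```

Each item is written as one Python dict literal per line, which is easy to produce but not valid JSON; Scrapy's feed exports are the more robust option for structured output.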
  - Edit the configuration file

Register the new pipeline class in settings.py:

```python
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'stock_spider.pipelines.StockInfoPipeline': 300,
}
```

Run

  1. `python -m scrapy crawl stocks`

Optimization: improving performance

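The illustration for this step did not survive, so as a hedged sketch, these are the standard Scrapy settings that performance tuning in settings.py usually touches (the names are real Scrapy settings; the values shown are illustrative, not recovered from the image):

```python
# settings.py: common performance-related options (Scrapy defaults in comments)
CONCURRENT_REQUESTS = 32             # overall concurrent request cap (default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap (default: 8)
DOWNLOAD_DELAY = 0                   # seconds to wait between requests (default: 0)
CONCURRENT_ITEMS = 100               # items processed in parallel per response (default: 100)
```

Raising concurrency speeds up a broad crawl like this one, at the cost of heavier load on the target site; a nonzero DOWNLOAD_DELAY throttles the crawler when politeness matters more than speed.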