Reference links:

- https://github.com/scrapy/scrapy
- https://github.com/scrapy/dirbot (tutorial)
- https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/media-pipeline.html
- http://kaito-kidd.com/es-py (tutorial)
- http://fingerchou.com/2017/08/12/elasticsearch-dsl-with-python-usage-1/
- https://github.com/kaito-kidd/fast-translation
- http://wsbs.sz.gov.cn/shenzhen/project/index?service=c
- http://wsbs.sz.gov.cn/shenzhen/icity/project/itemlist?dept_id=007542689&type=all

1、Install with pip:

```
pip install Scrapy
```

2、Steps

2.1 Create a project:

```
scrapy startproject bfirepc
```

2.2 Edit the code. Save the following as dmoz_spider.py under the bfirepc/spiders directory:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = ''
__author__ = 'BfireLai'
__mtime__ = '2018/4/11'
"""
import scrapy
from ..items import BfirepcItem


class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    # left over from the dmoz example; the start requests below target www.win4000.com
    allowed_domains = ['dmoz.org']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'www.win4000.com',
        'Referer': 'http://www.win4000.com/meitu.html',
    }

    def start_requests(self):
        urls = [
            # 'http://www.win4000.com/meinv146050.html',
            # 'http://www.win4000.com/meinv146045.html',
            'http://www.win4000.com/meitu.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Save the raw page to disk (disabled):
        # filename = response.url.split("/")[-1]
        # with open(filename, 'wb') as f:
        #     f.write(response.body)

        # Extract the data
        item = BfirepcItem()
        a_space = response.xpath('//div[@class="tab_box"]/*/ul[@class="clearfix"]/li/a')
        # print(a_space)
        for a in a_space:
            item['title'] = a.xpath('p/text()').extract()
            item['link'] = a.xpath('img/@data-original').extract()
            # print(item)
            yield item
```

2.3 Crawl. Go into the project root directory and start the spider with the following command:

```
scrapy crawl dmoz
```

2.4 Learn XPath

Basic XPath syntax (more at http://www.w3school.com.cn/xpath/):

An XPath path is either absolute or relative:

- Absolute path: starts with /, selecting from the root node;
- Relative path: starts with //, selecting matching nodes anywhere in the document, regardless of position;
- The * wildcard stands for an unknown element.

Two more node selectors are worth knowing:

- . selects the current node;
- .. selects the parent of the current node.

Next comes picking out a specific branch: when several elements match and you want exactly one, use a predicate in square brackets []. Note that indices start at 1, not 0. You can also try expressions interactively before hard-coding them into the spider, as shown in the sketch below.
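Scrapy's interactive shell is handy for testing XPath expressions against a live page before writing them into parse(). A minimal sketch, assuming the listing URL from section 2.2 and the selector used in the spider above (the site may reject Scrapy's default user agent, in which case pass -s USER_AGENT=... to the command):

```python
# Start the shell against the listing page (run in a terminal):
#   scrapy shell 'http://www.win4000.com/meitu.html'
# Inside the shell, `response` is already bound to the downloaded page,
# so the spider's selectors can be evaluated directly:
links = response.xpath('//div[@class="tab_box"]/*/ul[@class="clearfix"]/li/a')
print(links[0].xpath('p/text()').extract())            # title text of the first entry
print(links[0].xpath('img/@data-original').extract())  # its image URL
```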
For example, the predicate supports patterns like these:

- /tr/td[1]: select the first td
- /tr/td[last()]: select the last td
- /tr/td[last()-1]: select the second-to-last td
- /tr/td[position()<3]: select the first and second td
- /tr/td[@class]: select every td that has a class attribute
- /tr/td[@class='xxx']: select every td whose class attribute equals 'xxx'
- /tr/td[price>10]: select every td whose price child element has a value greater than 10

3、Store the data

Once we have results, the simplest way to store them is the Feed exports feature, which supports four export formats: JSON, JSON lines, XML and CSV. Using it is easy; just append a few options when running the usual scrapy command:

```
scrapy crawl <spider name> -o <output file> -t <export format>
```

For example, to export XML here:

```
scrapy crawl dmoz -o pics.xml -t xml
```

4、Download the images

Edit scrapy/bfirepc/bfirepc/settings.py:

```python
ITEM_PIPELINES = {
    # 'bfirepc.pipelines.BfirepcPipeline': 300,
    'bfirepc.pipelines.BfirepcDownloadPipeline': 300,
}
IMAGES_STORE = r'F:\img'
```

Edit scrapy/bfirepc/bfirepc/pipelines.py:

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class BfirepcPipeline(object):
    def process_item(self, item, spider):
        return item


class BfirepcDownloadPipeline(ImagesPipeline):
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate',
        'accept-language': 'zh-CN,zh;q=0.9,und;q=0.8,en;q=0.7',
        'cookie': 'Hm_lvt_d82cde71b7abae5cbfcb8d13c78b854c=1523436855',
        'referer': 'http://pic1.win4000.com/pic',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'pic1.win4000.com',
    }

    def get_media_requests(self, item, info):
        # One download request per image URL collected by the spider
        for url in item['link']:
            # print(url)
            self.headers['referer'] = url
            yield Request(url, headers=self.headers)

    def item_completed(self, results, item, info):
        # Keep only the paths of the images that downloaded successfully
        img_paths = [x['path'] for ok, x in results if ok]
        if not img_paths:
            raise DropItem("item contains no images")
        item['paths'] = img_paths
        return item
```
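Both the spider and the download pipeline assume that BfirepcItem defines title, link and paths fields; bfirepc/items.py is not shown in this post, so the following is a minimal sketch of what it presumably contains (field names taken from the code above):

```python
# -*- coding: utf-8 -*-
# bfirepc/items.py -- assumed definition, not shown in the original post
import scrapy


class BfirepcItem(scrapy.Item):
    title = scrapy.Field()  # picture title, filled from a.xpath('p/text()')
    link = scrapy.Field()   # image URL list, filled from a.xpath('img/@data-original')
    paths = scrapy.Field()  # local file paths, filled in by BfirepcDownloadPipeline
```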