https://github.com/scrapy/scrapy
https://github.com/scrapy/dirbot
Tutorial
https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/media-pipeline.html
http://kaito-kidd.com/
elasticsearch-py tutorial
http://fingerchou.com/2017/08/12/elasticsearch-dsl-with-python-usage-1/
https://github.com/kaito-kidd/fast-translation
http://wsbs.sz.gov.cn/shenzhen/project/index?service=c
http://wsbs.sz.gov.cn/shenzhen/icity/project/itemlist?dept_id=007542689&type=all
1. Install with pip:
pip install Scrapy
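You can verify the installation afterwards by checking the version:
scrapy version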
2. Usage
2.1 Create a project
scrapy startproject bfirepc
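This generates a project skeleton roughly like the following (layout as of Scrapy 1.x):
bfirepc/
    scrapy.cfg            # deploy configuration
    bfirepc/              # the project's Python module
        __init__.py
        items.py          # item definitions (see the sketch in 2.2)
        middlewares.py
        pipelines.py      # item pipelines (used in section 4)
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py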
2.2 Write the spider code
Save the following code as dmoz_spider.py under the bfirepc/spiders directory:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = ''
__author__ = 'BfireLai'
__mtime__ = '2018/4/11'
"""
import scrapy
from ..items import BfirepcItem

class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['win4000.com']  # the site actually crawled
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'www.win4000.com',
        'Referer': 'http://www.win4000.com/meitu.html',
    }

    def start_requests(self):
        urls = [
            # 'http://www.win4000.com/meinv146050.html',
            # 'http://www.win4000.com/meinv146045.html',
            'http://www.win4000.com/meitu.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # To simply save the raw page instead:
        # filename = response.url.split("/")[-1]
        # with open(filename, 'wb') as f:
        #     f.write(response.body)

        # Extract title and image URL from each gallery entry
        a_space = response.xpath('//div[@class="tab_box"]/*/ul[@class="clearfix"]/li/a')
        for a in a_space:
            item = BfirepcItem()  # create a fresh item per entry, so yielded items stay independent
            item['title'] = a.xpath('p/text()').extract()
            item['link'] = a.xpath('img/@data-original').extract()
            yield item
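The spider imports BfirepcItem from the project's items.py, which is not shown above. A minimal sketch covering the fields this tutorial uses (title and link set by the spider, plus paths filled in by the download pipeline in section 4) would be:
# bfirepc/items.py
import scrapy

class BfirepcItem(scrapy.Item):
    title = scrapy.Field()  # image title text
    link = scrapy.Field()   # image URL(s) scraped from data-original
    paths = scrapy.Field()  # set later by BfirepcDownloadPipeline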
2.3 Crawl
From the project root directory, run the following command to start the spider:
scrapy crawl dmoz
2.4 Learning XPath
Basic XPath syntax (more at http://www.w3school.com.cn/xpath/):
XPath paths come in two flavors, absolute and relative:
Absolute path: starts with /, selecting from the root node;
Relative path: starts with //, selecting matching nodes anywhere in the document, regardless of position.
The * wildcard stands for an unknown element. There are also two special node selectors:
. selects the current node; .. selects the parent of the current node.
Next come predicates for narrowing a match: when several elements match and you want
to pin down a single one, use square brackets []. Note that indices start at 1!
For example (try them in the Scrapy shell, shown after this list):
/tr/td[1]: the first td
/tr/td[last()]: the last td
/tr/td[last()-1]: the second-to-last td
/tr/td[position()<3]: the first and second td
/tr/td[@class]: td elements that have a class attribute
/tr/td[@class='xxx']: td elements whose class attribute equals 'xxx'
/tr/td[price>10]: td elements whose price child element has a value greater than 10
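A convenient way to experiment with these expressions is Scrapy's interactive shell, e.g. against the list page used above (the site may return empty results without the User-Agent header set in the spider):
scrapy shell 'http://www.win4000.com/meitu.html'
>>> response.xpath('//div[@class="tab_box"]/*/ul[@class="clearfix"]/li/a/p/text()').extract()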
3. Storing the data
Now that we have results, the simplest way to store them is with Feed exports,
which support four formats: JSON, JSON lines, XML, and CSV.
Usage is simple; just append a few options to the usual scrapy command:
scrapy crawl <spider name> -o <output file> -t <format>
For example, to export XML here:
scrapy crawl dmoz -o pics.xml -t xml
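The -t flag can also be omitted, since Scrapy infers the format from the output file's extension:
scrapy crawl dmoz -o pics.json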
4. Downloading the images
Modify scrapy/bfirepc/bfirepc/settings.py:
ITEM_PIPELINES = {
    # 'bfirepc.pipelines.BfirepcPipeline': 300,
    'bfirepc.pipelines.BfirepcDownloadPipeline': 300,
}
IMAGES_STORE = r'F:\img'  # raw string so the backslash is not treated as an escape
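Note that ImagesPipeline relies on the Pillow library for image processing, so install it if it is missing:
pip install Pillow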
Then modify scrapy/bfirepc/bfirepc/pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request

class BfirepcPipeline(object):
    def process_item(self, item, spider):
        return item

class BfirepcDownloadPipeline(ImagesPipeline):
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate',
        'accept-language': 'zh-CN,zh;q=0.9,und;q=0.8,en;q=0.7',
        'cookie': 'Hm_lvt_d82cde71b7abae5cbfcb8d13c78b854c=1523436855',
        'referer': 'http://pic1.win4000.com/pic',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'pic1.win4000.com',
    }

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL collected by the spider
        for url in item['link']:
            self.headers['referer'] = url  # use the image URL itself as the Referer
            yield Request(url, headers=self.headers)

    def item_completed(self, results, item, info):
        # Keep the storage paths of the successfully downloaded images
        img_paths = [x['path'] for ok, x in results if ok]
        if not img_paths:
            raise DropItem("Item contains no downloaded images")
        item['paths'] = img_paths
        return item
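With both files in place, run scrapy crawl dmoz again. By default, ImagesPipeline saves each downloaded file under IMAGES_STORE in a full/ subdirectory, with a filename derived from a SHA1 hash of the image URL, e.g. F:\img\full\<sha1>.jpg; those relative paths are what item_completed collects into item['paths'].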