1. Install with pip:
pip install Scrapy
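If the install succeeded, the scrapy command is on your PATH; you can verify with:
scrapy version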
2. Usage
2.1 Create a project
scrapy startproject bfirepc
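This creates a project skeleton roughly like the following (as generated by the Scrapy 1.x startproject template):

bfirepc/
    scrapy.cfg            # deploy configuration
    bfirepc/              # the project's Python module
        __init__.py
        items.py          # item definitions (see the sketch in 2.2)
        pipelines.py      # item pipelines (edited in section 4)
        settings.py       # project settings (edited in section 4)
        spiders/          # spider code lives here
            __init__.py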
2.2 Write the code
Save the following code as dmoz_spider.py under the bfirepc/spiders directory:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = ''
__author__ = 'BfireLai'
__mtime__ = '2018/4/11'
"""
import scrapy

from ..items import BfirepcItem


class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    # The pages actually crawled live on win4000.com (the original code still
    # had the template's 'dmoz.org' here, which would make the offsite
    # middleware filter any follow-up requests to win4000.com).
    allowed_domains = ['win4000.com']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'www.win4000.com',
        'Referer': 'http://www.win4000.com/meitu.html',
    }

    def start_requests(self):
        urls = [
            # 'http://www.win4000.com/meinv146050.html',
            # 'http://www.win4000.com/meinv146045.html',
            'http://www.win4000.com/meitu.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Save the raw page to disk (disabled):
        # filename = response.url.split("/")[-1]
        # with open(filename, 'wb') as f:
        #     f.write(response.body)

        # Extract one item per gallery entry.
        for a in response.xpath('//div[@class="tab_box"]/*/ul[@class="clearfix"]/li/a'):
            item = BfirepcItem()  # a fresh item per entry, not one shared instance
            item['title'] = a.xpath('p/text()').extract()
            item['link'] = a.xpath('img/@data-original').extract()
            yield item
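The spider imports BfirepcItem from the project's items.py, which startproject leaves as a stub. A minimal sketch declaring the fields used above, plus the paths field the download pipeline in section 4 fills in (field names come from the code; the comments are my reading of it):

import scrapy


class BfirepcItem(scrapy.Item):
    title = scrapy.Field()  # gallery title, from the <p> text
    link = scrapy.Field()   # image URL(s), from img/@data-original
    paths = scrapy.Field()  # set by BfirepcDownloadPipeline after download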
2.3 Crawl
From the project's root directory, start the spider with:
scrapy crawl dmoz
2.4 Learning XPath
Basic XPath syntax (more at http://www.w3school.com.cn/xpath/):
First, XPath paths come in two flavors, absolute and relative:
Absolute path: starts with /, selecting from the root node;
Relative path: starts with //, selecting matching nodes anywhere in the document, regardless of position.
The * wildcard stands for an unknown element. There are also two more node selectors:
. selects the current node; .. selects the parent of the current node.
Next comes picking out a specific branch: when several elements match and you want to single one out,
use [] predicates to select a branch; note that indexing starts at 1, not 0!
For example, all of the following work (you can try them yourself, as shown in the sketch after this list):
/tr/td[1]: selects the first td
/tr/td[last()]: selects the last td
/tr/td[last()-1]: selects the second-to-last td
/tr/td[position()<3]: selects the first and second td
/tr/td[@class]: selects every td that has a class attribute
/tr/td[@class='xxx']: selects every td whose class attribute is 'xxx'
/tr/td[price>10]: selects every td whose price child element has a value greater than 10
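These are easy to try interactively with Scrapy's own Selector; a small sketch (the HTML snippet is made up for illustration):

from scrapy.selector import Selector

html = '<table><tr><td class="a">1</td><td class="b">2</td><td>3</td></tr></table>'
sel = Selector(text=html)
print(sel.xpath('//tr/td[1]/text()').extract())             # ['1'] - first td
print(sel.xpath('//tr/td[last()]/text()').extract())        # ['3'] - last td
print(sel.xpath('//tr/td[position()<3]/text()').extract())  # ['1', '2'] - first two tds
print(sel.xpath('//tr/td[@class]/text()').extract())        # ['1', '2'] - tds with a class attribute
print(sel.xpath('//tr/td[@class="a"]/text()').extract())    # ['1']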
3. Store the data
Now that we have results, the simplest way to store them is with Feed exports,
which support four export formats: JSON, JSON lines, XML, and CSV.
Usage is also simple: just append a few options to the usual scrapy command:
scrapy crawl <spider name> -o <output file> -t <export format>
For example, here I export XML:
scrapy crawl dmoz -o pics.xml -t xml
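The same export can also be configured in settings.py instead of on the command line. A minimal sketch, using the setting names from the Scrapy 1.x era this tutorial targets:

# settings.py
FEED_URI = 'pics.xml'  # where to write the feed
FEED_FORMAT = 'xml'    # one of: json, jsonlines, xml, csv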
4. Download the images
Edit scrapy/bfirepc/bfirepc/settings.py:

ITEM_PIPELINES = {
    # 'bfirepc.pipelines.BfirepcPipeline': 300,
    'bfirepc.pipelines.BfirepcDownloadPipeline': 300,
}
IMAGES_STORE = r'F:\img'  # raw string, so the backslash is not read as an escape
Then edit scrapy/bfirepc/bfirepc/pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class BfirepcPipeline(object):
    def process_item(self, item, spider):
        return item


class BfirepcDownloadPipeline(ImagesPipeline):
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate',
        'accept-language': 'zh-CN,zh;q=0.9,und;q=0.8,en;q=0.7',
        'cookie': 'Hm_lvt_d82cde71b7abae5cbfcb8d13c78b854c=1523436855',
        'referer': 'http://pic1.win4000.com/pic',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'pic1.win4000.com',
    }

    def get_media_requests(self, item, info):
        # Issue one download request per image URL collected by the spider.
        for url in item['link']:
            self.headers['referer'] = url  # the image host checks the referer
            yield Request(url, headers=self.headers)

    def item_completed(self, results, item, info):
        # results is a list of (success, info_or_failure) tuples;
        # keep the storage path of every successful download.
        img_paths = [x['path'] for ok, x in results if ok]
        if not img_paths:
            raise DropItem('Item contains no images')
        item['paths'] = img_paths
        return item
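With this in place, running scrapy crawl dmoz downloads each image into IMAGES_STORE. By default, ImagesPipeline stores files under a full/ subdirectory, named after the SHA1 hash of the image URL, so the paths recorded in item['paths'] can be reproduced by hand (the URL below is hypothetical):

import hashlib

url = 'http://pic1.win4000.com/pic/example.jpg'  # hypothetical image URL
print('full/%s.jpg' % hashlib.sha1(url.encode('utf-8')).hexdigest())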