一. Course Schedule
- Course content
- Using the logging module
- Tencent recruitment spider case study
- Introducing and using Items
- Sunshine government affairs platform case study
二. Class Notes
1. Using the logging module
The spider file:

import scrapy
import logging

logger = logging.getLogger(__name__)

class QbSpider(scrapy.Spider):
    name = 'qb'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        for i in range(10):
            item = {}
            item['content'] = "haha"
            # logging.warning(item)
            logger.warning(item)
            yield item

The pipeline file:

import logging

logger = logging.getLogger(__name__)

class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # print(item)
        logger.warning(item)
        item['hello'] = 'world'
        return item
To save the log to a local file, set LOG_FILE = './log.log' in the settings file.
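Both files above rely on logging.getLogger(__name__). A minimal sketch of what that gives you, runnable outside Scrapy (the logger names below are hypothetical):

```python
import logging

# getLogger(name) returns one logger object per name, so every module that
# calls logging.getLogger(__name__) gets its own logger, and the log output
# can show whether the spider or the pipeline emitted the record.
spider_logger = logging.getLogger('myspider.spiders.qb')    # hypothetical name
pipeline_logger = logging.getLogger('myspider.pipelines')   # hypothetical name

# Asking for the same name again returns the very same object.
assert logging.getLogger('myspider.spiders.qb') is spider_logger

# The record is tagged with the logger's name, which is why using
# __name__ in each file makes the source of a message easy to trace.
spider_logger.warning({'content': 'haha'})
```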
Setting the log style with basicConfig:
https://www.cnblogs.com/felixzh/p/6072417.html
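As a minimal standalone sketch (plain standard library, outside Scrapy), the output style can be set once with logging.basicConfig; inside a Scrapy project the equivalent knobs are settings such as LOG_FORMAT and LOG_LEVEL:

```python
import logging

# Sketch: configure the log style once with basicConfig.
# (In a Scrapy project you would use the LOG_FORMAT / LOG_LEVEL settings
# instead; basicConfig applies to plain-Python scripts.)
logging.basicConfig(
    level=logging.WARNING,
    format='%(asctime)s %(levelname)s [%(name)s] %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
)

logger = logging.getLogger(__name__)
logger.warning({'content': 'haha'})
```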
Review

How to request the next page

2. Tencent spider case study
By scraping the job postings on Tencent's recruitment pages, we learn how to make pagination requests.
http://hr.tencent.com/position.php
Create the project:
    scrapy startproject tencent
Create the spider:
    scrapy genspider hr tencent.com
2.1 Key points of scrapy.Request
scrapy.Request(url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, flags=None)

Commonly used parameters:
- callback: the parse function that the response for this URL is handed to
- meta: passes data between different parse functions; by default meta also carries some built-in information, such as the download delay and the request depth
- dont_filter: stops Scrapy's deduplication from filtering this URL; Scrapy deduplicates URLs by default, so this is important for URLs that need to be requested repeatedly
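A pagination request first needs the next page's URL. As a sketch, the helper below (hypothetical, standard library only, not part of Scrapy) increments a numeric page query parameter; in the spider's parse method you would then yield scrapy.Request(next_url, callback=self.parse):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url, param='page'):
    """Hypothetical helper: return the URL with its numeric page
    query parameter incremented by one (missing param counts as 0)."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    page = int(query.get(param, ['0'])[0]) + 1
    query[param] = [str(page)]
    flat = {k: v[0] for k, v in query.items()}
    return urlunparse(parts._replace(query=urlencode(flat)))
```

For example, feeding it the Sunshine platform list URL with page=0 yields the same URL with page=1, which the spider can hand straight to scrapy.Request.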
items.py:

import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
4. Sunshine government affairs platform case study
http://wz.sun0769.com/index.php/question/questionType?type=4&page=0
