requests

In the previous section we looked at the basics of urllib, but it is inconvenient in places: handling page authentication or Cookies, for instance, means writing Openers and Handlers. To make these operations easier there is the more powerful requests library, with which Cookies, login authentication, proxy setup and the like become trivial.
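As a taste of how much boilerplate this saves, here is a minimal sketch of keeping Cookies across requests with a Session (httpbin.org is used purely as a test endpoint for illustration):

    import requests

    # A Session keeps cookies across requests automatically -- no Opener/Handler
    # boilerplate as with urllib.
    session = requests.Session()
    session.get('http://httpbin.org/cookies/set/number/123456789')
    r = session.get('http://httpbin.org/cookies')
    print(r.text)  # the cookie set above is sent back automatically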

Installation

    pip install requests

Official documentation: http://docs.python-requests.org/zh_CN

1. A First Example

The urlopen method in urllib actually requests a page with GET; the corresponding method in requests is simply get, which reads much more clearly. Let's look at an example:

    import requests

    r = requests.get('https://www.baidu.com/')
    print(type(r))
    print(r.status_code)
    print(type(r.text))
    print(r.text)
    print(r.cookies)

The other request types can be tried just as easily:

    import requests

    r = requests.post('http://httpbin.org/post')
    r = requests.put('http://httpbin.org/put')
    r = requests.delete('http://httpbin.org/delete')
    r = requests.head('http://httpbin.org/get')
    r = requests.options('http://httpbin.org/get')

2. Passing Parameters in a GET Request

    import requests

    data = {
        'name': 'germey',
        'age': 22
    }
    r = requests.get('http://httpbin.org/get', params=data)
    print(r.text)
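Because httpbin.org echoes the request back as JSON, you can confirm the parameters were attached by parsing the body with r.json() (a small check on the same request as above):

    import requests

    data = {'name': 'germey', 'age': 22}
    r = requests.get('http://httpbin.org/get', params=data)
    # httpbin echoes the query parameters under the "args" key;
    # r.json() parses the JSON body straight into a Python dict.
    print(r.json()['args'])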

2.1 Fetching Binary Data

Let's take an image as an example:

    import requests

    r = requests.get('http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png')
    print(r.text)
    print(r.content)
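r.text decodes the bytes as text and will look garbled for an image, while r.content holds the raw bytes. A minimal sketch of saving the image to disk (same URL as above):

    import requests

    r = requests.get('http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png')
    # Write the raw bytes in binary mode to get a valid image file.
    with open('hh.png', 'wb') as f:
        f.write(r.content)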

If we do not pass any headers, the request will not go through properly:

    import requests

    r = requests.get('https://mmzztt.com/')
    print(r.text)

But if we add headers carrying the User-Agent information, it works fine:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    r = requests.get('https://mmzztt.com/', headers=headers)
    print(r.text)

3. POST Requests

3.1 So far we have covered the most basic GET requests. The other common request type is POST, and sending one with requests is just as simple:

    import requests

    data = {'name': 'germey', 'age': '22'}
    r = requests.post('http://httpbin.org/post', data=data)
    print(r.text)
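httpbin echoes form-encoded data back under the "form" key, so a quick way to verify the POST body is to parse the response (a small sketch; requests also accepts json=data if you want to send a JSON body instead):

    import requests

    data = {'name': 'germey', 'age': '22'}
    r = requests.post('http://httpbin.org/post', data=data)
    # The form fields we sent come back under "form".
    print(r.json()['form'])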

Test site

  • 巨潮网络数据 (cninfo.com.cn)

    import requests

    url = 'http://www.cninfo.com.cn/data20/ints/statistics'
    res = requests.post(url)
    print(res.text)

3.2 After sending a request, what comes back is naturally a response. In the examples above we used text and content to read the response body, but there are many more attributes and methods for other information such as the status code, response headers and Cookies. For example:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    r = requests.get('http://www.jianshu.com', headers=headers)
    print(type(r.status_code), r.status_code)
    print(type(r.headers), r.headers)
    print(type(r.cookies), r.cookies)
    print(type(r.url), r.url)
    print(type(r.history), r.history)

3.3 The status code is commonly used to check whether a request succeeded, and requests provides a built-in status code lookup object, requests.codes. For example:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    r = requests.get('http://www.jianshu.com', headers=headers)
    exit() if not r.status_code == requests.codes.ok else print('Request Successfully')
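An alternative to comparing codes by hand is raise_for_status(), which raises requests.HTTPError for any 4xx/5xx response; a minimal sketch:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    r = requests.get('http://www.jianshu.com', headers=headers)
    try:
        # Raises requests.HTTPError if the status code is 4xx or 5xx.
        r.raise_for_status()
        print('Request Successfully')
    except requests.HTTPError as e:
        print('Request failed:', e)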

3.4 Of course, ok is not the only available name. The return codes and their corresponding lookup names are listed below:

    # Informational status codes
    100: ('continue',),
    101: ('switching_protocols',),
    102: ('processing',),
    103: ('checkpoint',),
    122: ('uri_too_long', 'request_uri_too_long'),
    # Success status codes
    200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
    201: ('created',),
    202: ('accepted',),
    203: ('non_authoritative_info', 'non_authoritative_information'),
    204: ('no_content',),
    205: ('reset_content', 'reset'),
    206: ('partial_content', 'partial'),
    207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
    208: ('already_reported',),
    226: ('im_used',),
    # Redirection status codes
    300: ('multiple_choices',),
    301: ('moved_permanently', 'moved', '\\o-'),
    302: ('found',),
    303: ('see_other', 'other'),
    304: ('not_modified',),
    305: ('use_proxy',),
    306: ('switch_proxy',),
    307: ('temporary_redirect', 'temporary_moved', 'temporary'),
    308: ('permanent_redirect',
          'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0
    # Client error status codes
    400: ('bad_request', 'bad'),
    401: ('unauthorized',),
    402: ('payment_required', 'payment'),
    403: ('forbidden',),
    404: ('not_found', '-o-'),
    405: ('method_not_allowed', 'not_allowed'),
    406: ('not_acceptable',),
    407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
    408: ('request_timeout', 'timeout'),
    409: ('conflict',),
    410: ('gone',),
    411: ('length_required',),
    412: ('precondition_failed', 'precondition'),
    413: ('request_entity_too_large',),
    414: ('request_uri_too_large',),
    415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
    416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
    417: ('expectation_failed',),
    418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
    421: ('misdirected_request',),
    422: ('unprocessable_entity', 'unprocessable'),
    423: ('locked',),
    424: ('failed_dependency', 'dependency'),
    425: ('unordered_collection', 'unordered'),
    426: ('upgrade_required', 'upgrade'),
    428: ('precondition_required', 'precondition'),
    429: ('too_many_requests', 'too_many'),
    431: ('header_fields_too_large', 'fields_too_large'),
    444: ('no_response', 'none'),
    449: ('retry_with', 'retry'),
    450: ('blocked_by_windows_parental_controls', 'parental_controls'),
    451: ('unavailable_for_legal_reasons', 'legal_reasons'),
    499: ('client_closed_request',),
    # Server error status codes
    500: ('internal_server_error', 'server_error', '/o\\', '✗'),
    501: ('not_implemented',),
    502: ('bad_gateway',),
    503: ('service_unavailable', 'unavailable'),
    504: ('gateway_timeout',),
    505: ('http_version_not_supported', 'http_version'),
    506: ('variant_also_negotiates',),
    507: ('insufficient_storage',),
    509: ('bandwidth_limit_exceeded', 'bandwidth'),
    510: ('not_extended',),
    511: ('network_authentication_required', 'network_auth', 'network_authentication')

4. Advanced Usage

1. Adding a Proxy

    import requests

    proxy = {
        'http': 'http://183.162.171.78:4216',
    }
    # Returns the current IP
    res = requests.get('http://httpbin.org/ip', proxies=proxy)
    print(res.text)
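If the proxy requires authentication, or you also request HTTPS sites, the mapping can carry credentials and an 'https' entry as well. The address and credentials below are placeholders, a sketch only:

    import requests

    proxies = {
        # user:password and the address are placeholders -- substitute your own proxy.
        'http': 'http://user:password@183.162.171.78:4216',
        'https': 'http://user:password@183.162.171.78:4216',
    }
    res = requests.get('http://httpbin.org/ip', proxies=proxies)
    print(res.text)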

2. Using Kuaidaili Proxy IPs

Documentation: https://www.kuaidaili.com/doc/dev/quickstart/

After opening the page, keep the default HTTP protocol and choose JSON as the return format. My order is a VIP order, so for stability I pick "stable". Then click "Generate Link" and copy the API link shown below it.

[Figure 1: generating the API link on the Kuaidaili page]
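A rough sketch of wiring the generated link into requests is below. The api_url value and the JSON layout ("data" -> "proxy_list") are assumptions about what the generated link returns; adjust them to match the actual response.

    import requests

    # Hypothetical: paste the API link generated on the Kuaidaili page here.
    api_url = 'PASTE_THE_GENERATED_API_LINK_HERE'

    res = requests.get(api_url).json()
    # Assumed JSON shape: {"data": {"proxy_list": ["ip:port", ...]}}
    proxy_ip = res['data']['proxy_list'][0]
    proxies = {'http': f'http://{proxy_ip}', 'https': f'http://{proxy_ip}'}
    print(requests.get('http://httpbin.org/ip', proxies=proxies).text)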

3. Disabling Warnings

    from requests.packages import urllib3

    urllib3.disable_warnings()
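disable_warnings() silences the InsecureRequestWarning that urllib3 emits when certificate verification is turned off, which typically looks like this (a minimal sketch):

    import requests
    from requests.packages import urllib3

    urllib3.disable_warnings()
    # verify=False skips TLS certificate verification; without the call above,
    # every such request would print an InsecureRequestWarning.
    r = requests.get('https://httpbin.org/get', verify=False)
    print(r.status_code)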

Crawler Workflow

5. A Basic Crawler

    # encoding: utf-8
    """
    @author: 夏洛
    @QQ: 1972386194
    @file: 初级.py
    """
    # https://36kr.com/information/technology
    import requests
    from lxml import etree


    def main():
        # 1. Define the page URLs and the parsing rule
        crawl_urls = [
            'https://36kr.com/p/1328468833360133',
            'https://36kr.com/p/1328528129988866',
            'https://36kr.com/p/1328512085344642'
        ]
        parse_rule = "//h1[contains(@class,'article-title margin-bottom-20 common-width')]/text()"
        for url in crawl_urls:
            # 2. Send the HTTP request
            response = requests.get(url)
            # 3. Parse the HTML
            result = etree.HTML(response.text).xpath(parse_rule)[0]
            # 4. Save the result
            print(result)


    if __name__ == '__main__':
        main()

6. Full-Site Crawling

6.1 Wrapping a Shared Base Module

    import random
    import time

    import requests
    from retrying import retry
    from requests.packages.urllib3.exceptions import InsecureRequestWarning
    from lxml import etree

    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


    # https://diag.qichacha.com/  (browser information)
    class FakeChromeUA:
        first_num = random.randint(55, 62)
        third_num = random.randint(0, 3200)
        fourth_num = random.randint(0, 140)
        os_type = [
            '(Windows NT 6.1; WOW64)', '(Windows NT 10.0; WOW64)',
            '(X11; Linux x86_64)', '(Macintosh; Intel Mac OS X 10_12_6)'
        ]
        chrome_version = 'Chrome/{}.0.{}.{}'.format(first_num, third_num, fourth_num)

        @classmethod
        def get_ua(cls):
            return ' '.join(['Mozilla/5.0', random.choice(cls.os_type), 'AppleWebKit/537.36',
                             '(KHTML, like Gecko)', cls.chrome_version, 'Safari/537.36'])


    class Spiders(FakeChromeUA):
        urls = []

        @retry(stop_max_attempt_number=3, wait_fixed=2000)
        def fetch(self, url, param=None, headers=None):
            try:
                if not headers:
                    headers = {}
                # Always send a randomized User-Agent
                headers['user-agent'] = self.get_ua()
                self.wait_some_time()
                response = requests.get(url, params=param, headers=headers)
                if response.status_code == 200:
                    response.encoding = 'utf-8'
                    return response
            except requests.ConnectionError:
                return

        def wait_some_time(self):
            # Sleep 100-300 ms between requests
            time.sleep(random.randint(100, 300) / 1000)
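A quick usage check of the base class (assuming the module above is saved as xl/base.py, which is how the next example imports it):

    from xl.base import Spiders

    spider = Spiders()
    response = spider.fetch('https://36kr.com/information/technology')
    if response:
        print(response.status_code)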

6.2 Case Study

    # encoding: utf-8
    """
    @author: 夏洛
    @QQ: 1972386194
    @file: 征战.py
    """
    """Full-site crawler"""
    from urllib.parse import urljoin
    from queue import Queue

    import requests
    from lxml import etree
    from pymongo import MongoClient

    from xl.base import Spiders

    flt = lambda x: x[0] if x else None


    class Crawl(Spiders):
        base_url = 'https://36kr.com/'
        # Seed URL
        start_url = 'https://36kr.com/information/technology'
        # Parsing rules
        rules = {
            # Article list
            'list_urls': '//div[@class="article-item-pic-wrapper"]/a/@href',
            # Detail page content
            'detail_urls': '//div[@class="common-width margin-bottom-20"]//text()',
            # Title
            'title': '//h1[@class="article-title margin-bottom-20 common-width"]/text()',
        }
        # Queue of article URLs
        list_queue = Queue()

        def crawl(self, url):
            """Tag (home) page"""
            response = self.fetch(url)
            list_urls = etree.HTML(response.text).xpath(self.rules['list_urls'])
            for list_url in list_urls:
                # Collect the absolute article URLs
                self.list_queue.put(urljoin(self.base_url, list_url))

        def list_loop(self):
            """Consume the list-page queue"""
            while True:
                list_url = self.list_queue.get()
                print(self.list_queue.qsize())
                self.crawl_detail(list_url)
                # Exit once the queue is empty
                if self.list_queue.empty():
                    break

        def crawl_detail(self, url):
            """Detail page"""
            response = self.fetch(url)
            html = etree.HTML(response.text)
            content = html.xpath(self.rules['detail_urls'])
            title = flt(html.xpath(self.rules['title']))
            print(title)
            data = {
                'content': content,
                'title': title
            }
            self.save_mongo(data)

        def save_mongo(self, data):
            client = MongoClient()  # Connect to MongoDB
            col = client['python']['hh']
            if isinstance(data, dict):
                res = col.insert_one(data)
                return res
            else:
                return 'A single record must be a dict such as {"name": "age"}; got %s' % type(data)

        def main(self):
            # 1. Crawl the tag page, then drain the queue
            self.crawl(self.start_url)
            self.list_loop()


    if __name__ == '__main__':
        s = Crawl()
        s.main()

requests-cache

    pip install requests-cache

When writing crawlers, we often run into situations like these:

  • The site is complex and we end up making many duplicate requests.
  • The crawler sometimes stops unexpectedly, and because the crawl state was not saved, a rerun has to fetch everything again.

Comparison test (plain requests)

    import requests
    import time

    start = time.time()
    session = requests.Session()
    for i in range(10):
        session.get('http://httpbin.org/delay/1')
        print(f'Finished {i + 1} requests')
    end = time.time()
    print('Cost time', end - start)

Comparison test 2 (with requests-cache)

    import requests_cache
    import time

    start = time.time()
    session = requests_cache.CachedSession('demo_cache')
    for i in range(10):
        session.get('http://httpbin.org/delay/1')
        print(f'Finished {i + 1} requests')
    end = time.time()
    print('Cost time', end - start)

However, in the code above we replaced the requests Session object outright. Is there another way, one that leaves the existing code untouched and just adds a few lines of initialization at the top to configure requests-cache?

    import time
    import requests
    import requests_cache

    requests_cache.install_cache('demo_cache')

    start = time.time()
    session = requests.Session()
    for i in range(10):
        session.get('http://httpbin.org/delay/1')
        print(f'Finished {i + 1} requests')
    end = time.time()
    print('Cost time', end - start)

This time we just call requests-cache's install_cache method, and the rest of the code keeps using the ordinary requests Session as before.
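install_cache patches requests globally, so even module-level requests.get calls go through the cache; requests-cache also marks responses with a from_cache attribute, as this small sketch shows:

    import requests
    import requests_cache

    requests_cache.install_cache('demo_cache')

    r = requests.get('http://httpbin.org/delay/1')
    # False the first time (fetched over the network), True on repeat runs.
    print(getattr(r, 'from_cache', False))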

As we just saw, requests-cache uses SQLite as the default cache backend. Can that be swapped out, say for files or another database?

Of course it can.

For example, to switch the backend to local files:

    requests_cache.install_cache('demo_cache', backend='filesystem')

If you would rather not generate cache files in the working directory, you can use the system cache directory instead:

    requests_cache.install_cache('demo_cache', backend='filesystem', use_cache_dir=True)

Besides the filesystem, requests-cache also supports other backends such as Redis, MongoDB, GridFS and even in-memory storage, each of which needs its corresponding dependency library; see the table below:

Backend     Class          Alias          Dependencies
SQLite      SQLiteCache    'sqlite'
Redis       RedisCache     'redis'        redis-py
MongoDB     MongoCache     'mongodb'      pymongo
GridFS      GridFSCache    'gridfs'       pymongo
DynamoDB    DynamoDbCache  'dynamodb'     boto3
Filesystem  FileCache      'filesystem'
Memory      BaseCache      'memory'

For example, to use Redis the configuration can be rewritten as:

    backend = requests_cache.RedisCache(host='localhost', port=6379)
    requests_cache.install_cache('demo_cache', backend=backend)

For more detailed backend configuration, see the official documentation: https://requests-cache.readthedocs.io/en/stable/user_guide/backends.html#backends

Sometimes we also want certain requests left uncached, for example caching only POST requests and not GET requests. That can be configured like this:

    import time
    import requests
    import requests_cache

    requests_cache.install_cache('demo_cache2', allowable_methods=['POST'])

    start = time.time()
    session = requests.Session()
    for i in range(10):
        session.get('http://httpbin.org/delay/1')
        print(f'Finished {i + 1} requests')
    end = time.time()
    print('Cost time for get', end - start)

    start = time.time()
    for i in range(10):
        session.post('http://httpbin.org/delay/1')
        print(f'Finished {i + 1} requests')
    end = time.time()
    print('Cost time for post', end - start)

We can also match on URLs, for example specifying how long URLs matching a given pattern are cached:

    urls_expire_after = {'*.site_1.com': 30, 'site_2.com/static': -1}
    requests_cache.install_cache('demo_cache2', urls_expire_after=urls_expire_after)
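There is also a global expire_after parameter for a default expiry, which the per-pattern values above override, and the installed cache can be cleared programmatically; a short sketch:

    import requests_cache

    # Cache everything for 300 seconds by default; urls_expire_after (as above)
    # can then override the expiry per URL pattern.
    requests_cache.install_cache('demo_cache2', expire_after=300)

    # Drop everything from the currently installed cache.
    requests_cache.clear()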

That wraps up the common usage: basic configuration, expiry, backends and request filtering. For more detailed usage, see the official documentation: https://requests-cache.readthedocs.io/en/stable/user_guide.html.