1. How to learn web scraping

2. HTTP and HTTPS basics

3. Getting started with the requests module

Chinese documentation: https://docs.python-requests.org/zh_CN/latest/

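HTTP methods supported by the requests module

  • get, post, put, delete, head, options (plus patch) each have a helper function such as requests.get(); trace and connect have no dedicated helper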

The get method

    import requests

    response = requests.get(url='http://httpbin.org/ip')
    print(response.text)
    # Output:
    # {
    # "origin": "211.142.27.39"
    # }
    print(response.content)
    # Output: b'{\n "origin": "211.142.27.39"\n}\n'

The post method

    import requests

    # The /post endpoint only accepts POST, so call requests.post here.
    # Calling requests.get on this URL is what produced the 405 page below.
    response = requests.post(url='http://httpbin.org/post', data={'name': 'baidu'})
    print(response.text)
    # Output when requests.get is used by mistake:
    # <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    # <title>405 Method Not Allowed</title>
    # <h1>Method Not Allowed</h1>
    # <p>The method is not allowed for the requested URL.</p>
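To see what `data=` puts on the wire without touching the network, the request from the post section can be prepared but not sent. This is a minimal sketch that assumes the `requests` package is installed; the variable names are illustrative only.

```python
import requests

# Build the request for the /post endpoint, but do not send it.
request = requests.Request(
    method='POST',
    url='http://httpbin.org/post',
    data={'name': 'baidu'},
)
prepared = request.prepare()

# The dict passed as data= is form-encoded into the request body.
print(prepared.method)                   # POST
print(prepared.body)                     # name=baidu
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
```

Preparing a request this way is handy for checking the method and body before pointing the code at a real site.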

Building the URL

    import requests

    # params is mainly used with GET requests
    data = {'key1': 'value1', 'key2': 'value2'}
    response = requests.get(url='http://httpbin.org/get', params=data)
    # Check which URL was actually requested
    print(response.url)
    # Output: http://httpbin.org/get?key1=value1&key2=value2
    print(response.headers)
    # Output: {'Date': 'Thu, 03 Jun 2021 03:40:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '377', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
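The query string that `params=` produces can be reproduced by hand. The sketch below uses the standard library's `urllib.parse` rather than requests itself, so it only illustrates the encoding step and is not how requests is implemented internally.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'http://httpbin.org/get'
data = {'key1': 'value1', 'key2': 'value2'}

# Encode the dict as a query string and append it to the base URL.
full_url = base_url + '?' + urlencode(data)
print(full_url)  # http://httpbin.org/get?key1=value1&key2=value2

# Going the other way: split the URL and decode the query string.
query = parse_qs(urlsplit(full_url).query)
print(query)  # {'key1': ['value1'], 'key2': ['value2']}
```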

Image responses

    import requests

    url = 'https://www.imooc.com/static/img/index/logo2020.png'
    response = requests.get(url)
    # Look at the binary data of the image
    print(response.content)
    with open('imooc.png', 'wb') as f:
        # Write the image data to disk
        f.write(response.content)
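A server can answer an image URL with an HTML error page, so it may be worth checking the bytes before saving them. The sketch below is one possible guard; `save_png` is a hypothetical helper, and the byte strings stand in for `response.content` so the example runs offline.

```python
import os
import tempfile

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first 8 bytes of every PNG file


def save_png(data: bytes, path: str) -> bool:
    """Write data to path only if it starts with the PNG signature."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    with open(path, 'wb') as f:
        f.write(data)
    return True


# Stand-ins for response.content: PNG-like bytes and an HTML error page.
fake_png = PNG_SIGNATURE + b'rest of the image bytes'
fake_html = b'<html>404 Not Found</html>'

target = os.path.join(tempfile.mkdtemp(), 'imooc.png')
print(save_png(fake_png, target))   # True
print(save_png(fake_html, target))  # False
```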

Response contents

  • response.url
  • response.status_code
  • response.text
  • response.content
  • response.content.decode()
  • response.content.decode("gbk")
  • response.headers
  • response.json()
  • response.request.headers
  • response.cookies
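The difference between `response.content` (bytes) and the decoded variants in the list comes down to choosing the right codec. A small offline sketch, using a byte string as a stand-in for `response.content`:

```python
# Stand-in for response.content: the bytes a GBK-encoded page would contain.
raw = '中文'.encode('gbk')

# Decoding with the codec the page was written in recovers the text.
print(raw.decode('gbk'))  # 中文

# decode() defaults to UTF-8, which cannot decode these GBK bytes.
try:
    text = raw.decode()
except UnicodeDecodeError:
    text = None
print(text)  # None
```

When `response.text` looks garbled, decoding `response.content` explicitly with the page's real encoding is usually the fix.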

Request code exercise 1

```python
import requests

response = requests.get(url='http://httpbin.org/ip')
print(response.content)
# Output: b'{\n "origin": "211.142.27.39"\n}\n'
print(response.text)
# Output:
# {
# "origin": "211.142.27.39"
# }
print(type(response.text))
# Output: <class 'str'>
print(response.json())
# Output: {'origin': '211.142.27.39'}
print(response.json()['origin'])
# Output: 211.142.27.39
# Inspect the request headers
print(response.request.headers)
# Output:
# {'User-Agent': 'python-requests/2.25.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
print(type(response.request.headers))
# Output: <class 'requests.structures.CaseInsensitiveDict'>
print(response.cookies)
# Output: <RequestsCookieJar[]>
```

cookies

```python
import requests

url = 'http://httpbin.org/cookies'
response = requests.get(url)
print(response.text)
# Output:
# {
# "cookies": {}
# }
# Build the cookies as a dict
cookies = dict(cookies_are='hello imooc')
# Send the cookies along with the request
response = requests.get(url, cookies=cookies)
print(response.text)
# Output:
# {
# "cookies": {
# "cookies_are": "hello imooc"
# }
# }
```

4. Advanced requests usage

Inspecting cookies

    import requests

    # Inspect the cookies set by the server
    url = 'http://www.baidu.com'
    response = requests.get(url)
    print(response.headers)
    # Output:
    # {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Thu, 03 Jun 2021 03:59:48 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
    print(response.cookies)
    # Output: <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
    print(response.cookies['BDORZ'])
    # Output: 27315
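The `Set-Cookie` value in the headers above packs the cookie and its attributes into one string. As an illustration, the standard library's `http.cookies` can take it apart; requests does this for you through `response.cookies`, so this sketch only shows what the header contains.

```python
from http.cookies import SimpleCookie

# The Set-Cookie value copied from the response headers in the notes.
header = 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/'

cookie = SimpleCookie()
cookie.load(header)

morsel = cookie['BDORZ']
print(morsel.value)       # 27315
print(morsel['max-age'])  # 86400
print(morsel['domain'])   # .baidu.com
print(morsel['path'])     # /
```

Here `max-age=86400` means the cookie is valid for one day (86400 seconds).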

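Certificate verification

    import requests

    # Disable SSL certificate verification. verify only matters for https URLs,
    # and an InsecureRequestWarning is emitted when it is set to False.
    response = requests.get(url='https://www.baidu.com', verify=False)
    print(response.text)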

Exceptions

    import time

    import requests

    # The timeout parameter
    start_time = time.time()
    # Usually set to 2-3 seconds
    response = requests.get(url='https://www.imooc.com/', timeout=2)
    print(response.text)
    end_time = time.time()
    print(end_time - start_time)
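When the timeout is exceeded, requests raises `requests.exceptions.Timeout`, which can be caught so the crawler keeps running. The sketch below is one way to wrap that; `fetch` and `always_times_out` are hypothetical names, and the injected `get` argument exists only so the example can run without a network.

```python
import requests


def fetch(url, timeout=2, get=requests.get):
    """Return the page text, or None if the request fails or times out."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.text
    except requests.exceptions.Timeout:
        print(f'timed out after {timeout}s: {url}')
    except requests.exceptions.RequestException as exc:
        print(f'request failed: {exc}')
    return None


def always_times_out(url, timeout):
    # Hypothetical stand-in for requests.get so the sketch runs offline.
    raise requests.exceptions.Timeout('simulated timeout')


result = fetch('https://www.imooc.com/', get=always_times_out)
print(result)  # None
```

`Timeout` is caught before `RequestException` because it is a subclass of it; the broader handler covers connection errors and bad status codes.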

    import requests

    url = 'http://www.baidu.com'
    response = requests.get(url)
    print(response.status_code)
    print(response.text)  # the output contains garbled characters
    print(response.content.decode())
    print(type(response.content.decode()))  # the type is str
    html_str = response.content.decode()
    print(type(html_str))
    with open('baidu.html', 'w', encoding='utf-8') as f:
        f.write(response.content.decode())
    print(response.url)
    print(response.headers)
    print(response.content)
    print(type(response.content))  # binary data (bytes)
    # response.json() raises an error here because the body is HTML, not JSON
    # print(response.json())
    print(response.request.headers)
cookies

Practice site

http://account.chinaunix.net/login
Step 1: More tools -> Clear browsing data

Online tools

Timestamp conversion: https://tool.lu/timestamp/
Timestamp: 1603521445734. Drop the last 3 digits (the milliseconds) before converting.
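The same conversion can be done in Python instead of the online tool. A minimal sketch using only the standard library; the result is shown in UTC, so local time (for example Beijing time, UTC+8) will differ by the zone offset.

```python
from datetime import datetime, timezone

timestamp_ms = 1603521445734

# The last three digits are milliseconds; split them off.
seconds, millis = divmod(timestamp_ms, 1000)

moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(moment.isoformat())  # 2020-10-24T06:37:25+00:00
print(millis)              # 734
```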
set-cookies

Code exercise

Proxy information

5. XPath basic syntax

What is XPath

Installing XPath Helper

  • More tools -> Extensions -> Chrome Web Store

6. The lxml module

What is the lxml library

Code exercise 1

    from lxml import etree

    # The fragment has no html or body tags
    data = """
    <div>
    <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-3"><a href="link3.html">third item item</a></li>
    <li class="item-4"><a href="link4.html">fourth item</a></li>
    </ul>
    </div>
    """
    # etree.HTML adds the missing html and body tags
    html = etree.HTML(data)
    print(html)
    # Output: <Element html at 0x1db5588>
    print(etree.tostring(html).decode())

Code exercise 2

    from lxml import etree

    # The fragment has no html or body tags
    data = """
    <div>
    <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-3"><a href="link3.html">third item item</a></li>
    <li class="item-4"><a href="link4.html">fourth item</a></li>
    </ul>
    </div>
    """
    # etree.HTML adds the missing html and body tags
    html = etree.HTML(data)
    print(html)
    # Output: <Element html at 0x1db5588>
    print(etree.tostring(html).decode())
    # Select every li element
    print(html.xpath('//li'))
    # Output: [<Element li at 0x291d208>, <Element li at 0x291d1c8>, <Element li at 0x291d2c8>, <Element li at 0x291d308>]
    # Select the class attribute of every li
    print(html.xpath('//li/@class'))
    # Output: ['item-0', 'item-1', 'item-3', 'item-4']
    # A comparison returns a boolean: True if any li has class "item-0"
    print(html.xpath('//li/@class="item-0"'))
    print(html.xpath("//li/@class='item-0'"))
    # Output: True
    # Select the a element whose href is link1.html
    print(html.xpath('//li/a[@href="link1.html"]'))
    # Output: [<Element a at 0x3009348>]
    # XPath indexes start at 1: first a inside the second li
    print(html.xpath('//li[2]/a[1]'))
    # href of the a inside the second-to-last li
    print(html.xpath('//li[last()-1]/a/@href'))
    # Output: ['link3.html']
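In a real crawler the usual next step is to turn each matched element into a record. The sketch below does that for a shortened copy of the exercise data, running relative XPath queries against each `li`; it assumes `lxml` is installed, and the `rows` structure is just one possible layout.

```python
from lxml import etree

data = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
</div>
"""
html = etree.HTML(data)

# Run relative XPath queries (starting with ./) against each li element.
rows = []
for li in html.xpath('//li'):
    rows.append({
        'class': li.get('class'),
        'href': li.xpath('./a/@href')[0],
        'text': li.xpath('./a/text()')[0],
    })

for row in rows:
    print(row)
```

Queries starting with `./` are evaluated relative to the current element, which keeps the fields of one record together.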

7. Installing MongoDB

Install the database
Step 1: check whether the installation succeeded
netstat -an
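The `netstat -an` check looks for a listening port. The same check can be sketched in Python by trying to open a TCP connection. The port 27017 is MongoDB's default and is an assumption here, since the notes do not state which port was used; `is_port_open` is a hypothetical helper.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 27017 is MongoDB's default port; adjust if the server uses another one.
print(is_port_open('127.0.0.1', 27017))
```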

A second way to check whether it succeeded

Step 3: change the IP address
The cfg file was not found in the bin folder; the bind IP address needs to be changed to 0.0.0.0.
Open Services again and restart the database service.

Step 4: check that the address change took effect

8. Connecting to MongoDB with Navicat

Step 1: install
Step 2: configure the database connection
Click Test
Step 3: create a new query
Step 4: run a few query commands
Learning to use cookies