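Topics covered:
Using the `re` module, character encoding, working with lists


Goals:

  1. Scrape the movie titles from one section of the site
  2. Scrape each movie's download link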

# 1. Analyze the site structure

(screenshots: image.png, image.png)

# 2. Analyze the page source

(screenshot: image.png)

```python
import re

import requests

url = "https://www.dydytt.net/index2.htm"
host = "https://www.dydytt.net"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "Referer": "https://www.dydytt.net/",
}
domain = requests.get(url, headers=headers)
domain.encoding = "gb2312"  # character encoding used to decode the response
file = domain.text

# Locate the "最新经典影片推荐" block on the site's home page.
# re.compile builds the pattern once so it can be reused.
obj = re.compile(r"最新经典影片推荐.*?<ul>(?P<url>.*?)</ul>", re.S)

# Match the link to each movie's page inside that block.
obj2 = re.compile(r"<a href='(?P<href>.*?)'>.*?</a><br/>")

# Run the compiled pattern over the downloaded HTML with finditer.
for i in obj.finditer(file):
    block = i.group("url")
    # Search the captured block again with the second pattern.
    for o in obj2.finditer(block):
        full_url = host + o.group("href")  # build the absolute link
        print(full_url)
```


Sample output:

    D:\学习软件工具\pycharm\openstack-api\venv\Scripts\python.exe D:/学习软件工具/pycharm/openstack-api/venv/flavor/pachong002.py
    https://www.dydytt.net/html/gndy/jddy/20211023/61974.html
    https://www.dydytt.net/html/gndy/jddy/20211010/61933.html
    https://www.dydytt.net/html/gndy/jddy/20210820/61757.html
    https://www.dydytt.net/html/gndy/jddy/20210816/61740.html
    https://www.dydytt.net/html/gndy/jddy/20210811/61725.html
    …….
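To try the two-stage match without touching the network, the sketch below runs the same two patterns over a small made-up page fragment. `SAMPLE_HTML`, `RAW_BYTES`, and `extract_links` are hypothetical names introduced here, and the fragment is invented, so the real page may be laid out differently. It also hints at why the `gb2312` setting matters: the outer pattern contains Chinese text, so it can only match when the bytes are decoded with the right codec.

```python
import re

HOST = "https://www.dydytt.net"

# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = (
    "<div>最新经典影片推荐</div>\n"
    "<ul>\n"
    "<a href='/html/gndy/jddy/a/1.html'>Movie A</a><br/>\n"
    "<a href='/html/gndy/jddy/b/2.html'>Movie B</a><br/>\n"
    "</ul>\n"
    "<ul><a href='/other/3.html'>Unrelated</a><br/></ul>\n"
)
# Stand-in for the raw bytes a GB2312 server would send.
RAW_BYTES = SAMPLE_HTML.encode("gb2312")

# Stage 1: capture the <ul> that follows the section heading.
BLOCK_RE = re.compile(r"最新经典影片推荐.*?<ul>(?P<url>.*?)</ul>", re.S)
# Stage 2: capture each href inside the captured block.
LINK_RE = re.compile(r"<a href='(?P<href>.*?)'>.*?</a><br/>")


def extract_links(html):
    """Return absolute links found in the recommended-movies block."""
    links = []
    for block in BLOCK_RE.finditer(html):
        for link in LINK_RE.finditer(block.group("url")):
            links.append(HOST + link.group("href"))
    return links


good = extract_links(RAW_BYTES.decode("gb2312"))   # correct codec
bad = extract_links(RAW_BYTES.decode("latin-1"))   # wrong codec, heading is garbled
print(good)
print(bad)
```

Decoded as GB2312, the fragment yields the two movie links and skips the unrelated list. Decoded as Latin-1, the heading turns into mojibake, the outer pattern never matches, and the result is an empty list.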

Once each movie's page link is available, analyze that page in turn to find its download address.

<a name="qXdQO"></a>
# 3. Analyze the source of a single movie page
After analyzing the page source, match the download address with a regular expression.<br />![image.png](https://cdn.nlark.com/yuque/0/2021/png/23046225/1639726970892-1170a163-3113-4960-afe9-d03debc4d44b.png#clientId=uf3ba4744-c32a-4&from=paste&height=305&id=u2aadf473&margin=%5Bobject%20Object%5D&name=image.png&originHeight=609&originWidth=1914&originalType=binary&ratio=1&size=80881&status=done&style=none&taskId=u420d5b6d-e8ab-48f5-a958-0379056f8d4&width=957)

```python
import re

import requests

url = "https://www.dydytt.net/index2.htm"
host = "https://www.dydytt.net"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "Referer": "https://www.dydytt.net/",
}
domain = requests.get(url, headers=headers)
domain.encoding = "gb2312"
file = domain.text
obj = re.compile(r"最新经典影片推荐.*?<ul>(?P<url>.*?)</ul>", re.S)
obj2 = re.compile(r"<a href='(?P<href>.*?)'>.*?</a><br/>")
# Match the title and download address on a single movie page.
# The title group assumes the name ends at the next HTML tag or line break.
obj3 = re.compile(r'◎片\s*名\s*(?P<movie_name>[^<\r\n]*).*?下载地址2:<a href=(?P<movice_download_url>.*?) target="_blank" title="迅雷电影">', re.S)
# Start with an empty list
movie_url_list = []
for i in obj.finditer(file):
    block = i.group("url")
    for o in obj2.finditer(block):
        full_url = host + o.group("href")
        # print(full_url)
        movie_url_list.append(full_url)  # collect each movie page link in the list
for dl in movie_url_list:  # visit every collected link in turn
    rsp = requests.get(dl, headers=headers)
    rsp.encoding = "gb2312"
    # print(rsp.text)
    # re.search returns the first match on the movie page, or None
    ll = obj3.search(rsp.text)
    if ll is None:
        continue  # skip pages that do not contain the expected fields
    print(ll.group("movie_name"))
    print(ll.group("movice_download_url"))
```

Sample output (the site's pages have since changed, so every scraped download address is the same):

    D:\学习软件工具\pycharm\openstack-api\venv\Scripts\python.exe D:/学习软件工具/pycharm/openstack-api/venv/flavor/pachong002.py
    "https://www.ygdy8.net/"
    "https://www.ygdy8.net/"
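When a site changes its layout like this, a detail-page pattern may also stop matching altogether. `re.search` then returns `None`, and calling `.group()` on `None` raises `AttributeError`. Below is a minimal sketch of a guard, assuming a simplified pattern and invented markup: `DETAIL_RE`, `parse_movie_page`, and both sample strings are hypothetical and not taken from the site.

```python
import re

# Hypothetical, simplified pattern: the title runs up to the next tag,
# and the download address is the first double-quoted href after it.
DETAIL_RE = re.compile(
    r"◎片\s*名\s*(?P<movie_name>[^<\r\n]*).*?"
    r"<a href=\"(?P<download_url>[^\"]*)\"",
    re.S,
)


def parse_movie_page(html):
    """Return (name, url), or None when the page lacks the expected fields."""
    match = DETAIL_RE.search(html)
    if match is None:
        return None
    return match.group("movie_name").strip(), match.group("download_url")


# Invented sample pages: one with the expected fields, one without.
matching = (
    "<p>◎片\u3000\u3000名\u3000Example Movie<br />"
    "◎年\u3000\u3000代\u30002021<br />"
    '<a href="magnet:?xt=example">download</a></p>'
)
changed = "<p>This layout no longer has the expected fields.</p>"

print(parse_movie_page(matching))
print(parse_movie_page(changed))
```

Returning `None` lets the calling loop skip or log a page instead of stopping on the first one that fails to match.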