Approach

Crawl multiple articles from the Woniu site (woniuxy.com).

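  1. Work out the site's URL pattern between pages, usually by watching how the address in the browser changes as you navigate.
  2. Find the corresponding XPath.
  3. Open one article and identify the content you want to scrape.
  4. Combine the code.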
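
Step 1 can be sketched roughly as follows. The listing URL used later in this post ends in `page-1`; this sketch assumes, without having checked it against the site, that further listing pages follow the same `page-N` pattern. The helper name `listing_urls` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.woniuxy.com/"


def listing_urls(first_page, last_page):
    """Build listing-page URLs, assuming the pattern note/page-N holds."""
    return [
        urljoin(BASE_URL, f"note/page-{n}")
        for n in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    # Print the first three listing pages under the assumed pattern.
    for page_url in listing_urls(1, 3):
        print(page_url)
```

Each generated URL could then be fetched in turn in place of the single hard-coded listing page.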

1. First, collect all the article links on the site

Crawling multiple documents on the Woniu site

    # Import the required libraries
    import requests
    from lxml import etree
    from urllib.parse import urljoin

    base_url = "https://www.woniuxy.com/"
    url = 'https://www.woniuxy.com/note/page-1'  # listing page to crawl
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    }
    rsp = requests.get(url, headers=headers)  # send the request to the site
    html = etree.HTML(rsp.text)  # parse the response body into an element tree
    woniu_links = html.xpath('//div[@class="title"]/a')  # select every article link via XPath
    for link in woniu_links:  # loop over the scraped links
        # The scraped hrefs are not complete URLs, so join each one with the base URL
        full_url = urljoin(base_url, link.get('href'))
        print(full_url)

Output:

    D:\学习软件工具\pycharm\openstack-api\venv\Scripts\python.exe D:/学习软件工具/pycharm/openstack-api/venv/flavor/flavor003.py
    https://www.woniuxy.com/note/820
    https://www.woniuxy.com/note/819
    https://www.woniuxy.com/note/818
    .................
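
The link-joining step relies on how `urljoin` resolves an `href` against the base URL. The post does not show the raw `href` attributes, so the sample values below are hypothetical; this is only a small illustration of the three common cases (root-relative, relative, and already absolute).

```python
from urllib.parse import urljoin

base_url = "https://www.woniuxy.com/"

# Hypothetical href values; the raw attributes are not shown in the post.
sample_hrefs = [
    "/note/820",                         # root-relative
    "note/819",                          # relative to the base path
    "https://www.woniuxy.com/note/818",  # already absolute
]

# urljoin leaves absolute URLs alone and resolves the others against base_url.
full_urls = [urljoin(base_url, href) for href in sample_hrefs]

for full_url in full_urls:
    print(full_url)
```

All three forms resolve to a full article URL, which is why the loop can pass whatever `href` it finds straight to `urljoin`.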

2. Scrape each article's author and other metadata


    import re

    import requests
    from lxml import etree

    url = "https://www.woniuxy.com/note/820"
    r = requests.get(url)
    html = etree.HTML(r.text)
    # Take the second element whose class contains "info"
    title_obj = html.xpath("//div[contains(@class, 'info')]")[1]
    # title_obj = html.xpath("/html/body/div[8]/div/div[1]/div[1]/div[3]/@class")
    # The Chinese labels in the pattern match the text on the page:
    # 作者 = author, 类型 = category, 日期 = date, 阅读 ... 次 = read count
    result = re.findall(r'作者:(.*?)\s+类型:(.*?)\s+.*日期:(.*?)\s+阅读:(.*?)次', title_obj.text, re.S)
    (user, category, date, read_num) = result[0]
    print("Author: " + user)
    print("Category: " + category)
    print("Date: " + date)
    print("Read count: " + read_num)

Output:

    D:\学习软件工具\pycharm\openstack-api\venv\Scripts\python.exe D:/学习软件工具/pycharm/openstack-api/venv/pachong/pachong001.py
    Author: 管理员
    Category: 学院动态
    Date: 2021-12-10
    Read count: 225
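
The regular expression can be tried offline before pointing it at the live page. The sketch below reuses the pattern from this section on a hypothetical sample string modeled on the output shown here; the real page text may be laid out differently. The function name `parse_info` is an assumption for illustration, and it returns `None` when nothing matches instead of raising an `IndexError`.

```python
import re

# Same pattern as in the post; re.S lets "." also match newlines.
INFO_PATTERN = re.compile(
    r"作者:(.*?)\s+类型:(.*?)\s+.*日期:(.*?)\s+阅读:(.*?)次",
    re.S,
)


def parse_info(text):
    """Return the article metadata as a dict, or None if the text does not match."""
    match = INFO_PATTERN.search(text)
    if match is None:
        return None
    author, category, date, read_count = match.groups()
    return {
        "author": author,
        "category": category,
        "date": date,
        "read_count": int(read_count),
    }


# Hypothetical sample text, modeled on the output shown in this section.
sample_text = "作者:管理员  类型:学院动态  日期:2021-12-10  阅读:225次"

if __name__ == "__main__":
    print(parse_info(sample_text))
```

Returning a dict keeps the fields labeled, which makes it easier to collect results from many articles once this step is combined with the link-collection loop.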