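Week 3 - Using Web Page Parsers - Web Crawler Course Lab Report

Semester: 2021-2022 academic year, first semester

School	School of Big Data and Intelligent Engineering	Year, major, class	2018 cohort, Data Science and Big Data Technology (upgrade-to-bachelor program), Class 1	Name		Student ID
Experiment title		Storing page and title data from Chinaunix.net

Lab hours: 3h  Group members: 王美琴, 尤博欣, 周青青, 李昕辰  Location: 9317
Lab date:  Grade:  Graded by:  Graded on:
Instructor's comments:

Objective: store the page and title data from Chinaunix.net
Principle: HTTP requests with requests; web page parsing with the BeautifulSoup library (bs4)
Environment: Windows 10, Python 3.9, VS Code, Edge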
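Procedure:
1. Send a web request with Requests to get the page source.
2. Save the returned Chinaunix.net forum page as an HTML file.
3. Parse the thread titles with BeautifulSoup and XPath and write them to text files.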

Core code:

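## BeautifulSoup
import requests
from bs4 import BeautifulSoup

url = 'http://bbs.chinaunix.net/forum-55-1.html'
req = requests.get(url)
# The XPath version of this script decodes the same page as GBK,
# so set the encoding explicitly instead of relying on requests' guess.
req.encoding = "gbk"
html = req.text
bf = BeautifulSoup(html, "lxml")
# Thread title links carry the class "xst"
findall2 = bf.find_all(attrs={"class": "xst"})
# Write one title per line; "with" closes the file even if an error occurs
with open(".w2.txt", "w", encoding="utf-8") as file:
    for i in findall2:
        print(i.get_text())
        file.write(i.get_text() + "\n")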
## XPath
import urllib.request as ur
from lxml import etree

url = "http://bbs.chinaunix.net/forum-55-1.html"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}
# Named "request" rather than "re" so it cannot be confused with the standard re module
request = ur.Request(url, headers=headers)
rp = ur.urlopen(request)
page_source = rp.read().decode("gbk")
# Save the page source as an HTML file
with open("zrh_chinaunix.html", "w", encoding="gbk") as file:
    file.write(page_source)
# Parse the saved file; the name must match the one written above
html = etree.parse("zrh_chinaunix.html", etree.HTMLParser(encoding="gbk"))
result_title = html.xpath('//tbody/tr/th[@class="new"]/a/text()')
print(result_title)
# Write one title per line
with open(".w1.txt", "w", encoding="utf-8") as file:
    for i in result_title:
        file.write(i + "\n")
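
The two extraction methods above can be compared offline on a small sample. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup written for this example (the real forum page may be structured differently), and the helper names titles_with_bs4 and titles_with_xpath are assumptions, not part of the original scripts. It reuses the class "xst" lookup and the XPath expression from the core code.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical sample markup; the real forum page may be structured differently.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th class="new"><a class="xst" href="thread-1-1-1.html">First thread</a></th></tr>
    <tr><th class="new"><a class="xst" href="thread-2-1-1.html">Second thread</a></th></tr>
    <tr><th class="common"><a href="about.html">Not a thread title</a></th></tr>
  </tbody>
</table>
"""


def titles_with_bs4(html):
    """Collect the text of every element whose class is "xst"."""
    soup = BeautifulSoup(html, "lxml")
    return [tag.get_text() for tag in soup.find_all(attrs={"class": "xst"})]


def titles_with_xpath(html):
    """Collect link text under th[@class="new"] using the report's XPath."""
    tree = etree.HTML(html)
    return [str(text) for text in tree.xpath('//tbody/tr/th[@class="new"]/a/text()')]


bs4_titles = titles_with_bs4(SAMPLE_HTML)
xpath_titles = titles_with_xpath(SAMPLE_HTML)
print(bs4_titles)
print(xpath_titles)
```

On this sample both functions return the same two titles and skip the third row, which has neither the "xst" class nor a th with class "new".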

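Results and analysis:

Using the requests library with an added request header and a proxy IP, we requested the page URL, obtained the page source, organized and extracted the data, and finally saved it to an HTML file.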
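
A request header and a proxy can be attached to a requests session roughly as sketched here. This is a minimal sketch under stated assumptions: build_session is a hypothetical helper, the User-Agent string is shortened, and PROXY_URL is a placeholder address, not a proxy used in the experiment. Nothing is sent over the network when the block runs.

```python
import requests


def build_session(user_agent, proxy_url=None):
    """Return a requests.Session with a custom User-Agent and an optional proxy."""
    session = requests.Session()
    # Headers set on the session are sent with every request made through it
    session.headers.update({"User-Agent": user_agent})
    if proxy_url:
        # Route both http and https traffic through the same proxy
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
PROXY_URL = "http://127.0.0.1:8080"  # hypothetical placeholder, replace with a real proxy

session = build_session(USER_AGENT, PROXY_URL)
print(session.headers["User-Agent"])
print(session.proxies)
```

A page would then be fetched with session.get(url, timeout=10), after which response.encoding can be set to "gbk" before reading response.text.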

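Summary:

Through this experiment on storing the Chinaunix.net forum page, the group members learned what XPath and the BeautifulSoup library are and how to use them, covering the whole process from fetching and saving the page to parsing its content and storing the results. We also learned that, as with a selector in BeautifulSoup, simply deleting the predicate part of the path does not work; the approach should be "grab the big first, then the small, and find the loop point".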
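
One way to read "grab the big first, then the small, and find the loop point" is sketched below with lxml: first select the repeating container (each table row), then loop over the rows and run a relative XPath inside each one. SAMPLE_HTML and extract_threads are hypothetical, written only for this illustration, and the row structure is an assumption based on the XPath used in the core code.

```python
from lxml import etree

# Hypothetical sample markup; the real forum page may be structured differently.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th class="new"><a href="thread-1-1-1.html"> First thread </a></th></tr>
    <tr><th class="new"><a href="thread-2-1-1.html">Second thread</a></th></tr>
    <tr><td>Row without a title cell</td></tr>
  </tbody>
</table>
"""


def extract_threads(html):
    """Return a list of {"title", "link"} dicts, one per thread row."""
    tree = etree.HTML(html)
    threads = []
    # "Big" first: the repeating unit (loop point) is each row holding a th.new cell
    for row in tree.xpath('//tbody/tr[th[@class="new"]]'):
        # "Small" next: relative paths (starting with ./) search only inside this row
        title = row.xpath('./th[@class="new"]/a/text()')
        link = row.xpath('./th[@class="new"]/a/@href')
        if title and link:
            threads.append({"title": str(title[0]).strip(), "link": str(link[0])})
    return threads


threads = extract_threads(SAMPLE_HTML)
for thread in threads:
    print(thread["title"], thread["link"])
```

Looping over rows keeps each title paired with its own link, which a single flat XPath over the whole page does not guarantee.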