Week 10: Retrieving Regional Rice Variety Data from the China Rice Data Center - "Crawler Course Lab Report"

Semester: 2021-2022 academic year, first semester

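School	School of Big Data and Intelligent Engineering	Grade, major, class	2018 intake, Data Science and Big Data Technology (junior college to bachelor's upgrade), Class 1	Name		Student ID
Experiment title		Retrieving Regional Rice Variety Data from the China Rice Data Center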

Lab hours: 3 h  Group members: 王美琴, 尤博欣, 周青青, 李昕辰  Location: 9317
Experiment date:  Grade:  Graded by:  Grading time:
Instructor's comments:

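Objective: write a program that retrieves regional rice variety data from the China Rice Data Center.
Principle: HTTP requests (the requests library; the core code below uses the standard-library urllib.request) and the BeautifulSoup library
Environment: Windows 10, Python 3.9, VS Code, Edge
Procedure:
1. Send a web request and obtain the page source
2. Parse the page data with bs4
3. Save the returned rice variety data as a JSON file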
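
Step 3 could be sketched as follows. This is a minimal illustration and not the report's own code (the core code below writes a plain-text file instead). The helper names `save_rows_as_json` and `load_rows_from_json`, the file name `rice_varieties.json`, and the sample rows are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_rows_as_json(rows, path):
    """Write a list of table rows (each a list of strings) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file
        json.dump(rows, f, ensure_ascii=False, indent=2)


def load_rows_from_json(path):
    """Read the rows back, mainly to check that the file is valid JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical placeholder rows standing in for scraped table cells
sample_rows = [["1", "品种A", "2020"], ["2", "品种B", "2021"]]
out_path = os.path.join(tempfile.gettempdir(), "rice_varieties.json")
save_rows_as_json(sample_rows, out_path)
```

Passing `encoding="utf-8"` explicitly avoids depending on the Windows default code page when the rows contain Chinese text.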

Core code:

import urllib.request
from bs4 import BeautifulSoup
# The "lxml" parser used below requires the lxml package to be installed
# Return the index of the n-th occurrence of substr in text,
# or None if text contains fewer than n occurrences
def findSubStrIndex(substr, text, n):
    times = text.count(substr)
    if times == 0 or times < n:
        return None
    index = -1
    for _ in range(n):
        index = text.find(substr, index + 1)
    return index
# Download one listing page and append every table row to the global list `contents`
def find_form(dress):
    response = urllib.request.urlopen(dress)
    data = response.read()
    soup = BeautifulSoup(data, "lxml")
    tag_table_s = soup.find_all("table")
    # Note: <tbody> could not be found in the parsed page,
    # so the rows are read directly from the second table
    #tag_tbody_s = tag_table_s[1].find_all("tbody")
    tag_tr_s = tag_table_s[1].find_all("tr")
    for tag_tr in tag_tr_s:
        content = []
        tag_td_s = tag_tr.find_all("td")
        for tag_td in tag_td_s:
            content.append(tag_td.text.strip())
        contents.append(content)
        print(content)
dress = 'https://www.ricedata.cn/variety/index.htm'
response = urllib.request.urlopen(dress)
data = response.read()
soup = BeautifulSoup(data, "lxml")
# The links to follow sit in a <div> that is identified only by its inline style
tag = soup.find("div", attrs={"style": "padding: 0 10px 0 10px; text-align:justify;text-justify:distribute-all-lines;text-align-last:justify; font-size:10pt; line-height:25px; font-weight:bold"})
tag_a_s = tag.find_all("a")
contents = []  # rows collected by find_form; start empty so no blank row is written
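for tag_a in tag_a_s:
    try:
        d = tag_a["href"]
    except KeyError:
        print("No href!")
        continue  # skip this anchor; otherwise d would be undefined or stale
    dress_1 = dress[:-9] + d  # drop "index.htm" (9 characters) and append the relative link
    # Pagination
    response = urllib.request.urlopen(dress_1)
    data = response.read()
    soup = BeautifulSoup(data, "lxml")
    # "跳至末页" ("jump to last page") is the title attribute of the last-page link
    tag_a_paper = soup.find("a", attrs={"title": "跳至末页"})
    if tag_a_paper is not None:
        paper = tag_a_paper["href"]
        # The digits in the last-page link give the total number of pages
        number = "".join(filter(str.isdigit, paper))
        xb = findSubStrIndex("_", dress_1, 1)
        for i in range(1, int(number) + 1):
            dress_4 = dress_1[:xb+1] + str(i) + ".htm"
            # Scrape this page
            print(dress_4)
            find_form(dress_4)
    else:
        # No last-page link: follow the links listed in the table caption instead
        tag_caption = soup.find("caption")
        if tag_caption is None:
            continue
        xb = findSubStrIndex("/", dress_1, 5)
        for tag_sub_a in tag_caption.find_all("a"):
            dress_3 = dress_1[:xb+1] + tag_sub_a["href"]
            # Scrape this page
            print(dress_3)
            find_form(dress_3)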
# Write one row per line, with cells separated by three spaces
with open("B.txt", "w", encoding="utf-8") as fo:
    for content in contents:
        for i in content:
            fo.write(i + "   ")
        fo.write("\n")
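
The manual string slicing used above to build page URLs (`dress[:-9]` and `findSubStrIndex`) can also be expressed with the standard library's `urllib.parse.urljoin`. The sketch below is an alternative, not the report's method. The relative path `region/list_1.htm` and the helper name `page_urls` are hypothetical, and the page naming pattern `<prefix>_<n>.htm` is assumed from how the code above rebuilds page addresses.

```python
from urllib.parse import urljoin

BASE = "https://www.ricedata.cn/variety/index.htm"


def page_urls(first_page_url, last_page):
    """Build URLs for pages 1..last_page, assuming names like '<prefix>_<n>.htm'."""
    # Keep everything up to and including the first underscore,
    # mirroring findSubStrIndex("_", dress_1, 1) in the code above
    prefix = first_page_url[: first_page_url.index("_") + 1]
    return [f"{prefix}{i}.htm" for i in range(1, last_page + 1)]


# urljoin resolves a relative href against the page it was found on
region_url = urljoin(BASE, "region/list_1.htm")  # hypothetical relative link
pages = page_urls(region_url, 3)
```

`urljoin` replaces the last path segment of the base URL, so it does not depend on `index.htm` being exactly nine characters long.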

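Results and analysis:

The pages were requested by URL with the requests library, with request headers added and a proxy IP in use, to obtain the HTML source of each page. BeautifulSoup (bs4) was then used to clean up and extract the data, which was finally saved to a JSON file.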
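
Adding a request header and routing traffic through a proxy might look like the sketch below. It uses the standard-library `urllib.request` (as the core code does) rather than requests, and it is an illustration under assumptions: the User-Agent string, the proxy address `http://127.0.0.1:8080`, and the helper names `build_request` and `build_opener_with_proxy` are placeholders, not values taken from the experiment.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def build_opener_with_proxy(proxy_url):
    """Return an opener that routes http and https traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


req = build_request("https://www.ricedata.cn/variety/index.htm", "Mozilla/5.0 (example)")
opener = build_opener_with_proxy("http://127.0.0.1:8080")  # placeholder proxy address
# data = opener.open(req, timeout=10).read()  # uncomment to perform the actual request
```

The network call is left commented out so that the sketch can be read and run without contacting the site or a real proxy.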

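Summary:

When crawling web pages, a site may apply anti-crawling measures, for example checking whether a request carries headers or whether the number of visits from one IP exceeds the site's limit. These measures need to be taken into account when crawling, and matching countermeasures applied, in order to obtain the data.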
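
One common way to stay within a site's access limits is to pause between attempts and retry failed requests. The following is a minimal sketch of that idea and not something the experiment is known to have used. The helper name `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands for any function that downloads a URL, such as a wrapper around `urllib.request.urlopen`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait and try again, up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            last_error = err
            if attempt < retries:
                sleep(delay * attempt)  # wait longer after each failure
    raise last_error
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.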