Reference post: https://blog.csdn.net/Morgan_fang1/article/details/95165316
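Tag meanings
`<table>`: defines a table
`<thead>`: defines the table's header section
`<tbody>`: defines the table's body
`<tr>`: defines a table row
`<th>`: defines a header cell
`<td>`: defines a table data cell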
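To make the roles of these tags concrete, here is a minimal sketch that walks a tiny table with Python's standard-library `html.parser`. This is not one of the libraries used in the two methods below; the sample markup and the names `TableReader` and `read_table` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, invented for this illustration.
SAMPLE = """
<table>
  <thead>
    <tr><th>Name</th><th>Score</th></tr>
  </thead>
  <tbody>
    <tr><td>Ann</td><td>90</td></tr>
    <tr><td>Bob</td><td>85</td></tr>
  </tbody>
</table>
"""


class TableReader(HTMLParser):
    """Collects header cells (<th>) and body rows (<tr> built from <td>)."""

    def __init__(self):
        super().__init__()
        self.headers = []   # text of every <th>
        self.rows = []      # one list of <td> texts per <tr>
        self._row = None    # cells of the <tr> currently open
        self._cell = None   # "th" or "td" while inside a cell
        self._text = []     # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell == tag:
            text = "".join(self._text).strip()
            if tag == "th":
                self.headers.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr":
            if self._row:   # the header row holds no <td>, so it is skipped
                self.rows.append(self._row)
            self._row = None


def read_table(html):
    reader = TableReader()
    reader.feed(html)
    return reader.headers, reader.rows


headers, rows = read_table(SAMPLE)
print(headers)  # ['Name', 'Score']
print(rows)     # [['Ann', '90'], ['Bob', '85']]
```

The sketch only separates `<th>` from `<td>`; it does not handle nested tables or merged cells.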
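Method 1: BeautifulSoup + pd.read_html

    from io import StringIO

    import pandas as pd
    from bs4 import BeautifulSoup

    r = self.session.get(url_d, params=params_d, headers=self.headers)
    # Parse the page
    soup = BeautifulSoup(r.text, 'lxml')
    # select() takes a CSS selector and returns a list; [0] takes the first match as a bs4 Tag
    content = soup.select('body > div > div > div > div > div.ibox-content > table')[0]
    # Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
    tbl = pd.read_html(StringIO(content.prettify()), header=0)[0]
    tbl.to_excel('test.xlsx', index=False)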
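Drawback: this method copes poorly with pages where the first row of the table is not the header row, or where the table contains merged (centered) cells.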
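One way to work around both problems is to expand the merged cells into a plain rectangular grid first and then choose the header row yourself. The following is only a sketch under stated assumptions: it uses the standard-library `html.parser` instead of BeautifulSoup, the sample table is invented, and `GridParser` and `expand_merged_cells` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hypothetical sample: a merged title row, then the real header, then data.
SAMPLE = """
<table>
  <tr><td colspan="3">Monthly report</td></tr>
  <tr><th>Region</th><th>Item</th><th>Qty</th></tr>
  <tr><td rowspan="2">North</td><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
"""


class GridParser(HTMLParser):
    """Reads <td>/<th> cells row by row, keeping their colspan and rowspan."""

    def __init__(self):
        super().__init__()
        self.rows = []      # each row: list of (text, colspan, rowspan)
        self._row = None
        self._cell = None   # [text_parts, colspan, rowspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = [[], int(a.get("colspan") or 1), int(a.get("rowspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, colspan, rowspan = self._cell
            self._row.append(("".join(parts).strip(), colspan, rowspan))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def expand_merged_cells(html):
    """Return a rectangular list of rows, repeating the text of merged cells."""
    parser = GridParser()
    parser.feed(html)
    grid = {}                                   # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, colspan, rowspan in cells:
            while (r, c) in grid:               # slot already filled by a rowspan above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


grid = expand_merged_cells(SAMPLE)
title_row, header, *data = grid   # here the header is the second row, not the first
print(header)  # ['Region', 'Item', 'Qty']
print(data)    # [['North', 'A', '1'], ['North', 'B', '2']]
```

With the grid flattened like this, `header` and `data` can be handed to `pd.DataFrame(data, columns=header)`. Which row actually holds the header has to be decided per page.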
Method 2: etree.HTML + xpath

    from lxml import etree
    import pandas as pd

    # r is the response obtained as in Method 1
    html = etree.HTML(r.content.decode('utf-8'))
    # Locate the table with XPath. The page has only one table, so //table is enough.
    # With several tables, select the second one with (//table)[2]:
    # XPath positions start at 1, not 0.
    # The table's header information is not taken, so reading starts from tr[3];
    # xpath() returns a list.
    table = html.xpath("//table/tbody")[0]

    # Summary fields in the first row: the label is in <b>, the value in the next cell
    short_info = {
        ''.join(table.xpath(".//tr[1]/td[2]/b/text()")): ''.join(table.xpath(".//tr[1]/td[3]/text()")),
        ''.join(table.xpath(".//tr[1]/td[4]/b/text()")): ''.join(table.xpath(".//tr[1]/td[5]/text()")),
        ''.join(table.xpath(".//tr[1]/td[6]/b/text()")): ''.join(table.xpath(".//tr[1]/td[7]/text()")),
        ''.join(table.xpath(".//tr[1]/td[8]/b/text()")): ''.join(table.xpath(".//tr[1]/td[9]/text()")),
    }

    # Another way to get the text (use one or the other). When a value is empty,
    # indexing the result with [0] raises IndexError; the ''.join() form above does not.
    short_info = {
        table.xpath(".//tr[1]/td[2]/b/text()")[0]: table.xpath(".//tr[1]/td[3]/text()")[0],
        table.xpath(".//tr[1]/td[4]/b/text()")[0]: table.xpath(".//tr[1]/td[5]/text()")[0],
        table.xpath(".//tr[1]/td[6]/b/text()")[0]: table.xpath(".//tr[1]/td[7]/text()")[0],
        table.xpath(".//tr[1]/td[8]/b/text()")[0]: table.xpath(".//tr[1]/td[9]/text()")[0],
    }

    # Header row: the third <tr>
    cols = html.xpath("//table/tbody/tr[3]/td")
    cols = [''.join(x.xpath(".//text()")) for x in cols]

    # Data rows: from the fourth <tr> onward (Python index 3)
    data = []
    rows = html.xpath("//table/tbody/tr")[3:]
    for row in rows:
        cells = row.xpath(".//td")
        r_data = []
        for cell in cells:
            val = ''.join(cell.xpath(".//text()"))
            val = val.replace('\n', "").strip()  # remove newlines and surrounding spaces
            try:
                val = float(val)   # keep numeric cells as numbers
            except ValueError:
                pass               # leave non-numeric cells as text
            r_data.append(val)
        data.append(r_data)

    # Build the DataFrame
    df = pd.DataFrame(data=data, columns=cols)
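The two points made in the comments above (joining an XPath result is safe on an empty list while `[0]` is not, and cells are converted to numbers only when possible) can be isolated in two small helpers. This is a sketch in plain Python with no lxml dependency; `first_text` and `to_number` are hypothetical names, and the lists stand in for what `xpath(".../text()")` would return.

```python
def first_text(nodes, default=""):
    """Join the strings of an XPath text() result.

    An empty result list yields `default` instead of raising IndexError,
    which is what nodes[0] would do.
    """
    return "".join(nodes).strip() or default


def to_number(val):
    """Strip newlines and spaces, then return a float if the text parses as one."""
    cleaned = val.replace("\n", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return cleaned


empty_result = []                       # stands in for an XPath query that matched nothing
print(repr(first_text(empty_result)))   # ''
try:
    empty_result[0]
except IndexError:
    print("indexing an empty result raises IndexError")

print(to_number(" 12.5\n"))   # 12.5
print(to_number("N/A"))       # N/A
```

Note that `float()` also accepts strings such as "nan" and "inf", so a column that may contain those words as ordinary text needs a stricter check.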