3.2 Scraping Dynamic Data from 不凡商业 (Bufan Business) with Python - Python Web Crawlers


【Lab Objectives】

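1. Retrieve the article titles, title images, and publication times from 不凡商业.

2. Create a Python project in PyCharm.

3. Learn the basic use of the requests module.

4. Learn how to handle data in JSON format.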

【Lab Principles】

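1. The user enters a URL (assume it is an HTML page visited for the first time). The browser sends a request to the server, and the server returns the HTML file.
2. The browser parses the HTML line by line, starting from the head tag. When it meets a link tag, it requests the CSS file from the server. These downloads are asynchronous, so several CSS files can load at the same time.
3. When the parser meets a script tag, it loads the script (if it is an external js file) and executes it immediately. By default, that is without the async or defer attribute, script loading is synchronous.
4. On reaching the body tag, the browser starts rendering the page, handling DOM elements in order from top to bottom. An img tag triggers an asynchronous request for the image file, and the browser keeps rendering while the image loads.
5. If a DOM node or the size of an element changes, the browser has to go back and re-render the affected part of the page.
6. Unlike CSS files, js loads in a blocking way: while the browser is executing js code it does nothing else, and it resumes rendering only after the script has finished. This is why scripts are usually placed at the bottom of the page.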
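
The loading sequence above can be sketched with the standard-library HTML parser. This is only an illustration, not how a real browser is implemented: the class name `ResourceCollector` and the sample page are hypothetical, and the parser merely records, in document order, the extra files a browser would go on to request.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Record, in document order, the sub-resources a browser would request."""

    def __init__(self):
        super().__init__()
        self.resources = []  # list of (kind, url) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(("css", attrs["href"]))
        elif tag == "script" and attrs.get("src"):
            self.resources.append(("js", attrs["src"]))
        elif tag == "img" and attrs.get("src"):
            self.resources.append(("img", attrs["src"]))


# Hypothetical page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="/static/site.css">
    <script src="/static/app.js"></script>
  </head>
  <body>
    <img src="/static/logo.png">
    <div id="articles"></div>
  </body>
</html>
"""

collector = ResourceCollector()
collector.feed(SAMPLE_HTML)
for kind, url in collector.resources:
    print(kind, url)
```

Note that the `articles` container in the sample is empty. On pages of this kind the content is filled in later by a script that requests data from the server, so it does not appear in the HTML file itself. That is the situation this lab deals with by calling the data interface directly.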

【Lab Environment】

Linux Ubuntu 16.04
Python 3.5
PyCharm

【Lab Steps】

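1. Double-click to open PyCharm and create a Python project.

2. Name the project BuFan, choose where to store it, and select the interpreter to use.

3. Right-click BuFan, choose New => Python File, and create a Python file named bufan.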

4. The target address for this lab is:

https://www.xfz.cn/

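5. Press F12 to open the browser's developer tools, go to the Network panel, and select XHR. Click the requests that appear there to inspect them, and find the address that supplies the article data: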

https://www.xfz.cn/api/website/articles/?p=3&n=20&type=
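
The query string of this address can be assembled from its parts, which makes it easier to request other pages later. The sketch below is hedged: the helper name `build_page_url` is hypothetical, and reading `p` as the page number and `n` as the number of items per page is an assumption based on the URL, not something the site documents here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://www.xfz.cn/api/website/articles/"


def build_page_url(page, page_size=20, article_type=""):
    """Return the interface URL for one page of results.

    Assumption: `p` is the page number and `n` is the page size.
    """
    # A list of pairs keeps the parameters in a fixed order.
    query = urlencode([("p", page), ("n", page_size), ("type", article_type)])
    return "{}?{}".format(API_ENDPOINT, query)


url = build_page_url(3)
print(url)
# Split the URL again to check which parameters it carries.
print(parse_qs(urlsplit(url).query, keep_blank_values=True))
```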

6. The code below requests the dynamic data interface and saves the response for analysis.

# bufan_requests.py
__author__ = 'zhangyu'
import json
import requests

# Send an HTTP GET request to the article interface.
response = requests.get('https://www.xfz.cn/api/website/articles/?p=3&n=20&type=')
json_dict = response.json()  # parse the JSON response body into a dict
json_str = json.dumps(json_dict)  # dict -> JSON string
with open('bufan.json', 'w', encoding='utf-8') as fp:  # save the string to a file
    fp.write(json_str)
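
One detail of `json.dumps` is worth seeing when the saved file is opened by hand: by default it escapes non-ASCII characters, so Chinese titles appear as `\uXXXX` sequences. The following is a small sketch with a made-up record; only the `data` and `title` keys come from the lab code, and the title value is invented for illustration.

```python
import json

# Made-up record shaped like part of the response, for illustration only.
record = {"data": [{"title": "不凡商业"}]}

escaped = json.dumps(record)                       # default: non-ASCII is escaped
readable = json.dumps(record, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)

# Both strings decode back to the same dictionary.
print(json.loads(escaped) == json.loads(readable) == record)
```

Either form is valid JSON and loads back to the same data. Passing `ensure_ascii=False` is optional and only makes the file easier to read in an editor.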

7. Read the JSON file back, examine its structure, and pull out the data you want. Open the file as UTF-8 so that the text is decoded correctly.

import json

with open('bufan.json', 'r', encoding='utf-8') as fp:
    json_data = fp.read()  # read the whole file as text
json_dict = json.loads(json_data)  # JSON string -> dict
for article_msg in json_dict['data']:
    print(article_msg['title'])
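
The objectives also mention the title image and the publication time. A hedged sketch of extracting several fields at once follows. The lab code only confirms the `data` and `title` keys, so the key names `thumbnail` and `time` used here are hypothetical placeholders, as is the helper `extract_fields`. Check the saved bufan.json for the real key names before relying on them.

```python
def extract_fields(json_dict, image_key="thumbnail", time_key="time"):
    """Return (title, image, publish_time) tuples for every article.

    `image_key` and `time_key` are hypothetical defaults, not confirmed
    key names of the real response.
    """
    rows = []
    for article in json_dict.get("data", []):
        rows.append((
            article.get("title"),
            article.get(image_key),  # None when the key is absent
            article.get(time_key),
        ))
    return rows


# Made-up sample in the assumed shape, for illustration only.
sample = {
    "data": [
        {"title": "First article",
         "thumbnail": "https://example.com/a.png",
         "time": "2019-01-01T08:00:00"},
        {"title": "Second article"},
    ]
}

for title, image, published in extract_fields(sample):
    print(title, image, published)
```

Using `dict.get` instead of indexing means a missing key yields `None` rather than stopping the loop with a `KeyError`.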

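8. Fetch the data returned by the first 10 requests. Wrap the request in a loop over the page numbers and save the JSON string of each page to its own file.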

import json
import random
import time
import requests

wait_times = [10, 20, 13, 14, 18]  # candidate delays, in seconds
base_url = 'https://www.xfz.cn/api/website/articles/?p={}&n=20&type='
for index in range(1, 11):  # page numbers 1 to 10
    print('Downloading page %s, please wait a moment...' % index)
    response = requests.get(base_url.format(index))
    time.sleep(random.choice(wait_times))  # pause for a randomly chosen delay
    json_dic = response.json()
    json_str = json.dumps(json_dic)
    # Write inside the loop so that every page gets its own file.
    with open('bufan_data_%s.json' % index, 'w', encoding='utf-8') as fp:
        fp.write(json_str)
    print('Page %s download finished.' % index)

9. The complete code is:

import json
import random  # random module
import time
import requests

wait_times = [10, 20, 13, 14, 18]  # candidate delays, in seconds
base_url = 'https://www.xfz.cn/api/website/articles/?p={}&n=20&type='
for index in range(1, 11):  # page numbers 1 to 10
    print('Downloading page %s, please wait a moment...' % index)
    response = requests.get(base_url.format(index))
    time.sleep(random.choice(wait_times))  # pause for a randomly chosen delay
    json_dic = response.json()
    json_str = json.dumps(json_dic)
    # Write inside the loop so that every page gets its own file.
    with open('bufan_data_%s.json' % index, 'w', encoding='utf-8') as fp:
        fp.write(json_str)
    print('Page %s download finished.' % index)
# Right-click and choose Run to execute the program.
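
Once the ten files exist, they can be read back and combined. The following is a minimal sketch under stated assumptions: the helper name `load_all_titles` is hypothetical, the files are assumed to follow the `bufan_data_<page>.json` naming used above, and each file is assumed to hold a `data` list whose entries have a `title` key. The demonstration writes made-up data into a temporary directory so it runs without network access.

```python
import glob
import json
import os
import tempfile


def load_all_titles(directory="."):
    """Collect the titles from every bufan_data_<page>.json file in `directory`."""

    def page_number(path):
        # 'bufan_data_10.json' -> 10, so page 10 sorts after page 9.
        name = os.path.basename(path)
        return int(name[len("bufan_data_"):-len(".json")])

    titles = []
    pattern = os.path.join(directory, "bufan_data_*.json")
    for path in sorted(glob.glob(pattern), key=page_number):
        with open(path, "r", encoding="utf-8") as fp:
            page = json.load(fp)
        titles.extend(article["title"] for article in page["data"])
    return titles


# Demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for page in (1, 2, 10):
        file_path = os.path.join(tmp, "bufan_data_%s.json" % page)
        with open(file_path, "w", encoding="utf-8") as fp:
            json.dump({"data": [{"title": "title from page %s" % page}]}, fp)
    demo_titles = load_all_titles(tmp)

print(demo_titles)
```

Sorting by the numeric page rather than by file name matters because a plain string sort would place `bufan_data_10.json` before `bufan_data_2.json`.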

【Lab Results】

This concludes the lab. Practice repeatedly after class until you are comfortable handling JSON strings.