URL: http://shanzhi.spbeen.com/index/
I. Site analysis
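1. The site requires a login.
2. After logging in, fetch the page source.
3. Find the font file, work out its character substitution mapping, and apply that mapping to the scraped data to recover the real values.

How to find the font file:
1. Locate an element affected by the font obfuscation and find font-family in its styles.
2. Copy the font-family value and search for it in the page source.
3. Near the search result, look for a URL of the form xxx.ttf and download it.

To read the contents of the font file from Python, install fontTools:
pip install fontTools -i https://pypi.tuna.tsinghua.edu.cn/simple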
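The font-lookup steps above can be sketched in Python. This is only a minimal sketch under stated assumptions: the sample HTML, the font-family name `myfont`, the path `/static/fonts/demo.ttf`, and the helper names `find_font_urls` and `load_cmap` are all made up for illustration and are not taken from the site. `load_cmap` relies on fontTools' `TTFont.getBestCmap()`, which returns a mapping from Unicode code points to glyph names.

```python
import re
from urllib.parse import urljoin


def find_font_urls(html, base_url):
    """Return absolute URLs for every quoted .ttf path found in the page source."""
    # Capture a quoted path ending in .ttf, e.g. url('/static/fonts/demo.ttf')
    pattern = re.compile(r"""["'(]([^"'()\s]+?\.ttf)["')]""")
    return [urljoin(base_url, path) for path in pattern.findall(html)]


def load_cmap(font_path):
    """Read a downloaded .ttf file and return its {code point: glyph name} map."""
    # Imported lazily so the URL search above works without fontTools installed
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    return font.getBestCmap()


# Hypothetical page fragment; the real site's markup may differ
sample_html = """
<style>
@font-face { font-family: "myfont"; src: url('/static/fonts/demo.ttf'); }
.price { font-family: "myfont"; }
</style>
"""

font_urls = find_font_urls(sample_html, "http://shanzhi.spbeen.com/index/")
print(font_urls)
```

Once a font URL is found, the file could be downloaded with `requests.get(font_url).content`, written to disk, and passed to `load_cmap`.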
II. Implementation
1. Login
from fontTools.ttLib import TTFont
from lxml import etree
from tools import get_js
import requests

# Fetch the page source
url = 'http://shanzhi.spbeen.com/login/'
# Request headers
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
    'Cookie': 'shanzhi_kmer=h9wm0ptr9kcuza527as3a43a6zuwzsid; csrftoken=yHdq0AaPEO1pyiC2zji4MmeyLRNcXVcZLNoHCT3izQ52lwvASvHv0jgsGG9kEXUN'
}
# First extract csrfmiddlewaretoken and pk from the page source
# Tip: whenever you see a token, check the page source for it first
response_obj = requests.get(url, headers=header)
# print(response_obj.text)
# Parse the HTML into an element tree
tree = etree.HTML(response_obj.text)
# Get csrfmiddlewaretoken
csrfmiddlewaretoken = tree.xpath('//input[@name="csrfmiddlewaretoken"]/@value')[0]
# Get pk; pk is the public key
pk = tree.xpath('//input[@id="pk"]/@value')[0]
# pk = "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDaP+rYm6rqTMP565UmMU6YXq46KtAN3zwDSO8LNa15p0lJfsaY8jXY7iLsZqQZrGYr2Aayp6hYZy+Q+AMB/VUiSpD9ojPyOQ7r9jsf9jZbTOL4kj6iLZn37fEhp4eLvRgy5EJCyQoFyLCsgLechBTlYl2eA95C3j4ZUFhiV6WFHQIDAQAB"
old_password = 'logic_00'  # plaintext password
# The password is encrypted in a JavaScript environment.
# get_js, imported from tools.py, executes the JS function.
password = get_js('./shanzhi.js', 'doLogin', old_password, pk)
data_dict = {
    'username': 'logic_00',
    'password': password,
    'csrfmiddlewaretoken': csrfmiddlewaretoken
}
res = requests.post(url, headers=header, data=data_dict)
# res.text is the source on which the replacements will be performed
html = res.text
# tools.py
import execjs


def get_js(file_name, fun_name, *args):
    with open(file_name, 'r', encoding='utf-8') as file_obj:
        js_code = file_obj.read()
    # 1. Compile the JS file
    cjs = execjs.compile(js_code)
    # 2. Call the JS function
    return cjs.call(fun_name, *args)
// shanzhi.js
// Requires node-jsencrypt. To install it:
// 1. cd into the project directory
// 2. Set up a domestic npm mirror: npm install -g cnpm --registry=https://registry.npm.taobao.org
// 3. Install: cnpm install node-jsencrypt
const JSEncrypt = require('node-jsencrypt');

function doLogin(pass_old, pk) {
    var password_old = pass_old;
    var encrypt = new JSEncrypt();
    var public_key = pk;
    encrypt.setPublicKey(public_key);
    var pass_new = encrypt.encrypt(password_old);
    return pass_new;
}
2. Recover the real data using the replacement dictionary
for k, v in replace_dict.items():
    # str.replace() needs a str as its second argument; passing an int raises
    # TypeError: replace() argument 2 must be str, not int
    html = html.replace(k, str(v))
print(html)
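The loop above assumes that `replace_dict` already exists. The following is a minimal sketch of how such a dictionary might be built from a font's cmap, not the author's actual method. It assumes that the obfuscated characters appear in the HTML as hex entities such as `&#xe101;` and that each glyph name can be mapped to a real character through a hand-made table. The cmap, the glyph-name table, and the helper names `build_replace_dict` and `apply_replacements` are hypothetical sample data, not values read from the site's font.

```python
def build_replace_dict(cmap, glyph_to_char):
    """Map HTML hex entities to real characters via the font's glyph names."""
    replace_dict = {}
    for code_point, glyph_name in cmap.items():
        if glyph_name in glyph_to_char:
            # Assumption: the page writes the character as a lowercase hex entity
            entity = "&#x{:x};".format(code_point)
            replace_dict[entity] = glyph_to_char[glyph_name]
    return replace_dict


def apply_replacements(html, replace_dict):
    """Substitute every obfuscated entity in the HTML with its real character."""
    for k, v in replace_dict.items():
        html = html.replace(k, str(v))
    return html


# Hypothetical cmap, shaped like the dict returned by TTFont.getBestCmap()
sample_cmap = {0xE101: "seven", 0xE102: "three", 0xE103: "nine"}
# Hypothetical glyph-name table; a real font may use entirely different names
glyph_to_char = {"three": "3", "seven": "7", "nine": "9"}

sample_html = "<span>&#xe101;&#xe102;&#xe103;</span>"
replace_dict = build_replace_dict(sample_cmap, glyph_to_char)
restored = apply_replacements(sample_html, replace_dict)
print(restored)
```

If the page embeds the raw private-use characters instead of entities, the keys would need to be `chr(code_point)` rather than the entity string.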
