I wrote a simple crawler that scrapes the SRC (Security Response Center) list from Anquanke (安全客).
Code:
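import requests
import re

url = "https://www.anquanke.com/src"

def src(url):
    # Download a page and return its body decoded as UTF-8
    res = requests.get(url).content.decode('utf-8')
    return res

# Collect the numeric ID of every SRC detail page linked from the list page
code = re.findall(r'<a target="_blank" rel="noopener noreferrer" href="/src/(\d+)">', src(url))
for codes in code:
    res2 = src("https://www.anquanke.com/src/" + codes)
    # The SRC name is the part of the page title before " - 安全客"
    srcname = re.findall(r'<title>(.*) - 安全客', res2)
    # The link sits between the "网址" (URL) and "漏洞提交入口" (submission entry) headings;
    # the markup spans several lines, so "." must also match newlines
    urladdress = re.findall(r'<h2>网址.*href="(.*?)">.*<h2>漏洞提交入口</h2>', res2, flags=re.DOTALL)
    str_srcname = "".join(srcname)
    str_urladdress = "".join(urladdress)
    print("Name: {0}, URL: {1}".format(str_srcname, str_urladdress))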
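Result:
What I learned:
By default, "." in a regular expression does not match newline characters. Passing "flags=re.DOTALL" to the call makes it match them too.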
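
To make the difference concrete, here is a minimal sketch on a made-up HTML snippet (the string, the headings and the example.com link are assumptions for illustration, not taken from the real page). The same pattern is run once without the flag and once with it:

```python
import re

# Hypothetical markup: the headings and the link are on different lines
html = '<h2>URL</h2>\n<a href="https://example.com">site</a>\n<h2>Submit</h2>'
pattern = r'<h2>URL.*href="(.*?)">.*<h2>Submit</h2>'

# Without the flag, "." stops at the first newline, so nothing matches
without_flag = re.findall(pattern, html)

# With re.DOTALL, "." also matches "\n" and the pattern can span lines
with_flag = re.findall(pattern, html, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['https://example.com']
```

The inline form "(?s)" at the start of the pattern has the same effect as passing "flags=re.DOTALL".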