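## Libraries used
- requests
- json

## How it works
- Import the required libraries
- Enumerate every candidate Zhihu question link by iterating over question IDs
- Request each link and check whether the status code is 200
- If it is 200, print the link
- Save the link to a txt file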
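
The steps above can be sketched as three small pieces: a lazy URL generator, a filter on the status code, and a writer. This is only an illustration under some assumptions: the names `question_urls`, `existing_links`, `save_links` and `http_status` are hypothetical and do not appear in the original script, and the status check is passed in as a function so the filtering logic can be tried without touching the network.

```python
import json
from itertools import count, islice

BASE_URL = "https://www.zhihu.com/question/"  # URL prefix used in the article


def question_urls(start):
    """Lazily yield question URLs for IDs start, start + 1, start + 2, ..."""
    for question_id in count(start):
        yield BASE_URL + str(question_id)


def existing_links(urls, get_status):
    """Yield only the URLs for which get_status(url) returns 200."""
    for url in urls:
        if get_status(url) == 200:
            yield url


def save_links(links, path):
    """Append each link to a text file, one JSON-encoded string per line."""
    with open(path, "a", encoding="utf-8") as f:
        for link in links:
            f.write(json.dumps(link, ensure_ascii=False) + "\n")


def http_status(url):
    """Real status check. requests is imported here so the rest runs without it."""
    import requests
    return requests.get(url, timeout=10).status_code
```

A possible way to wire the pieces together for the first 100 IDs would be `save_links(existing_links(islice(question_urls(19550225), 100), http_status), "links.txt")`. Because every stage is a generator, no list of URLs is ever held in memory.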

## Full code
```python
import json

import requests


def get_links():
    # Build the complete list of candidate question URLs up front.
    links = []
    number = 19550224
    while number < 900000000:
        number = number + 1
        url = 'https://www.zhihu.com/question/' + str(number)
        links.append(url)
    return links


def write_to_file(content):
    # Append one JSON-encoded link per line; the with block closes the file.
    with open('19550224.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main():
    links = get_links()
    for link in links:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
        response = requests.get(link, headers=headers)
        # A 200 status code means the question page exists.
        if response.status_code == 200:
            print(link)
            write_to_file(link)


if __name__ == '__main__':
    main()
```

## Other notes
You need to know the address of the first Zhihu question. Alternatively, you can start iterating from 10000000 and find it yourself.

[https://www.zhihu.com/question/19550225](https://www.zhihu.com/question/19550225)

This is the first question on Zhihu; its ID is 19550225.

## Optimized version
The status check now happens inside the loop that generates the numbers, so the script no longer has to build the full list of `number` values (hundreds of millions of URLs) before it sends the first request. This is much more efficient.

In addition, `headers` is moved to module level, so it is not rebuilt on every iteration.

```python
import json

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}


def write_to_file(content):
    # Append one JSON-encoded link per line; the with block closes the file.
    with open('zhihu.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main():
    number = 61158073
    # The condition is always true, so the loop runs until it is interrupted.
    while number >= 61158073:
        number = number + 1
        url = 'https://www.zhihu.com/question/' + str(number)
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print(url)
            write_to_file(url)


if __name__ == '__main__':
    main()
```
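
Sharing the headers can be taken one step further. The sketch below is not part of the original script: it assumes a `requests.Session`, which carries the headers on every request and reuses the underlying connection, and it adds a timeout and basic error handling so that a single network failure does not stop the scan. The names `make_session` and `scan`, the shortened `User-Agent` string, and the `delay` and `timeout` defaults are all hypothetical choices made for illustration.

```python
import time

HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder, not the article's full string


def make_session():
    """Create a requests.Session that sends HEADERS with every request."""
    import requests  # imported here so scan() can be tried without requests installed
    session = requests.Session()
    session.headers.update(HEADERS)
    return session


def scan(session, start, count, delay=0.0, timeout=10):
    """Check `count` consecutive question IDs from `start`; return URLs answering 200."""
    found = []
    for question_id in range(start, start + count):
        url = "https://www.zhihu.com/question/" + str(question_id)
        try:
            response = session.get(url, timeout=timeout)
        except OSError:
            # requests.RequestException subclasses IOError (an alias of OSError),
            # so connection errors and timeouts end up here; skip this ID.
            continue
        if response.status_code == 200:
            found.append(url)
        time.sleep(delay)  # optional pause between requests
    return found
```

For example, `scan(make_session(), 61158074, 50, delay=1.0)` would check 50 IDs with a one-second pause between requests. Unlike the endless loop above, the range here is bounded, which makes the function easier to rerun from a known starting ID.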