'''
Target site: https://careers.tencent.com/search.html?pcid=40001
Data to scrape: job title, location, detail-page URL
Requirements:
1) Scrape the first 10 pages
2) Use multithreading
3) Use the producer-consumer pattern
4) Save the data to a CSV file
Analysis:
The page source contains none of the job data, which shows the page renders its
content dynamically. That leaves two options:
1) Scrape with Selenium; the drawback is that it is slow, and it is a poor fit
   for multithreading.
2) Press F12 to open the developer tools, go to Network => XHR, and pin down the
   real API URL.
Found one:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1650559854214&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40001&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
It is too long, so test which query parameters can be dropped. This shorter form
still returns the same data:
https://careers.tencent.com/tencentcareer/api/post/Query?parentCategoryId=40001&pageIndex=1&pageSize=10&area=cn
Parameterizing the page number gives the crawl template:
https://careers.tencent.com/tencentcareer/api/post/Query?parentCategoryId=40001&pageIndex={page}&pageSize=10&area=cn
'''
import requests
import json
import csv
import threading
from queue import Queue, Empty  # Empty is raised by Queue.get(block=False)
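
# Optional sanity check (a rough sketch, not called by the crawler): fetch one
# page of the trimmed URL and print the fields the parser relies on, to confirm
# the JSON shape is Data -> Posts -> {RecruitPostName, LocationName, PostURL}.
# If the API rejects bare requests, pass the same User-Agent header the crawler
# below uses.
def probe_api(page=1):
    url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
           f'?parentCategoryId=40001&pageIndex={page}&pageSize=10&area=cn')
    res = requests.get(url)
    for post in res.json()['Data']['Posts']:
        print(post['RecruitPostName'], post['LocationName'], post['PostURL'])
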
class Producer(threading.Thread):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'
    }
    jobs_list = []           # shared by all Producer threads
    lock = threading.Lock()  # guards jobs_list and the CSV file

    def __init__(self, page_queue):
        super(Producer, self).__init__()
        self.page_queue = page_queue

    def run(self):
        while True:
            try:
                # get(block=False) avoids the race between a separate empty()
                # check and get() when several threads share the queue
                url = self.page_queue.get(block=False)
            except Empty:
                break
            self.parse_page(url)

    def parse_page(self, url):
        res = requests.get(url, headers=Producer.headers)  # res.text is JSON
        # json.loads() turns the JSON into a dict; extract values by key
        jobs = json.loads(res.text)['Data']['Posts']
        rows = []
        for job in jobs:
            # drop the ID before the first '-'; [-1] keeps the whole
            # name when there is no '-' at all
            title = job['RecruitPostName'].split('-', 1)[-1]
            address = job['LocationName']
            detail_url = job['PostURL']
            rows.append([title, address, detail_url])
        with Producer.lock:  # threads share the list and the output file
            Producer.jobs_list.extend(rows)
            self.save_data(Producer.jobs_list)

    def save_data(self, data):
        # mode 'w' rewrites the file, so dump the header plus every row
        # collected so far on each call
        with open('jobs2.csv', 'w', encoding='utf-8', newline='') as f:
            wt = csv.writer(f)
            wt.writerow(['Title', 'Address', 'DetailUrl'])
            wt.writerows(data)
        print('done')
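
# Requirement 3 names the producer-consumer pattern, while in the class above
# the producers both parse and write. A minimal sketch of the split
# (hypothetical; not wired into the main block below): producers would push
# individual rows into data_queue instead of calling save_data, and a single
# consumer drains the queue and owns the CSV file, so no lock is needed.
class Consumer(threading.Thread):
    def __init__(self, data_queue):
        super(Consumer, self).__init__()
        self.data_queue = data_queue

    def run(self):
        with open('jobs2.csv', 'w', encoding='utf-8', newline='') as f:
            wt = csv.writer(f)
            wt.writerow(['Title', 'Address', 'DetailUrl'])
            while True:
                row = self.data_queue.get()
                if row is None:  # sentinel: all producers are done
                    break
                wt.writerow(row)
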
if __name__ == '__main__':
    # queue of page URLs to crawl
    page_queue = Queue()
    for i in range(1, 11):
        url = f'https://careers.tencent.com/tencentcareer/api/post/Query?parentCategoryId=40001&pageIndex={i}&pageSize=10&area=cn'
        page_queue.put(url)
    # start 3 producer threads and wait for all of them to finish
    p_list = []
    for i in range(3):
        t = Producer(page_queue)
        t.start()
        p_list.append(t)
    for p in p_list:
        p.join()
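    # Hypothetical wiring for the Consumer sketch above (not executed here):
    # producers would put rows into data_queue instead of writing the file,
    # and a None sentinel tells the consumer to stop once they are done.
    #   data_queue = Queue()
    #   c = Consumer(data_queue)
    #   c.start()
    #   # ... start producers that push rows into data_queue ...
    #   for p in p_list:
    #       p.join()
    #   data_queue.put(None)
    #   c.join()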
# Appendix: contents of the first 10 pages: