'''
    Target site: https://careers.tencent.com/search.html?pcid=40001
    Fields to scrape: job title, location, detail-page url
    Requirements:
    1) scrape the first 10 pages
    2) use multithreading
    3) use the producer/consumer pattern
    4) save the data to a csv file
    Analysis:
    The page source contains none of the listings, so the content is rendered dynamically.
    That leaves two options:
    1) scrape with selenium, but it is slow and a poor fit for multithreading
    2) press F12 to open the developer tools, go to Network => XHR, and locate the real data url
    Found this one:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1650559854214&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40001&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
    It is long, though; drop parameters one by one to see which are optional. This still works:
    https://careers.tencent.com/tencentcareer/api/post/Query?parentCategoryId=40001&pageIndex=1&pageSize=10&area=cn
    https://careers.tencent.com/tencentcareer/api/post/Query?parentCategoryId=40001&pageIndex={page}&pageSize=10&area=cn
'''
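The trimming step above can be double-checked by rebuilding the url from only the surviving parameters. A small sketch (page_url is a helper name introduced here, not part of the original script):

```python
from urllib.parse import urlencode

BASE = 'https://careers.tencent.com/tencentcareer/api/post/Query'

def page_url(page, parent_category_id=40001, page_size=10):
    # keep only the parameters the trimmed request still needs
    params = {
        'parentCategoryId': parent_category_id,
        'pageIndex': page,
        'pageSize': page_size,
        'area': 'cn',
    }
    return f'{BASE}?{urlencode(params)}'

urls = [page_url(p) for p in range(1, 11)]
```

Building the query string with urlencode instead of hand-concatenating also takes care of escaping if a parameter ever contains special characters.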

import requests
import json
import csv
import threading
from queue import Queue


class Producer(threading.Thread):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'
    }
    jobs_list = []
    lock = threading.Lock()  # guards jobs_list across producer threads

    def __init__(self, page_queue):
        super(Producer, self).__init__()
        self.page_queue = page_queue

    def run(self):
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_page(url)

    def parse_page(self, url):
        res = requests.get(url, headers=Producer.headers)  # res.text is JSON
        # json.loads() turns the JSON into a dict; pull the data out by key
        jobs = json.loads(res.text)['Data']['Posts']
        rows = []
        for job in jobs:
            # drop the leading department code, keep the rest of the name intact
            title = job['RecruitPostName'].split('-', 1)[-1]
            address = job['LocationName']
            detail_url = job['PostURL']
            rows.append([title, address, detail_url])
        with Producer.lock:
            Producer.jobs_list.extend(rows)


def save_data(data):
    with open('jobs2.csv', 'w', encoding='utf-8', newline='') as f:
        wt = csv.writer(f)
        wt.writerow(['Title', 'Address', 'DetailUrl'])
        wt.writerows(data)
    print('done')


if __name__ == '__main__':
    # queue holding the page urls
    page_queue = Queue()
    for i in range(1, 11):
        url = f'https://careers.tencent.com/tencentcareer/api/post/Query?parentCategoryId=40001&pageIndex={i}&pageSize=10&area=cn'
        page_queue.put(url)
    p_list = []
    for i in range(3):
        t = Producer(page_queue)
        t.start()
        p_list.append(t)
    for p in p_list:
        p.join()
    # write the csv once, after all producers finish,
    # so the threads don't overwrite each other's output
    save_data(Producer.jobs_list)

Attachment, the first 10 pages of results:
腾讯招聘前10页.png