layout: post
title: Scraping Zhilian Zhaopin Job Listings and Details with Python 3.7
subtitle: Scraping Zhilian Zhaopin job listings and detail pages with Python 3.7
date: 2019-03-18
author: he xiaodong
header-img: img/default-post-bg.jpg
catalog: true
tags:
- Python
- Zhilian Zhaopin scraping
- fake_useragent

Scraping Zhilian Zhaopin data with Python 3.7: watching the listing page in the browser's network panel shows that the results are loaded from a JSON API, so you can just build the query parameters and request that endpoint directly. Rotating the User-Agent header and sleeping between requests is enough to slip past the anti-scraping checks. If a package is missing when you run the script, install it with pip, e.g. `pip install requests`. A minimal sketch of the request pattern follows.
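The endpoint and the `data.results` response shape below are taken from jobs.py further down; I have trimmed the query string to the essential parameters, which is an assumption on my part (jobs.py passes the full set):

```python
import time

import requests
from fake_useragent import UserAgent

# Same API as jobs.py below; kw is the URL-encoded search keyword ("研发").
# Assumption: the trimmed parameter list is still accepted by the endpoint.
url = ("https://fe-api.zhaopin.com/c/i/sou"
       "?start=0&pageSize=90&cityId=530&kw=%E7%A0%94%E5%8F%91&kt=3")

ua = UserAgent(verify_ssl=False)
headers = {'User-Agent': ua.random}  # a fresh random User-Agent per request
data = requests.get(url, headers=headers).json()
print(len(data['data']['results']))
time.sleep(5)  # pause between requests so the crawl looks less like a bot
```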

SQL schema

```sql
CREATE TABLE `jobs` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
  `keyword` varchar(255) DEFAULT NULL COMMENT 'search keyword',
  `city` varchar(255) DEFAULT NULL COMMENT 'city',
  `company` varchar(255) DEFAULT NULL COMMENT 'company name',
  `size` varchar(20) DEFAULT NULL COMMENT 'company size',
  `type` varchar(10) DEFAULT NULL COMMENT 'company type',
  `company_url` varchar(255) DEFAULT NULL COMMENT 'company link',
  `eduLevel` varchar(20) DEFAULT NULL COMMENT 'education level',
  `emplType` varchar(20) DEFAULT NULL COMMENT 'employment type',
  `jobName` varchar(50) DEFAULT NULL COMMENT 'job title',
  `jobTag` varchar(200) DEFAULT NULL COMMENT 'benefits',
  `jobType` varchar(200) DEFAULT NULL COMMENT 'job category',
  `position` text COMMENT 'job description',
  `positionURL` varchar(200) DEFAULT NULL COMMENT 'posting link',
  `rate` varchar(10) DEFAULT NULL COMMENT 'feedback rate',
  `salary` varchar(20) DEFAULT NULL COMMENT 'salary',
  `workingExp` varchar(10) DEFAULT NULL COMMENT 'work experience',
  `city_code` varchar(10) DEFAULT NULL COMMENT 'city code',
  `create_time` datetime DEFAULT CURRENT_TIMESTAMP COMMENT 'created at',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4321 DEFAULT CHARSET=utf8mb4;
```
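The script assumes the `business` database already exists. A one-off setup sketch, reusing the same root credentials that jobs.py connects with (adjust to your environment):

```python
# One-off setup: create the `business` database that jobs.py expects.
# Credentials mirror the pymysql.connect() call in jobs.py below.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="123456", port=3306)
try:
    with conn.cursor() as cur:
        cur.execute("CREATE DATABASE IF NOT EXISTS business DEFAULT CHARACTER SET utf8mb4")
finally:
    conn.close()
```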

jobs.py

```python
# -*- coding: utf-8 -*-
import time
import urllib.parse

import pymysql
import requests
from fake_useragent import UserAgent
from requests_html import HTMLSession


def parse_page(keyword, url, city_code):
    try:
        ua = UserAgent(verify_ssl=False)  # verify_ssl=False avoids SSL errors when fetching the UA list
        headers = {'User-Agent': ua.random}
        print(headers)
        response = requests.get(url, headers=headers).json()
        results = response['data']['results']
        for r in results:
            keyword = urllib.parse.unquote(keyword)
            city = r['city']['display']
            company = r['company']['name']
            size = r['company']['size']['name']
            company_type = r['company']['type']['name']
            company_url = r['company']['url']
            eduLevel = r['eduLevel']['name']
            emplType = r['emplType']
            jobName = r['jobName']
            jobTag = r['jobTag']['searchTag']
            jobType = r['jobType']['display']
            positionURL = r['positionURL']  # posting link
            rate = r['rate']  # feedback rate
            salary = r['salary']  # salary
            workingExp = r['workingExp']['name']  # required work experience
            # Fetch the detail page and pull the job description
            session = HTMLSession()
            detail = session.get(positionURL)
            position = detail.html.find('.pos-ul', first=True).text
            insert_sql = """insert into jobs(keyword,city,company,size,type,company_url,
                eduLevel,emplType,jobName,jobTag,jobType,position,positionURL,rate,salary,
                workingExp,city_code)
                values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
            try:
                # Parameterized insert, so quotes in the description can't break the SQL
                cur.execute(insert_sql, (keyword, city, company, size, company_type,
                                         company_url, eduLevel, emplType, jobName, jobTag,
                                         jobType, position, positionURL, rate, salary,
                                         workingExp, city_code))
                conn.commit()
            except Exception as e:
                print(e)
    except Exception as e:
        print(e)


def parse_main(url, pages, city_code, job):
    for page in range(pages):
        p = page * 90  # start offset: the API pages by pageSize=90
        url_r = url.format(page=p, city_code=city_code, job=job)
        parse_page(job, url_r, city_code)
        # time.sleep(5)


if __name__ == '__main__':
    conn = pymysql.connect(
        host="127.0.0.1",
        user="root",
        password="123456",
        db="business",
        port=3306,
        charset="utf8"
    )
    cur = conn.cursor()
    url = "https://fe-api.zhaopin.com/c/i/sou?start={page}&pageSize=90&cityId={city_code}&industry=10100&salary=0,0&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw={job}&kt=3&=0&_v=0.33977872&x-zp-page-request-id=b0434b03d11e4b9daf4cf3a887fbd121-1547573058264-851670"
    pages = 3
    job = '%E7%A0%94%E5%8F%91'  # URL-encoded "研发" (R&D)
    city_code_list = ['530', '765', '538', '763']
    # city codes: Beijing 530, nationwide 489, Shenzhen 765, Shanghai 538, Guangzhou 763
    for city_code in city_code_list:
        parse_main(url, pages, city_code, job)
    conn.close()
```

Create the database first (mine is a local database named `business`), then run `py jobs.py` from the command line to start scraping. Tested locally, the scrape works; result screenshot:

(Figure 1: screenshot of the scraped job data)
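To double-check from the MySQL side, a quick per-city row count (a sketch assuming the same local credentials as jobs.py):

```python
# Sanity check after a run: how many rows landed in `jobs` per city.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="123456",
                       db="business", port=3306, charset="utf8")
with conn.cursor() as cur:
    cur.execute("SELECT city, COUNT(*) FROM jobs GROUP BY city")
    for city, n in cur.fetchall():
        print(city, n)
conn.close()
```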

Reference: GitHub repository
