Common anti-scraping measures and how to handle them

  1. Construct plausible HTTP request headers.

    • Accept

    • User-Agent - the third-party library fake-useragent

```python
from fake_useragent import UserAgent

ua = UserAgent()

# Each attribute returns a User-Agent string for that browser family.
# The commented strings are sample values; actual results vary per call.
ua.ie
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);
ua.msie
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)
ua['Internet Explorer']
# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
ua.chrome
# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2
ua.google
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
ua['google chrome']
# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
ua.firefox
# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
ua.ff
# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1
ua.safari
# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25

# And the best one: a random User-Agent weighted by real-world browser usage statistics
ua.random
```

    • Referer
    • Accept-Encoding
    • Accept-Language
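The headers listed above can be combined into a single request. The following is a minimal sketch using only the standard library; the helper name `build_request`, the header values, and the example URLs are illustrative assumptions, not values taken from these notes. In practice the values are usually copied from a real browser's developer tools.

```python
from urllib.request import Request

# Assumed, illustrative values; copy real ones from a browser's developer tools.
DEFAULT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    # urllib does not decompress responses automatically; drop this header
    # or decompress the body yourself if the server honours it.
    "Accept-Encoding": "gzip, deflate",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}


def build_request(url, referer=None, user_agent=None):
    """Return a urllib Request carrying browser-like headers (hypothetical helper)."""
    headers = dict(DEFAULT_HEADERS)
    if referer:
        # Many sites expect the Referer to be the page that linked to `url`.
        headers["Referer"] = referer
    if user_agent:
        # For example, a value produced by fake_useragent's ua.random.
        headers["User-Agent"] = user_agent
    return Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request("https://example.com/list?page=2",
                        referer="https://example.com/list?page=1")
    for name, value in req.header_items():
        print(f"{name}: {value}")
```

The request is only constructed here, not sent; passing it to `urllib.request.urlopen` would perform the actual fetch.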
  2. Check the cookies generated by the website.
    • A useful plugin: [EditThisCookie](http://www.editthiscookie.com/)
    • How to handle cookies generated dynamically by scripts
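For cookies that the server sets through ordinary `Set-Cookie` response headers, the round trip can be sketched with the standard library alone. The function name `cookie_header_from_set_cookie` and the sample cookie names and values below are hypothetical, chosen only for illustration.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Turn a list of Set-Cookie header values into one Cookie request header."""
    jar = {}
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)  # attributes such as Path and HttpOnly are parsed, not stored as cookies
        for name, morsel in parsed.items():
            jar[name] = morsel.value  # a later value for the same name overrides an earlier one
    return "; ".join(f"{name}={value}" for name, value in jar.items())


if __name__ == "__main__":
    # Hypothetical Set-Cookie values as they might appear in a response.
    received = [
        "sessionid=abc123; Path=/; HttpOnly",
        "csrftoken=xyz789; Path=/",
    ]
    print(cookie_header_from_set_cookie(received))
```

Cookies written by JavaScript never appear in a `Set-Cookie` header, so this approach does not see them. One option in that case is to load the page in a real browser through Selenium and read them with `driver.get_cookies()`.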
  3. Scrape dynamic content.
    • Selenium + WebDriver
    • Chrome / Firefox - Driver
  4. Limit the crawl rate.
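One simple way to limit the crawl rate is to enforce a minimum, slightly randomized gap between consecutive requests. The sketch below is only an illustration: the class name `Throttle` and its default delays are assumptions, and an appropriate delay depends on the target site.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between consecutive requests."""

    def __init__(self, min_delay=1.0, jitter=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds that must pass between requests
        self.jitter = jitter        # upper bound of the random extra delay
        self._clock = clock         # injectable for testing
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Sleep if the previous call was too recent; return the seconds slept."""
        target = self.min_delay + random.uniform(0, self.jitter)
        waited = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            waited = max(0.0, target - elapsed)
            if waited:
                self._sleep(waited)
        self._last = self._clock()
        return waited


if __name__ == "__main__":
    throttle = Throttle(min_delay=0.2, jitter=0.1)
    for page in range(3):
        slept = throttle.wait()  # call this before every request
        print(f"page {page}: slept {slept:.2f}s")
```

The random jitter keeps the request interval from being perfectly regular, since a fixed interval is itself an easy pattern for a site to detect.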
  5. Handle hidden fields in forms.
    • Do not submit a form before reading its hidden fields
    • Tools such as RoboBrowser can help with form submission
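Reading the hidden fields before submitting can be sketched with the standard library's `html.parser`. This is not how RoboBrowser works internally; the class and function names and the sample form below are hypothetical and only illustrate the idea of merging hidden values into the submitted data.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_form_data(html, visible_fields):
    """Merge the page's hidden fields with the values the crawler fills in."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)   # hidden values first (e.g. a CSRF token)
    data.update(visible_fields)  # then the fields we supply ourselves
    return data


if __name__ == "__main__":
    # Hypothetical login form used only for illustration.
    sample_form = (
        '<form method="post" action="/login">'
        '<input type="hidden" name="csrf_token" value="abc123">'
        '<input type="text" name="username">'
        '<input type="password" name="password">'
        '</form>'
    )
    print(build_form_data(sample_form, {"username": "alice", "password": "secret"}))
```

The resulting dictionary can then be sent as the body of the POST request, so that server-generated values such as tokens are returned unchanged.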
  6. Handle CAPTCHAs in forms.
    • OCR (Tesseract) - generally not considered for commercial projects
    • Professional recognition platforms - Chaojiying (超级鹰) / Yundama (云打码)
```python
from hashlib import md5

import requests


class ChaoClient(object):
    """Minimal client for the Chaojiying CAPTCHA recognition service."""

    def __init__(self, username, password, soft_id):
        self.username = username
        # The API expects the MD5 hex digest of the password, not the plain text.
        password = password.encode('utf-8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def post_pic(self, im, codetype):
        """Upload a CAPTCHA image (bytes or binary file object) and return the parsed JSON reply."""
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('captcha.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                          data=params, files=files, headers=self.headers)
        return r.json()


if __name__ == '__main__':
    # Replace the placeholders with your own account name, password and software ID.
    client = ChaoClient('username', 'password', 'software ID')
    with open('captcha.jpg', 'rb') as file:
        print(client.post_pic(file, 1902))
```
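  7. Avoid "traps".

    • Web pages may contain hidden links that lure crawlers into following them (traps, or honeypots)
    • Use Selenium + WebDriver + Chrome to check whether a link is visible or within the viewport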
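The browser-based check described above needs a running WebDriver (Selenium's `WebElement.is_displayed()` reports whether an element is rendered). As a lighter first pass, the sketch below swaps in a static heuristic instead: it flags links hidden through the `hidden` attribute or an inline `display: none` / `visibility: hidden` style. The class name and sample HTML are hypothetical, and the heuristic cannot see links hidden by stylesheets, scripts, or hidden ancestor elements.

```python
import re
from html.parser import HTMLParser

# Inline styles that hide an element from human visitors.
_HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class VisibleLinkParser(HTMLParser):
    """Split <a href> links into visible ones and ones hidden inline."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        hidden = "hidden" in attr or bool(_HIDDEN_STYLE.search(attr.get("style") or ""))
        (self.suspicious if hidden else self.visible).append(href)


if __name__ == "__main__":
    # Hypothetical page fragment with one normal link and two honeypot links.
    page = (
        '<a href="/list?page=2">Next</a>'
        '<a href="/trap" style="display: none">x</a>'
        '<a href="/trap2" hidden>y</a>'
    )
    parser = VisibleLinkParser()
    parser.feed(page)
    print("follow:", parser.visible)
    print("skip:", parser.suspicious)
```

A crawler would queue only the links in `visible`; anything that still looks doubtful can be confirmed in a real browser.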
  8. Hide your identity.