Let's first analyze https://www.ip138.com/, a site that reports your IP address and the region it belongs to:

📃 Faking an IP with forged headers - Figure 1

Inspecting the page shows that it nests an iframe which embeds the content of 2020.ip138.com.
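The iframe nesting can be confirmed programmatically. Below is a minimal sketch (the function name and sample markup are illustrative, not from the project) that pulls the first iframe's `src` out of a page's HTML with a regular expression:

```python
import re


def extract_iframe_src(html):
    """Return the src attribute of the first iframe in the page, or None."""
    match = re.search(r'<iframe[^>]*\bsrc=["\']([^"\']+)["\']', html)
    return match.group(1) if match else None


# Sample markup mimicking what www.ip138.com serves:
sample = '<body><iframe src="//2020.ip138.com/" frameborder="0"></iframe></body>'
print(extract_iframe_src(sample))  # //2020.ip138.com/
```

For a real check you would fetch the page first and pass `response.text` to this function.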

Our goal is to forge the request headers so that the site is fooled into resolving a fake IP.

Writing the code

First, let's write a plain spider that returns our current IP:

```python
# -*- coding: utf-8 -*-
import scrapy


class Ip138Spider(scrapy.Spider):
    name = 'ip138'
    allowed_domains = ['ip138.com']
    start_urls = ['http://2020.ip138.com']

    def parse(self, response):
        print("=" * 40)
        print(response.css('p::text').extract_first())
        print("=" * 40)
```

With no forged headers, running the spider returns my real IP and region, as expected:

📃 Faking an IP with forged headers - Figure 2

Forging the IP

First we create a utility class that provides a method for building the forged headers:

```python
#! /usr/bin/env python3
# -*- coding:utf-8 -*-
import random

from ip138_fake_headers.settings import USER_AGENT_LIST


class Utils(object):

    @staticmethod
    def get_header(host, ip=None):
        if ip is None:
            # Build a random dotted-quad address
            ip = '%s.%s.%s.%s' % (
                random.choice(list(range(255))),
                random.choice(list(range(255))),
                random.choice(list(range(255))),
                random.choice(list(range(255)))
            )
        return {
            'Host': host,
            'User-Agent': random.choice(USER_AGENT_LIST),
            'server-addr': '',
            'remote_user': '',
            'X-Client-IP': ip,
            'X-Remote-IP': ip,
            'X-Remote-Addr': ip,
            'X-Originating-IP': ip,
            'x-forwarded-for': ip,
            'Origin': 'http://' + host,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3",
            "Accept-Encoding": "gzip, deflate",
            "Referer": "http://" + host + "/",
            'Content-Length': '0',
            "Connection": "keep-alive"
        }
```
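The core of the trick can be exercised outside of Scrapy with the standard library alone. The sketch below (function names are my own, not from the project) builds a random dotted-quad and the IP-spoofing subset of the headers around it:

```python
import random


def random_ip():
    """Random dotted-quad; each octet in 0-254, mirroring random.choice(range(255))."""
    return '.'.join(str(random.randrange(255)) for _ in range(4))


def fake_ip_headers(host, ip=None):
    """Headers that claim the request originated from `ip`."""
    if ip is None:
        ip = random_ip()
    return {
        'Host': host,
        'X-Client-IP': ip,
        'X-Remote-IP': ip,
        'X-Remote-Addr': ip,
        'X-Originating-IP': ip,
        'X-Forwarded-For': ip,
    }


headers = fake_ip_headers('2020.ip138.com')
print(headers['X-Forwarded-For'])
```

Setting the same address in every IP-related header maximizes the chance that whichever one the server consults agrees with the others.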

This references a configuration value, so we need to define USER_AGENT_LIST in settings.py:

```python
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
```

Next, write the downloader middleware:

```python
# -*- coding: utf-8 -*-
from scrapy.http.headers import Headers

from ip138_fake_headers.libs.utils import Utils


class Ip138FakeHeadersDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Replace the outgoing headers wholesale with the forged set
        request.headers = Headers(Utils.get_header('2020.ip138.com'))
```

Enable the middleware in the settings file:

```python
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'ip138_fake_headers.middlewares.Ip138FakeHeadersDownloaderMiddleware': 1,
}
```
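Why does this work? Many sites sit behind a reverse proxy and derive the visitor's address from proxy headers such as `X-Forwarded-For` rather than from the TCP connection itself. A hypothetical (illustrative, not ip138's actual code) server-side lookup might resemble:

```python
def resolve_client_ip(headers, remote_addr):
    """Naive server-side logic: trust the first hop in X-Forwarded-For,
    falling back to the socket address. Spoofable by any client that
    sets the header itself."""
    forwarded = headers.get('X-Forwarded-For', '')
    if forwarded:
        return forwarded.split(',')[0].strip()
    return remote_addr


print(resolve_client_ip({'X-Forwarded-For': '12.34.56.78'}, '203.0.113.9'))  # 12.34.56.78
print(resolve_client_ip({}, '203.0.113.9'))  # 203.0.113.9
```

A correctly configured server would only honor `X-Forwarded-For` values appended by its own trusted proxies, in which case this trick fails.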

Run the spider again; as the output shows, the forged IP is resolved successfully:

📃 Faking an IP with forged headers - Figure 3

References