This article uses the ip138 spider from the earlier post as its example; the spider code itself is not repeated here.

Method 1: Modify the configuration file

Header data can be set for a spider very easily by editing the settings.py configuration file.

To change the User-Agent:

    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'
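The USER_AGENT setting is picked up by Scrapy's built-in UserAgentMiddleware, which stamps it on every outgoing request that does not already carry a User-Agent header.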

To change the full default headers:

    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Content-Length': '0',
        'Connection': 'keep-alive',
    }
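Headers in DEFAULT_REQUEST_HEADERS are merged into every request as defaults, so headers passed to an individual Request (Method 2 below) take precedence. Incidentally, hard-coding Content-Length is normally unnecessary, since Scrapy derives it from the request body. To verify what is actually sent, one option is to fetch a public header-echo service from scrapy shell (httpbin.org is assumed here merely as a convenient echo endpoint):

    scrapy shell "https://httpbin.org/headers"
    >>> print(response.text)   # the request headers exactly as the server received them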

Method 2: Set headers in the spider

Besides settings.py, headers can also be set for an individual spider via custom_settings, for example:

    import scrapy

    class Ip138Spider(scrapy.Spider):
        name = 'ip138'
        custom_settings = {
            'DEFAULT_REQUEST_HEADERS': {
                'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3',
                'Accept-Encoding': 'gzip, deflate',
                'Content-Length': '0',
                'Connection': 'keep-alive',
            }
        }
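Note that custom_settings must be defined as a class attribute: it is read when the crawler is configured, before the spider is instantiated, and for this spider it overrides the project-wide values in settings.py.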

Headers can also be set per request, at the time the request is made:

    # -*- coding: utf-8 -*-
    import scrapy

    class Ip138Spider(scrapy.Spider):
        name = 'ip138'
        allowed_domains = ['www.ip138.com', '2018.ip138.com']
        start_urls = ['http://2018.ip138.com/ic.asp']

        headers = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate',
            'Content-Length': '0',
            'Connection': 'keep-alive',
        }

        def start_requests(self):
            # Attach the custom headers to every initial request.
            for url in self.start_urls:
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)

        def parse(self, response):
            print("*" * 40)
            print("response text: %s" % response.text)
            print("response headers: %s" % response.headers)
            print("response meta: %s" % response.meta)
            print("request headers: %s" % response.request.headers)
            print("request cookies: %s" % response.request.cookies)
            print("request meta: %s" % response.request.meta)
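Assuming the spider lives in a standard Scrapy project from the earlier post, it runs with the usual command, and the parse callback then prints the headers that were actually sent:

    scrapy crawl ip138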

Method 3: Set headers in middleware

We can also set headers in a DownloaderMiddleware:

    class TutorialDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # Only fills in the User-Agent if the request does not already have one.
            request.headers.setdefault('User-Agent', 'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5')
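Since setdefault only fills in a missing header, a User-Agent already set by the spider or by DEFAULT_REQUEST_HEADERS is left untouched. A minimal variant that forces the value unconditionally would assign it instead:

    def process_request(self, request, spider):
        # Overwrites whatever User-Agent the request already carries.
        request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5'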

Of course, we can also replace the entire set of headers:

    from scrapy.http.headers import Headers

    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Content-Length': '0',
        'Connection': 'keep-alive',
    }

    class TutorialDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # Replace the request's headers wholesale with the dict above.
            request.headers = Headers(headers)

Enable the middleware in settings.py:

    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'tutorial.middlewares.TutorialDownloaderMiddleware': 1,
    }
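The number is the middleware's priority: lower values run closer to the engine, so their process_request fires earlier. The built-in UserAgentMiddleware (priority 500) only fills in a missing User-Agent, so a value set at priority 1 wins anyway; still, when a custom middleware takes full ownership of the User-Agent, it is common to disable the built-in one explicitly:

    DOWNLOADER_MIDDLEWARES = {
        'tutorial.middlewares.TutorialDownloaderMiddleware': 1,
        # Optional: turn off Scrapy's built-in User-Agent handling entirely.
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }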

Dynamic User-Agent

To rotate the User-Agent dynamically, configure a list of User-Agent strings in settings.py, import it in the middleware, and draw one at random for each request.

The implementation is as follows.

settings.py

    USER_AGENT_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

middlewares.py

    import random

    from tutorial.settings import USER_AGENT_LIST

    class TutorialDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # Pick a random User-Agent from the configured list for each request.
            request.headers.setdefault('User-Agent', random.choice(USER_AGENT_LIST))

Method 4: Rotate the User-Agent with fake-useragent

Install fake-useragent:

    pip install fake-useragent
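fake-useragent maintains a pool of real-world browser strings; a quick sanity check in a Python shell (using the library's documented random and chrome attributes):

    from fake_useragent import UserAgent

    ua = UserAgent()
    print(ua.random)   # a random User-Agent from the pool
    print(ua.chrome)   # a random Chrome User-Agent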

The implementation is as follows.

middlewares.py

    from fake_useragent import UserAgent

    class RandomUserAgentMiddleware(object):
        def __init__(self, crawler):
            super(RandomUserAgentMiddleware, self).__init__()
            self.ua = UserAgent()
            # Which UserAgent attribute to use, e.g. 'random' or 'chrome'; defaults to 'random'.
            self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_request(self, request, spider):
            def get_ua():
                return getattr(self.ua, self.ua_type)
            request.headers.setdefault('User-Agent', get_ua())

Enable the middleware in settings.py:

    RANDOM_UA_TYPE = 'random'
    DOWNLOADER_MIDDLEWARES = {
        'tutorial.middlewares.RandomUserAgentMiddleware': 543,
    }
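With the sketch above, RANDOM_UA_TYPE can be any attribute that fake-useragent exposes, such as 'chrome' or 'firefox'; 'random' draws from the whole pool. Note that RANDOM_UA_TYPE is a custom setting name introduced by this middleware, not a Scrapy built-in.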
