Method 1: Set the proxy in meta
You can set the proxy field directly in your spider: just add a meta field when constructing the Request:
```python
# -*- coding: utf-8 -*-
import scrapy


class Ip138Spider(scrapy.Spider):
    name = 'ip138'
    allowed_domains = ['ip138.com']
    start_urls = ['http://2020.ip138.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'proxy': 'http://163.125.69.29:8888'}, callback=self.parse)

    def parse(self, response):
        print("response text: %s" % response.text)
        print("response headers: %s" % response.headers)
        print("response meta: %s" % response.meta)
        print("request headers: %s" % response.request.headers)
        print("request cookies: %s" % response.request.cookies)
        print("request meta: %s" % response.request.meta)
```
Method 2: Set the proxy in a middleware
The middleware in middlewares.py is written as follows:
```python
# -*- coding: utf-8 -*-
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.your_proxy:8888"
```
There are two things to watch out for here:

- First, the proxy value must be written with the `http://` prefix; otherwise you get the error `to_bytes must receive a unicode, str or bytes object, got NoneType`.
- Second, the official documentation says that `process_request` must return one of: a Request object, a Response object, or None. In practice, though, you can omit the `return` statement, since a Python function without one returns None implicitly (see the short sketch below).
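To illustrate that second point, the following body (a minimal sketch reusing the placeholder proxy from above) is equivalent to one that ends with an explicit `return None`:

```python
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.your_proxy:8888"
        # no explicit return: the function implicitly returns None,
        # which Scrapy interprets as "continue handling this request"
```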
Additionally, if the proxy requires a username and password, you need to add a few more lines:
```python
import base64

# Use the following lines if your proxy requires authentication
proxy_user_pass = "USERNAME:PASSWORD"
# Set up basic authentication for the proxy
# (base64.b64encode replaces the deprecated base64.encodestring, which was
# removed in Python 3.9 and also appended a stray trailing newline)
encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
```
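Putting the two pieces together, a complete authenticating middleware might look like this. This is a sketch, not from the original post; the class name `AuthProxyMiddleware` and the credentials are placeholders:

```python
# -*- coding: utf-8 -*-
import base64


class AuthProxyMiddleware(object):  # hypothetical name, for illustration only
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.your_proxy:8888"
        proxy_user_pass = "USERNAME:PASSWORD"  # placeholder credentials
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
```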
If you want to configure multiple proxies, you can add a proxy list to the settings file:
```python
PROXIES = [
    '163.125.69.29:8888',
]
```
Then import it in the middleware:
```python
# -*- coding: utf-8 -*-
import random

from ip138_proxy.settings import PROXIES


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://%s" % random.choice(PROXIES)
        return None
```
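Since each request draws a fresh proxy from the list, one way to confirm the rotation is actually happening is to log every choice. An illustrative variant, not from the original post:

```python
# -*- coding: utf-8 -*-
import random

from ip138_proxy.settings import PROXIES


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = "http://%s" % random.choice(PROXIES)
        request.meta['proxy'] = proxy
        # illustrative only: record which proxy each request goes through
        spider.logger.debug("using proxy %s for %s", proxy, request.url)
```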
Enable the middleware under DOWNLOADER_MIDDLEWARES in settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    'myCrawler.middlewares.ProxyMiddleware': 1,
}
```
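As an aside (my note, not from the original post): the number is the middleware's priority, and middlewares with lower numbers have their `process_request` called earlier. Scrapy's built-in `HttpProxyMiddleware` is registered at priority 750 by default, so a small value such as 1 means the custom middleware sets the proxy first. The same setting, annotated:

```python
DOWNLOADER_MIDDLEWARES = {
    # lower number = earlier process_request; Scrapy's built-in
    # HttpProxyMiddleware sits at priority 750 by default
    'myCrawler.middlewares.ProxyMiddleware': 1,
}
```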
Run the spider, and you should see that the configured proxy has taken effect.
If you are looking for free proxies, you can find some at Kuaidaili (快代理).