7. Scrapy Framework - 7.4 Simulating Login with Scrapy - 《Python爬虫》 (Python Web Crawlers)

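1. Simulating login with Scrapy
2. Logging in to Renren
3. Simulating login with Scrapy: sending a POST request
4. Simulating login with Scrapy: automatic login
- 4.1 How do you send it?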

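1. Simulating login with Scrapy

Why simulate a login?
To obtain cookies, so that pages available only after login can be crawled.

Recap:
How does requests simulate a login?

1. Send the page request with the cookies attached directly
2. Find the login endpoint and send a POST request, storing the cookies it returns

How does selenium simulate a login?

Locate the relevant input elements, type in the text, and click the login button.

Scrapy likewise has two ways to simulate a login:

1. Carry the cookies directly: override the start_requests method (defining the cookie values there), then send the request with yield scrapy.Request(url=, cookies=, callback=). Putting the cookie in headers does not take effect.
2. Find the URL the login POST request is sent to, attach the form data, and send the request.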

2. Logging in to Renren

Logging in by carrying cookies directly

import scrapy
class RrSpider(scrapy.Spider):
    name = 'rr'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/974376660/profile']
    def start_requests(self):
        url = self.start_urls[0]
        cookie_str = 'anonymid=k5819yjj-v87pet; _r01_=1; taihe_bi_sdk_uid=b2d2177e744b29a034cf8c5106c8d219; ln_uact=13415662175; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; _de=C4463AE8B4546175118A454C43DCC81C; depovince=ZGQT; jebecookies=6dfb56c2-aa43-4175-8b97-1fd4e3bf780a|||||; taihe_bi_sdk_session=cc2cc891cf727d3774964b40b9ec51d3; ick_login=9113ac97-37ac-4032-a411-a8a63fa7f725; p=939f755d6bc277295658b9e680c366c60; first_login_flag=1; t=0dae2e967bf60a114fe6382b71e9dd540; societyguester=0dae2e967bf60a114fe6382b71e9dd540; id=974376660; xnsid=caf49547; ver=7.0; loginfrom=null; wp_fold=0; wpsid=15830122379353'
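        # Sending the cookie string through headers does not take effect:
        # headers = {
        #     "Cookie": cookie_str
        # }
        # Split each pair on the first "=" only, since a cookie value may itself contain "=".
        cookies = dict(i.split("=", 1) for i in cookie_str.split("; "))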
        yield scrapy.Request(
            url=url,
            # headers=headers,
            callback=self.parse,
            cookies=cookies
        )
    def parse(self, response):
        print(response.body.decode())
        with open('人人.html', 'w', encoding='utf-8') as f:
            f.write(response.body.decode())
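
Converting a copied Cookie header into the dict that scrapy.Request expects can also be factored into a small helper. The following is a minimal sketch using only plain Python; the name parse_cookie_string and the sample cookie values are made up for illustration and are not part of Scrapy.

```python
def parse_cookie_string(cookie_str):
    """Turn a browser-style 'k1=v1; k2=v2' Cookie header into a dict."""
    cookies = {}
    for pair in cookie_str.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments such as a trailing ';'
        # Split on the first '=' only: values may contain '=' themselves.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Made-up sample values, not real session cookies.
sample = "id=974376660; token=abc=def; first_login_flag=1;"
cookies = parse_cookie_string(sample)
print(cookies)
```

The resulting dict can be passed as cookies= to scrapy.Request inside start_requests.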

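3. Simulating login with Scrapy: sending a POST request

3.1 How do you send it?

Use yield scrapy.FormRequest(url= , formdata= , callback= ) to send the POST request.

url: the address the form is submitted to;
formdata: the form data to submit;
callback: the callback function to run once the form has been submitted;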

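3.2 When formdata contains parameters that change, how do you work out the pattern?

Check whether the parameter's value appears in the page source. If it does, request the URL and extract the value with XPath.
If it is not in the page source, look for a pattern (incrementing by 1, for example), or check whether the referring page carries the value (when a list page links to a detail page, the list page usually contains the parameters the detail page needs).
Look for it in the JavaScript. This requires solid familiarity with JS and some experience searching through it (something I have not learned yet).
Note: some formdata parameters do change but can simply be left out without affecting the submission, for example timestamps or GitHub's ga_id (every other parameter has a value in the page and only ga_id does not, so it is worth trying without it).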
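
The first check, looking for the value in the page source, can be tried offline before writing any XPath. The sketch below uses only Python's standard html.parser module rather than Scrapy's selectors, and the HTML fragment and its values are invented for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Invented login-page fragment; real pages will differ.
page_source = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok123">
  <input type="hidden" name="timestamp" value="1583012237935">
  <input type="text" name="login">
</form>
"""

parser = HiddenInputParser()
parser.feed(page_source)
print(parser.fields)
```

Any formdata parameter that shows up in this dict can be read from the page before the form is submitted; a parameter that is missing has to be traced by one of the other routes listed above.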

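3.3 Logging in to GitHub
One pitfall here: the first time you log in through the endpoint, GitHub redirects to a page asking for a verification code. When that happens, the verification form has to be submitted separately, so you need to find the endpoint it is submitted to and the form data it sends, then build another POST request.

Ways to find the verification endpoint after the redirect:
Method 1: Save the redirected response as a local HTML file, open it locally, enter any verification code, and note the address and parameters of the resulting request. Because the page was opened from a local HTML file, the domain is localhost; replace localhost with github.com to get the address the verification form should really be submitted to.

Method 2: Log in automatically with selenium. If the endpoint asks for a verification code, a selenium login usually will too. Once redirected to the verification page, again enter any code and note the address and parameters of the resulting request.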
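
The host swap described in method 1 can be done with the standard library instead of by hand. This is a rough sketch: rebase_url is a hypothetical helper name, and the captured address is an assumed example of what the browser's network panel might show.

```python
from urllib.parse import urlsplit, urlunsplit


def rebase_url(captured_url, real_origin):
    """Replace scheme and host of captured_url with those of real_origin."""
    captured = urlsplit(captured_url)
    origin = urlsplit(real_origin)
    return urlunsplit(
        (origin.scheme, origin.netloc, captured.path, captured.query, captured.fragment)
    )


# Hypothetical address seen in the browser's network panel.
captured = "http://localhost/sessions/verified-device"
print(rebase_url(captured, "https://github.com"))
```

The path and query string are kept as captured, so only the scheme and host change.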

import scrapy
class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']
    def parse(self, response):
        print('try login...')
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        timestamp = response.xpath("//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//input[@name='timestamp_secret']/@value").extract_first()
        formdata = {
            "commit": "Sign in",
            "authenticity_token": authenticity_token,
            # "ga_id": "9383650.1568103844",
            "login": "",
            "password": "",
            "webauthn-support": "supported",
            "webauthn-iuvpaa-support": "unsupported",
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret
        }
        yield scrapy.FormRequest(
            url='https://github.com/session',
            callback=self.after_login,
            formdata=formdata
        )
    def after_login(self, response):
        print(response.url)
        if response.url == "https://github.com/sessions/verified-device":
            verified_url = "https://github.com/sessions/verified-device"
            authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
            otp = input("please input your verified_code: ")
            print(otp, authenticity_token)
            formdata = {
                "authenticity_token": authenticity_token,
                "otp": otp
            }
            yield scrapy.FormRequest(
                url=verified_url,
                formdata=formdata,
                callback=self.after_login
            )
        with open('github.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
        if response.url == 'https://github.com/':
            print('login successful!')

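4. Simulating login with Scrapy: automatic login

For form requests, Scrapy's FormRequest also provides a from_response method that automatically picks up the form in the page, so we only need to pass in the username and password to send the request.

4.1 How do you send it?

Use yield scrapy.FormRequest.from_response(response= , formdata= , callback= ) to send the POST request.

response: the response for the login page;
formdata: the form data to submit; only the username and password are needed here, because the values of the other parameters are taken directly from the form in the page;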
callback: the callback function to run once the form has been submitted;

```python
import scrapy


class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response reads the hidden fields (authenticity_token, timestamp,
        # timestamp_secret) from the login form itself, so only the
        # credentials need to be supplied here.
        formdata = {
            "login": "xiaobink96",
            "password": ""
        }
        yield scrapy.FormRequest.from_response(
            response=response,
            callback=self.after_login,
            formdata=formdata
        )

    def after_login(self, response):
        print(response.url)
        with open('github2.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
```
