Spider - Web Crawling Fundamentals - Yao Yinnan's Blog

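title: Web Crawling Fundamentals
date: 2018-05-03 15:50:51
tags: spider
Technology choice: scrapy + requests
What crawlers are used for
Regular expressions
The tree structure of a website
Depth-first
Breadth-first
Crawler deduplication strategies
String encoding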


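Technology choice: scrapy + requests

1. scrapy is a framework; requests and beautifulsoup are libraries.

2. requests and beautifulsoup can both be used inside the scrapy framework.

3. scrapy is built on twisted (asynchronous networking), and performance is its biggest advantage.

4. scrapy is easy to extend and provides many built-in features.

5. scrapy's built-in css and xpath selectors are very convenient, whereas the drawback of beautifulsoup is that it is slow.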

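What crawlers are used for

1. Search engines: Baidu, Google, and vertical search engines (searches targeted at a specific purpose and domain)

2. Recommendation engines: Toutiao

3. Data samples for machine learning

4. Data analysis (finance, public opinion, healthcare, etc.)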

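Regular expressions

Used to extract the important parts of a string, to check whether a string follows a given rule, and so on.

1. Special characters

1) ^ $ * ? + {2} {2,} {2,5} |

2) [] [^] [a-z] .

3) \s \S \w \W

4) [\u4E00-\u9FA5] () \d

Detailed explanation:

1) ^ $ * ? + {2} {2,} {2,5} |

^ the string must start with the given character, e.g. ^b

$ the string must end with the given character, e.g. 3$

. matches any single character except the newline (\n)

* the preceding character may repeat any number of times (>= 0)

.* any string

re.match() checks whether the string str matches the regex regex_str, starting from the beginning of the string

re.match(regex_str, str)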

import re

line = "bobby123"
regex_str = "^b.*3$"
match_obj = re.match(regex_str, line)   # None when there is no match
if match_obj:
    print("yes")        # output: yes
else:
    print("no")

() extracts content (a capturing group)

group(0) returns the entire matched text

group(1) returns the content captured by the first pair of parentheses
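To make the difference between group(0) and group(1) concrete, here is a minimal sketch. The pattern with two groups is only an illustration and is not from the original notes.

```python
import re

line = "bobby123"
# Two capturing groups: the name part and the digits
match_obj = re.match(r"(bobby)(123)", line)

whole = match_obj.group(0)        # the entire matched text
first = match_obj.group(1)        # content of the first pair of parentheses
second = match_obj.group(2)       # content of the second pair
all_groups = match_obj.groups()   # tuple of all captured groups

print(whole)        # bobby123
print(first)        # bobby
print(second)       # 123
print(all_groups)   # ('bobby', '123')
```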

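? switches the preceding quantifier to non-greedy matching

Greedy matching: by default a regex tends to match the longest possible string.

Non-greedy matching: stop as soon as a match is found, consuming as few characters as possible.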

line = "booooooooobby123"
regex_str1 = "^.*(b.*b).*"        # output: bb
regex_str2 = "^.*?(b.*b).*"        # output: booooooooobb
regex_str3 = "^.*?(b.*?b).*"    # output: booooooooob
match_obj = re.match(regex_str1, line)
if match_obj:
    print(match_obj.group(1))
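A case where this often matters for a crawler is matching an HTML tag. The snippet below is a small sketch with a made-up html string; it only contrasts the greedy and non-greedy forms of the same pattern.

```python
import re

html = "<h1>title</h1>"   # made-up sample input

# Greedy: .* runs to the last ">" in the string
greedy = re.match(r"<.*>", html).group(0)
# Non-greedy: .*? stops at the first ">"
lazy = re.match(r"<.*?>", html).group(0)

print(greedy)   # <h1>title</h1>
print(lazy)     # <h1>
```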

+ the preceding character appears at least once (>= 1)

line = "booooooooobbaaby123"
regex_str = ".*(b.+b).*"        # output: baab
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

{2} the preceding character repeats exactly twice; here, exactly two arbitrary characters sit between the two b's

line = "booooooooobbbaaby123"
regex_str = ".*(b.{2}b).*"        # output: baab
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

{2,} two or more arbitrary characters sit between the two b's

line = "booooooooobbbaaaby123"
regex_str = ".*(b.{2,}b).*"        # output: baaab
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

{2,5} between two and five arbitrary characters sit between the two b's

line = "booooooooobbbaaaaby123"
regex_str1 = ".*(b.{2,5}b).*"    # output: baaaab
regex_str2 = ".*(b.{2,3}b).*"    # no match, so nothing is printed
match_obj = re.match(regex_str1, line)
if match_obj:
    print(match_obj.group(1))

| either of the two alternatives may match

line = "bobby123"
regex_str = "((bobby|boobby)123)" # output: bobby123
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

2) [] [^] [a-z] .

[] e.g. [234]: any one of 2, 3, or 4 matches

[0-9] any single digit from 0 to 9

e.g. validating a mobile phone number (11 digits)

line = "15222222222"
regex_str = "(1[3578][0-9]{9})"     # output: 15222222222
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

[^1] negation: any character other than 1 matches

line = "15ssssssss2"
regex_str1 = "(1[3578][^1]{9})"        # output: 15ssssssss2
regex_str2 = "(1[3578][^2]{9})"        # no match, so nothing is printed
match_obj = re.match(regex_str1, line)
if match_obj:
    print(match_obj.group(1))

Note: inside [], . and * have no special meaning; for example, [.*] simply matches a single literal . or *
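A quick way to check this note is to try the character class against a few one-character strings. This is only a sketch; the test strings are arbitrary.

```python
import re

pattern = r"[.*]"   # inside [], both . and * are literal characters

# Map each test string to whether the pattern matches it
results = {text: bool(re.match(pattern, text)) for text in [".", "*", "a"]}
print(results)   # {'.': True, '*': True, 'a': False}

# Outside [], a bare . still matches any character
dot_matches_a = bool(re.match(r".", "a"))
print(dot_matches_a)   # True
```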

3) \s \S \w \W

\s any single whitespace character, including space, tab, and form feed

line = "你 好"
regex_str = r"(你\s好)"      # output: 你 好
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

\S the opposite of \s: any single character that \s does not match. It matches only one character; append "+" to match several.

line = "你ss好"
regex_str = r"(你\S+好)"     # output: 你ss好
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

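\w any single letter, digit, or underscore, i.e. any one of A-Z, a-z, 0-9, _ (in python3, \w also matches Unicode word characters such as Chinese characters by default)

\W and \D are the opposites of \w and \d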
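The notes give no example for \w, \W, and \D, so here is a minimal sketch with made-up input strings. The re.ASCII flag in the last line is a standard flag of the re module that restricts \w to A-Z, a-z, 0-9, and _.

```python
import re

word = re.match(r"\w+", "user_01!").group(0)      # letters, digits, underscore
non_word = re.match(r"\W+", "!! ok").group(0)     # everything \w does not match
non_digit = re.match(r"\D+", "abc123").group(0)   # everything \d does not match

print(word)              # user_01
print(repr(non_word))    # '!! '
print(non_digit)         # abc

# In python3, \w matches Chinese characters unless re.ASCII is set
unicode_word = re.match(r"\w+", "你好_1").group(0)
ascii_only = re.match(r"\w+", "你好_1", re.ASCII)
print(unicode_word)      # 你好_1
print(ascii_only)        # None
```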

4) [\u4E00-\u9FA5] () \d

[\u4E00-\u9FA5] the Unicode range covering common Chinese characters; matches any single Chinese character

line = "study in 清华大学"
regex_str = ".*?([\u4E00-\u9FA5]+大学)"     # output: 清华大学
match_obj = re.match(regex_str, line)
if match_obj:
    print(match_obj.group(1))

\d any single digit from 0 to 9

line = "XXX出生于2001年"
regex_str1 = r".*(\d{4})年"      # output: 2001
regex_str2 = r".*?(\d+)年"      # output: 2001
match_obj = re.match(regex_str1, line)
if match_obj:
    print(match_obj.group(1))

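Example:

# Several date formats that the same pattern should handle
lines = [
    "XXX出生于2001年6月1日",    # output: 2001年6月1
    "XXX出生于2001/6/1",        # output: 2001/6/1
    "XXX出生于2001-06-01",      # output: 2001-06-01
    "XXX出生于2001-06",         # output: 2001-06
]
regex_str = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"
for line in lines:
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))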

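The tree structure of a website

Web Crawling Fundamentals - Figure 1

In practice, urls on the web form loops

Web Crawling Fundamentals - Figure 2

How to avoid looping: deduplication

Keep a record of crawled pages in a list; when a url is reached again, check whether it has already been crawled.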
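The idea can be sketched without any network access. The links dictionary below is a hypothetical in-memory link graph with a loop (A -> B -> C -> A), standing in for real pages, and crawl is an illustrative helper name.

```python
# Hypothetical link graph containing a loop: A -> B -> C -> A
links = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],
}

def crawl(start):
    visited = []            # record of crawled urls, as described above
    pending = [start]       # urls waiting to be crawled
    while pending:
        url = pending.pop()
        if url in visited:  # already crawled: skip it to break the loop
            continue
        visited.append(url)
        pending.extend(links[url])
    return visited

order = crawl("A")
print(order)   # ['A', 'B', 'C', 'D']
```

Without the visited check, the loop between A, B, and C would keep the crawl running forever.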

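Depth-first

Web Crawling Fundamentals - Figure 3

Starting from the root and going left first, follow the child nodes down to the deepest level, backtrack, move to the sibling node, and so on.

A -> B -> D -> E -> I -> C -> F -> G -> H (implemented with recursion)

scrapy crawls in depth-first order by default.

Web Crawling Fundamentals - Figure 4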
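A recursive version might look like the sketch below. The figures are not available here, so the tree dictionary is an assumption: it is the shape inferred from the depth-first and breadth-first orders listed in these notes.

```python
# Assumed tree, inferred from the traversal orders given in the text
tree = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G", "H"],
    "D": [], "E": ["I"], "F": [], "G": [], "H": [], "I": [],
}

def depth_first(node, order=None):
    """Visit node, then recurse into each child from left to right."""
    if order is None:
        order = []
    order.append(node)
    for child in tree[node]:
        depth_first(child, order)
    return order

dfs_order = depth_first("A")
print(" -> ".join(dfs_order))   # A -> B -> D -> E -> I -> C -> F -> G -> H
```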

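Breadth-first

Web Crawling Fundamentals - Figure 5

Starting from the root and going left first, visit sibling nodes level by level.

A -> B -> C -> D -> E -> F -> G -> H -> I (implemented with a queue)

Web Crawling Fundamentals - Figure 6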
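A queue-based version can be sketched with collections.deque. As the figures are not available here, the tree dictionary is again an assumption inferred from the traversal orders in these notes.

```python
from collections import deque

# Assumed tree, inferred from the traversal orders given in the text
tree = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G", "H"],
    "D": [], "E": ["I"], "F": [], "G": [], "H": [], "I": [],
}

def breadth_first(root):
    """Visit nodes level by level using a first-in, first-out queue."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()      # take the oldest pending node
        order.append(node)
        queue.extend(tree[node])    # its children wait behind the current level
    return order

bfs_order = breadth_first("A")
print(" -> ".join(bfs_order))   # A -> B -> C -> D -> E -> F -> G -> H -> I
```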

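Crawler deduplication strategies

1. Save visited urls in a database (inefficient)

2. Save visited urls in a set (in memory); looking up a url costs only O(1), but memory use is very large

100,000,000 urls * 2 bytes * 50 characters / 1024 / 1024 / 1024 ≈ 9 GB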

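3. Hash each url with md5 or a similar function and save the result in a set (the kind of approach scrapy uses)

This compresses a url of any length into a fixed-length md5 digest of 16 bytes each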

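4. Use a bitmap: map each visited url through a hash function to a single bit (high collision rate, since multiple urls can map to the same position)

Each url maps to one bit position

5. A bloom filter improves on the bitmap by using multiple hash functions to reduce collisions

100,000,000 * 1 bit / 8 / 1024 / 1024 ≈ 12 MB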
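The size estimates and two of the strategies above can be sketched with the standard library alone. Everything named here (is_new_url, BloomFilter, the bit count, the number of hash functions, the example urls) is illustrative and not taken from scrapy or any other library; deriving several hash functions by prefixing a seed to the md5 input is just one simple choice.

```python
import hashlib

URL_COUNT = 100_000_000

# Size estimates for the strategies in the list above
raw_set_gb = URL_COUNT * 2 * 50 / 1024 ** 3   # raw urls, 50 chars of 2 bytes each
md5_set_gb = URL_COUNT * 16 / 1024 ** 3       # one 16-byte md5 digest per url
bitmap_mb = URL_COUNT / 8 / 1024 ** 2         # one bit per url
print(f"{raw_set_gb:.2f} GB, {md5_set_gb:.2f} GB, {bitmap_mb:.2f} MB")
# 9.31 GB, 1.49 GB, 11.92 MB

# Strategy 3: keep fixed-length md5 digests in a set
seen_fingerprints = set()

def is_new_url(url):
    fingerprint = hashlib.md5(url.encode("utf-8")).digest()   # always 16 bytes
    if fingerprint in seen_fingerprints:
        return False
    seen_fingerprints.add(fingerprint)
    return True

# Strategy 5: a tiny bloom filter, several hash positions per url
class BloomFilter:
    def __init__(self, size_in_bits, hash_count):
        self.size = size_in_bits
        self.hash_count = hash_count
        self.bits = bytearray((size_in_bits + 7) // 8)

    def _positions(self, url):
        # Derive hash_count different positions by seeding the md5 input
        for seed in range(self.hash_count):
            digest = hashlib.md5(f"{seed}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True means "probably seen"; False means "definitely not seen"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

first_time = is_new_url("http://example.com/a")
second_time = is_new_url("http://example.com/a")
print(first_time, second_time)   # True False

bloom = BloomFilter(size_in_bits=1024, hash_count=3)
bloom.add("http://example.com/a")
print("http://example.com/a" in bloom)   # True
```

A bloom filter never reports a url it has stored as unseen, but it can occasionally report an unseen url as seen; a larger bit array or a better-tuned number of hash functions lowers that rate.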

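String encoding

Web Crawling Fundamentals - Figure 7

Web Crawling Fundamentals - Figure 8

unicode: simple to process in memory, but takes more space

utf-8: more complex to process, but space usage is flexible (variable-length)

Combining the two: Web Crawling Fundamentals - Figure 9

When using encode() and decode(), a conversion is needed (python2):

windows: uses gb2312 encoding

s.decode("gb2312").encode("utf-8")

unix: uses utf-8 encoding

s.decode("utf-8").encode("utf-8")

Question: why is a conversion needed when the string is already utf-8?

Answer: encode() works on unicode, so python2 first decodes the byte string implicitly with the default encoding, which on unix is:

sys.getdefaultencoding()

output: 'ascii'

So the string must first be converted to unicode explicitly with decode("utf-8")

In python3, strings are unicode by default, so encode("utf-8") can be used directly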
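For comparison, here is a minimal python3 sketch of the same conversion. The sample text is arbitrary; gb_bytes stands in for bytes read from a gb2312-encoded source.

```python
text = "你好"                        # a python3 str is already unicode

utf8_bytes = text.encode("utf-8")    # str -> bytes, no decode step needed
gb_bytes = text.encode("gb2312")     # the same text in gb2312

# Bytes from a gb2312 source: decode to str first, then encode as utf-8
converted = gb_bytes.decode("gb2312").encode("utf-8")

print(len(utf8_bytes), len(gb_bytes))   # 6 4
print(converted == utf8_bytes)          # True
```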
