9 Regex Matching - 《Python正则手册》 (Python Regex Handbook)

Regex matching means searching a string and returning a single match.

Basic functions

The re module provides three functions for regex matching:

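search: scans from any position in the string and returns the match object for the first successful match; returns None if nothing matches
match: matches only at the start of the string; returns None if the match fails
fullmatch: the entire string must match the pattern; returns None otherwise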

All three take the same parameters:

re.xxx(pattern, string, flags=0)
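
As a small illustration of the flags parameter, the sketch below passes re.IGNORECASE (a standard flag of the re module, also spelled re.I) to search. The sample string and variable names are chosen only for this example.

```python
import re

text = "www.taobao.com"

# Without flags the comparison is case-sensitive, so nothing is found
case_sensitive = re.search("TAOBAO", text)
print(case_sensitive)              # None

# With re.IGNORECASE the same pattern matches the lowercase text
case_insensitive = re.search("TAOBAO", text, flags=re.IGNORECASE)
print(case_insensitive.group())    # taobao
print(case_insensitive.span())     # (4, 10)
```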

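search looks for a match starting at any position in the string, so it is the most flexible of the three. Adding ^ at the start of the pattern makes search behave like match (as long as re.MULTILINE is not set), and adding $ at the end of the pattern makes match behave like fullmatch. Strictly speaking, $ also matches just before a trailing newline, so \Z is the exact equivalent. First, try search: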

import re

print(re.search('www', 'www.taobao.com'))
print(re.search('com', 'www.taobao.com'))

<re.Match object; span=(0, 3), match='www'>
<re.Match object; span=(11, 14), match='com'>

Now try match:

print(re.match('www', 'www.taobao.com'))
print(re.match('com', 'www.taobao.com'))

<re.Match object; span=(0, 3), match='www'>
None

Finally, try fullmatch:

print(re.fullmatch('www', 'www.taobao.com'))
print(re.fullmatch('com', 'www.taobao.com'))
print(re.fullmatch('www.taobao.com', 'www.taobao.com'))

None
None
<re.Match object; span=(0, 14), match='www.taobao.com'>

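These results make the difference between search, match, and fullmatch clear. Because 'com' sits at the end of 'www.taobao.com', match finds nothing at the start of the string and returns None. Because fullmatch must match the entire string, matching 'www' against 'www.taobao.com' also returns None.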
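
The relationship between the three functions and the ^, $ and \Z anchors can be checked with a short sketch. The variable names are illustrative, and the last two lines show the one case where $ and fullmatch disagree: a string that ends with a newline.

```python
import re

text = "www.taobao.com"

# search with a leading ^ behaves like match (re.MULTILINE not set)
search_anchored = re.search(r"^com", text)
match_plain = re.match(r"com", text)
print(search_anchored, match_plain)      # None None

# match with a trailing \Z behaves like fullmatch
match_anchored = re.match(r"www\Z", text)
full_plain = re.fullmatch(r"www", text)
print(match_anchored, full_plain)        # None None

# $ also matches just before a trailing newline; fullmatch does not
dollar_newline = re.match(r"www$", "www\n")
full_newline = re.fullmatch(r"www", "www\n")
print(dollar_newline is not None)        # True
print(full_newline)                      # None
```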

The re.Match object

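The output above also shows that a successful call returns a re.Match object. It provides the group(num) and groups() methods: group(num) returns the text captured by the group with that number, and groups() returns the text of all groups as a tuple. The lastindex attribute holds the index of the last capturing group that took part in the match, which equals the number of groups when every group matched, as it does here.
Example: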

import re

line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*?) are (.*?) ', line, re.I)
if matchObj:
    print("Total groups:", matchObj.lastindex)
    print("Data of all groups:", matchObj.groups())
    print("Whole matched string:", matchObj.group())
    print("Data of group 1:", matchObj.group(1))
    print("Data of group 2:", matchObj.group(2))
else:
    print("No match!!")

Result:

Total groups: 2
Data of all groups: ('Cats', 'smarter')
Whole matched string: Cats are smarter
Data of group 1: Cats
Data of group 2: smarter
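
One caveat is worth a quick sketch: lastindex is the index of the last group that actually matched, so it equals the group count only when every group participates. The second pattern below is a made-up example with two alternatives; the total count is read from the compiled pattern via the match's re attribute.

```python
import re

# Both groups participate, so lastindex equals the number of groups
m1 = re.match(r"(.*?) are (.*?) ", "Cats are smarter than dogs")
print(m1.lastindex, m1.re.groups)   # 2 2

# Only the first alternative participates: lastindex is 1,
# although the pattern defines 2 groups
m2 = re.match(r"(a)|(b)", "a")
print(m2.lastindex, m2.re.groups)   # 1 2
print(m2.groups())                  # ('a', None)
```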

Other methods of the re.Match object:

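start() returns the position where the match starts
end() returns the position where the match ends
span() returns a tuple containing the (start, end) positions of the match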
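
A minimal sketch of these position methods, reusing the sample sentence from the earlier example with a pattern picked only for illustration. Each method also accepts a group number, and the positions can be used directly as slice indices.

```python
import re

line = "Cats are smarter than dogs"
m = re.search(r"(\w+) than (\w+)", line)

print(m.span())                  # (9, 26)
print(m.start(), m.end())        # 9 26
print(m.span(2))                 # (22, 26), the span of group 2 ('dogs')
print(line[m.start():m.end()])   # smarter than dogs
```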

Matching mobile phone numbers

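The main three-digit prefixes of mobile numbers currently in use are:

China Telecom: 133, 153, 180, 181, 189, 173, 177, 149
China Unicom: 130, 131, 132, 155, 156, 185, 186, 145, 176
China Mobile: 134, 135, 136, 137, 138, 139, 150, 151, 152, 157, 158, 159, 182, 183, 184, 147, 178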

The rules are:

First digit: 1
Second digit: 3, 4, 5, 7, 8
Third digit: depends on the second digit
3 + [0-9]
4 + [5, 7, 9]
5 + [0-9] except 4
7 + [0-9] except 4 and 9
8 + [0-9]

A rough match for mobile numbers (11 digits, with the first 2 digits following the rules above):

"1[34578]\d{9}"

A more precise match (11 digits, with the first 3 digits following the rules above):

"1(?:[38]\d|4[579]|5[0-35-9]|7[0-35-8])\d{8}"

Testing the more precise pattern:

```python
import re
import random


def random_number(nums):
    # Build a string of `nums` random decimal digits
    result = ""
    for x in range(nums):
        result += str(random.randint(0, 9))
    return result


nums = [
    133, 153, 180, 181, 189, 173, 177, 149,
    130, 131, 132, 155, 156, 185, 186, 145, 176, 185,
    134, 135, 136, 137, 138, 139, 150, 151, 152, 157, 158, 159,
    182, 183, 184, 147, 178
]
for num in nums:
    # Prefix + 8 random digits must match the whole 11-digit pattern
    print(re.fullmatch(r"1([38]\d|4[579]|5[0-35-9]|7[0-35-8])\d{8}",
                       f"{num}{random_number(8)}").group(), end=",")
```

Result (the last 8 digits are random, so they differ on every run):

```
13320640138,15352178619,18010467102,18124689139,18975065050,17380280568,
17798275371,14994833499,13068873816,13151192893,13289047370,15594125464,
15648216940,18574982445,18643788553,14516397708,17616874062,18559031583,
13443533383,13596265766,13629806068,13745249866,13896644123,13954817486,
15076523907,15182868824,15229880699,15794102747,15852468936,15938064514,
18297190705,18304331736,18402303981,14751356440,17847872471,
```

Every number matched. Now test an invalid phone number:

pattern = r"1([38]\d|4[579]|5[0-35-9]|7[0-35-8])\d{8}"
tel_number = "15452468936"
print(re.fullmatch(pattern, tel_number))

Result:

None

As expected, the match fails, because 154 is not a valid prefix.
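
Because the test above uses random digits, a deterministic cross-check can be useful. The sketch below transcribes the rule list into a dictionary (RULES is a name introduced here, not part of the original) and verifies that the precise pattern accepts exactly the three-digit prefixes those rules allow.

```python
import re

PATTERN = re.compile(r"1(?:[38]\d|4[579]|5[0-35-9]|7[0-35-8])\d{8}")

# Allowed third digits for each second digit, transcribed from the rules
RULES = {
    "3": "0123456789",
    "4": "579",
    "5": "012356789",   # 0-9 except 4
    "7": "01235678",    # 0-9 except 4 and 9
    "8": "0123456789",
}

# Prefixes the rules allow
allowed = {f"1{second}{third}"
           for second, thirds in RULES.items()
           for third in thirds}

# Prefixes the regex accepts, padded with 8 zeros to form 11 digits
accepted = {str(prefix) for prefix in range(100, 200)
            if PATTERN.fullmatch(f"{prefix}00000000")}

print(accepted == allowed)   # True
print(len(accepted))         # 40
```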

Matching email addresses

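An email address has the form user@mail.server.name, that is, name + @ + site. Most common addresses end in @xxx.com, but some unusual ones use third-level or even fourth-level domains, for example: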

@SEED.NET.TW @TOPMARKEPLG.COM.TW @wilnetonline.net @cal3.vsnl.net.in

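These are only a small minority, since most addresses use a second-level domain, but that is no reason to let them fail to match. For example, we treat all of the following as valid email addresses: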

emails = [
    'someone@gmail.com',
    'bill.gates@microsoft.com',
    'mr-bob@example.com',
    'someone@SEED.NET.TW',
    'chuck.gt@cal3.vsnl.net.in'
]

The regex can be written as:

r"[a-z.-]+@[a-zA-Z0-9]+(\.[a-zA-Z]+){1,3}"

Test:

for email in emails:
    match_obj = re.match(r"[a-z.-]+@[a-zA-Z0-9]+(\.[a-zA-Z]+){1,3}", email, re.I)
    if match_obj:
        print(match_obj.group(0))

Result:

someone@gmail.com
bill.gates@microsoft.com
mr-bob@example.com
someone@SEED.NET.TW
chuck.gt@cal3.vsnl.net.in

While we are at it, test a string that is not an email address:

print(re.match(r"[a-z.-]+@[a-zA-Z0-9]+(\.[a-zA-Z]+){1,3}",
               'bob#example.com', re.I))

The string fails validation, and the printed result is None.
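
Two limits of this check are easy to overlook, so here is a hedged sketch of them. re.match anchors only the start of the string, and the name part [a-z.-] contains no digits. WIDER is a hypothetical variant introduced for illustration, not a complete email validator.

```python
import re

# Pattern from the text (domain class written as [a-zA-Z0-9])
PATTERN = r"[a-z.-]+@[a-zA-Z0-9]+(\.[a-zA-Z]+){1,3}"

# re.match ignores whatever follows the matched part
loose = re.match(PATTERN, "someone@gmail.com!!!", re.I)
print(loose.group())     # someone@gmail.com

# re.fullmatch rejects the trailing characters
strict = re.fullmatch(PATTERN, "someone@gmail.com!!!", re.I)
print(strict)            # None

# A digit in the name part is not accepted by [a-z.-]
digits = re.fullmatch(PATTERN, "user1@example.com", re.I)
print(digits)            # None

# Hypothetical wider name class that also allows digits and underscores
WIDER = r"[a-z0-9._-]+@[a-zA-Z0-9]+(\.[a-zA-Z]+){1,3}"
wider = re.fullmatch(WIDER, "user1@example.com", re.I)
print(wider.group())     # user1@example.com
```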

Referencing groups while matching

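As noted earlier in the table of regex matching rules:
\number refers to the string matched by the group numbered number
Example:
The IT back office has a batch of username:password strings, and the department wants to find the users whose password is identical to their username so it can remind them to change it: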

users = [
"user1:password",
"user2:user2",
"user3:password",
"user4:password",
"user5:password",
"user6:user6",
"user7:password",
"user8:user8",
"user9:password",
"user10:user10"
]

This is where referencing a group inside the pattern becomes very convenient:

for user in users:
    match_obj = re.match(r"(\w+):\1", user, re.A)
    if match_obj:
        print(match_obj.group(1))

Result:

user2
user6
user8
user10

As the output shows, the users whose password is identical to their username were extracted successfully.
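
One edge case is not covered by the sample data: since re.match anchors only the start, a password that merely begins with the username also matches. The sketch below uses a made-up entry, "user1:user10", to show the effect and how re.fullmatch avoids it. It also shows the named form of a back-reference, (?P=name), which Python supports alongside \1.

```python
import re

tricky = "user1:user10"   # hypothetical entry: password only starts with the username

# re.match stops after the repeated 'user1' and reports a match
loose = re.match(r"(\w+):\1", tricky, re.A)
print(loose.group())          # user1:user1

# re.fullmatch requires the password to equal the username exactly
strict = re.fullmatch(r"(\w+):\1", tricky, re.A)
print(strict)                 # None

# The same back-reference written with a named group
named = re.fullmatch(r"(?P<name>\w+):(?P=name)", "user2:user2", re.A)
print(named.group("name"))    # user2
```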