Python - Lesson 8: Regular Expressions - "Programming Languages"

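(I) The re module
- (1) match
- (2) The group() method
(II) Single-character matching in regular expressions
(III) Basic functions of the re module
(IV) Greedy and non-greedy modes
- (1) Introduction
- (2) Examples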

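(I) The re module

re is Python's regular expression module. It is used as follows:

(1) match

re.match(pattern, string, flags=0)

re.match tries to match a pattern at the start of the string; if the pattern does not match at the start, match() returns None.
re.match is mainly used to check whether a string conforms to a pattern: if the string matches the regular expression, match() returns a match object; otherwise it returns None.
The comparison proceeds item by item from left to right.
flags is an optional argument that controls how the pattern is matched, for example whether matching is case-sensitive or multi-line.

The available flags are listed below. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags.

Flag	Description
re.I	Makes matching case-insensitive
re.L	Locale-aware matching (in Python 3, only valid with bytes patterns)
re.M	Multi-line matching: ^ and $ also match at the start and end of each line, not only of the whole string
re.S	Makes . match any character, including a newline
re.U	Interprets characters according to the Unicode character set; affects \w, \W, \b, \B (already the default for str patterns in Python 3)
re.X	Verbose mode: allows whitespace and comments inside the pattern so that it is easier to read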
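
The flags in the table can be tried out with a short sketch such as the following. The sample text and the variable names are made up for illustration.

```python
import re

text = "Hello\nworld"

# re.I: case-insensitive matching
ignore_case = re.match(r"hello", text, re.I)

# re.M: ^ and $ also match at the start and end of each line
line_starts = re.findall(r"^\w+", text, re.M)

# re.S: "." also matches the newline character
dot_all = re.match(r"Hello.world", text, re.S)

# Flags are combined with the bitwise OR operator
combined = re.findall(r"^[a-z]+$", text, re.I | re.M)

# re.X: whitespace and comments inside the pattern are ignored
verbose = re.compile(r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
""", re.X)

print(ignore_case.group())                        # Hello
print(line_starts)                                # ['Hello', 'world']
print(dot_all is not None)                        # True
print(combined)                                   # ['Hello', 'world']
print(verbose.match("555-1234") is not None)      # True
```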

(2) The group() method

Returns the part of the string that was matched.
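
Because match() returns None when nothing matches, calling group() directly on a failed match raises AttributeError. A minimal sketch of the usual guard is shown below; first_word is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

def first_word(text):
    """Return the leading run of word characters, or None if there is none."""
    rs = re.match(r"\w+", text)
    # match() returns None on failure, so check before calling group()
    return rs.group() if rs is not None else None

print(first_word("hello world"))  # hello
print(first_word("!hello"))       # None
```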

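(II) Single-character matching in regular expressions

(1) Metacharacters that match a single character

Character	Description
.	Matches any single character except "\n"
\d	Matches one digit from 0 to 9; equivalent to [0-9] for ASCII text
\D	Matches one non-digit character; equivalent to [^0-9] for ASCII text
\s	Matches any whitespace character, such as a space, tab "\t", or newline "\n"
\S	Matches any non-whitespace character
\w	Matches any word character (including underscore), such as a-z, A-Z, 0-9; for str patterns in Python 3 it also matches Unicode letters such as "中"
\W	Matches any non-word character; equivalent to [^a-zA-Z0-9_] when matching ASCII only
[]	Matches any one of the characters listed inside []
^	Negation when it is the first character inside [], e.g. [^0-9]

For example: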

#!/home/python/bin/bin/python3
import re
rs = re.match(".","a")
print(rs.group())
rs = re.match(".","1")
print(rs.group())
rs = re.match("...","abc")
print(rs.group())
rs = re.match(".","\n")
print(rs)
# \s (raw strings pass the backslash through to the regex engine unchanged)
rs = re.match(r"\s","\t")
print(rs)
rs = re.match(r"\s","\n")
print(rs)
rs = re.match(r"\s"," ")
print(rs)
# \S
rs = re.match(r"\S","\t")
print(rs)
rs = re.match(r"\S","abc")
print(rs)
# \w
rs = re.match(r"\w","a")
print(rs)
rs = re.match(r"\w","A")
print(rs)
rs = re.match(r"\w","1")
print(rs)
rs = re.match(r"\w","_")
print(rs)
rs = re.match(r"\w","中")
print(rs)
rs = re.match(r"\w","*")  # not a word character
print(rs)
#[]
rs = re.match("[Hh]","hello")
print(rs)
rs = re.match("[Hh]","Hello")
print(rs)
rs = re.match("[0123456789]","32")
print(rs)
# Equivalent, written as a range
rs = re.match("[0-9]","3")
print(rs)

Output:

a
1
abc
None
<re.Match object; span=(0, 1), match='\t'>
<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(0, 1), match=' '>
None
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='A'>
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='_'>
<re.Match object; span=(0, 1), match='中'>
None
<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(0, 1), match='H'>
<re.Match object; span=(0, 1), match='3'>
<re.Match object; span=(0, 1), match='3'>

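(2) Metacharacters that express quantity

Character	Description
*	The preceding item may occur any number of times, including zero
+	The preceding item occurs at least once
?	The preceding item occurs at most once (zero or one time)
{m}	The preceding item occurs exactly m times
{m,}	The preceding item occurs at least m times
{,n}	The preceding item occurs at most n times
{m,n}	The preceding item occurs between m and n times

For example: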

#!/home/python/bin/bin/python3
import re
# Quantifiers:
# * any number of times
rs = re.match(r"1\d*","1234567")
print(rs.group())
rs = re.match(r"1\d*","1234567abc")
print(rs.group())
# + at least once
rs = re.match(r"\d+","abc")
print(rs)
rs = re.match(r"\d+","1abc")
print(rs)
rs = re.match(r"\d+","123345abc")
print(rs)
# ? at most once (zero or one time)
rs = re.match(r"\d?","abc")
print(rs)
rs = re.match(r"\d?","123abc")
print(rs)
# {m} exactly m times
rs = re.match(r"\d{3}","123abc")
print(rs)
# {m,}
rs = re.match(r"\d{1,}","123467abc")  # equivalent to +, at least once
print(rs)
# {m,n}
rs = re.match(r"\d{0,1}","abc")  # equivalent to ?, at most once
print(rs)
# Match an 11-digit mobile phone number:
# 11 digits; the first is 1, the second is 3, 5, 7 or 8, and the 3rd to 11th are digits 0-9
rs = re.match(r"1[3578]\d{9}","13623198765")
print(rs)
rs = re.match(r"1[3578]\d{9}","14623198765")  # invalid number: the second digit is 4
print(rs)
rs = re.match(r"1[3578]\d{9}","13623198765abc")  # invalid number, but the first 11 characters still match because the end is not anchored
print(rs)

Output:

1234567
1234567
None
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 6), match='123345'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(0, 6), match='123467'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 11), match='13623198765'>
None
<re.Match object; span=(0, 11), match='13623198765'>
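
The last result above is a match even though the input has trailing letters, because re.match only anchors the start of the string. One way to require that the whole string matches is re.fullmatch (available since Python 3.4). The sketch below assumes the same phone pattern; is_phone is a hypothetical helper name.

```python
import re

PHONE = r"1[3578]\d{9}"

def is_phone(text):
    # fullmatch() succeeds only if the entire string matches the pattern
    return re.fullmatch(PHONE, text) is not None

print(is_phone("13623198765"))     # True
print(is_phone("14623198765"))     # False
print(is_phone("13623198765abc"))  # False
```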

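(3) Metacharacters for string and word boundaries

Character	Description
^	Matches the start of a string
$	Matches the end of a string
\b	Matches a word boundary
\B	Matches a non-word boundary

Note: \b matches a position where the characters before and after it are not both \w (letters, digits, or underscore): one is a word character and the other is not, or does not exist.

For example: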

#!/home/python/bin/bin/python3
import re
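# Handling escape characters
str1 = "hello\\world"
print(str1)
str2 = "hello\\\\world"
print(str2)
str3 = r"hello\\world"  # raw string
print(str3)
# Without a raw string, each literal backslash needs four backslashes in the pattern
rs = re.match("\\w{5}\\\\\\\\\\w{5}",str3)
print(rs)
# With a raw string, two backslashes in the pattern match one literal backslash
rs = re.match(r"\w{5}\\\\\w{5}",str3)
print(rs)
# Boundaries
# $ matches the end of the string
rs = re.match(r"1[3578]\d{9}$","13623456767")
print(rs.group())
rs = re.match(r"1[3578]\d{9}$","13623456767abc")
print(rs)
# Email address matching
rs = re.match(r"\w{3,10}@163\.com$","hello_124@163.com")
print(rs)
rs = re.match(r"\w{3,10}@163\.com$","he@163.com")  # the local part is shorter than 3 characters
print(rs)
rs = re.match(r"\w{3,10}@163\.com$","hello_124@163mcom")  # "." is escaped, so "m" does not match
print(rs)
# \b word boundary
rs = re.match(r".*\bpython\b","hi python hello")
print(rs)
# \B non-word boundary
rs = re.match(r".*\Bth\B","hi python hello")
print(rs)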

Output:

hello\world
hello\\world
hello\\world
<re.Match object; span=(0, 12), match='hello\\\\world'>
<re.Match object; span=(0, 12), match='hello\\\\world'>
13623456767
None
<re.Match object; span=(0, 17), match='hello_124@163.com'>
None
None
<re.Match object; span=(0, 9), match='hi python'>
<re.Match object; span=(0, 7), match='hi pyth'>

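(4) Metacharacters for grouping

Character	Description
|	Alternation ("or"): matches either the expression on its left or the one on its right
()	Treats the characters inside the parentheses as a group
\NUM	Used together with groups (): refers to the text matched by group number NUM
(?P<NAME>...)	Gives the group a name
(?P=NAME)	Refers to the string matched by the group with the given name
(?:pattern)	Matches pattern but does not save it as a group: a non-capturing match that cannot be referred to later by group number. This form is useful when combining alternatives with |.

For example: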

#!/home/python/bin/bin/python3
import re
# () grouping
rs = re.match(r"\w{3,10}@(163|qq|outlook)\.com$","hello@163.com")
print(rs)
rs = re.match(r"\w{3,10}@(163|qq|outlook)\.com$","1234567@qq.com")
print(rs)
# \NUM backreference
html_str = "<head><title>python</title></head>"
rs = re.match(r"<.+><.+>.+</.+></.+>",html_str)
print(rs)
html_str2 = "<head><title>python</head></title>"
rs = re.match(r"<.+><.+>.+</.+></.+>",html_str2)
print(rs)
rs = re.match(r"<(.+)><(.+)>.+</\2></\1>",html_str)
print(rs)
rs = re.match(r"<(.+)><(.+)>.+</\2></\1>",html_str2)
print(rs)
rs = re.match(r"<(?P<g1>.+)><(?P<g2>.+)>.+</(?P=g2)></(?P=g1)>",html_str)
print(rs)

Output:

<re.Match object; span=(0, 13), match='hello@163.com'>
<re.Match object; span=(0, 14), match='1234567@qq.com'>
<re.Match object; span=(0, 34), match='<head><title>python</title></head>'>
<re.Match object; span=(0, 34), match='<head><title>python</head></title>'>
<re.Match object; span=(0, 34), match='<head><title>python</title></head>'>
None
<re.Match object; span=(0, 34), match='<head><title>python</title></head>'>
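
Besides back-referencing inside the pattern, captured groups can be read from the match object by number or by name. The following is a small sketch; the group names outer, inner and text are chosen here purely for illustration.

```python
import re

html_str = "<head><title>python</title></head>"
pattern = r"<(?P<outer>\w+)><(?P<inner>\w+)>(?P<text>.+)</(?P=inner)></(?P=outer)>"
rs = re.match(pattern, html_str)

print(rs.group(1))        # head
print(rs.group("inner"))  # title
print(rs.group("text"))   # python
print(rs.groupdict())     # {'outer': 'head', 'inner': 'title', 'text': 'python'}
```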

An example of a group that is not saved in the group results (a non-capturing group):

#!/home/python/bin/bin/python3
import re
# Match industry or industries
s="industry"
rs=re.match("industr(?:y|ies)",s)
print(rs)

Output:

<re.Match object; span=(0, 8), match='industry'>

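(5) Positive and negative lookahead, positive and negative lookbehind

Character	Description
(?=pattern)	Positive lookahead assertion: the matched text must be immediately followed by text that matches pattern.
(?!pattern)	Negative lookahead assertion: the matched text must not be immediately followed by text that matches pattern.
(?<=pattern)	Positive lookbehind assertion: the matched text must be immediately preceded by text that matches pattern.
(?<!pattern)	Negative lookbehind assertion: the matched text must not be immediately preceded by text that matches pattern.

Examples: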

import re
# Find windows only if it is followed by 95, 98, NT or 2000
rs = re.match("windows(?=95|98|NT|2000)","windows2000")
print(rs)
# Find windows only if it is not followed by 95, 98, NT or 2000
rs=re.match("windows(?!95|98|NT|2000)","windows3.1")
print(rs)
# Find windows only if it is preceded by 2000
rs=re.search("(?<=2000)windows", "882000windows")
print(rs)
# Find windows only if it is not preceded by 2000
rs=re.search("(?<!2000)windows","abcwindows")
print(rs)

Output:

<re.Match object; span=(0, 7), match='windows'>
<re.Match object; span=(0, 7), match='windows'>
<re.Match object; span=(6, 13), match='windows'>
<re.Match object; span=(3, 10), match='windows'>

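Note: a lookbehind assertion requires pattern to have a fixed width, so the following expression is invalid:

# The alternatives are 4, 2 and 3 characters wide, which is not a fixed width: error
rs=re.search("(?<=2000|NT|985)windows", "882000windows")

Running it raises an error:

raise error("look-behind requires fixed-width pattern")

The following expression, by contrast, is valid, because every alternative has the same width:

rs = re.search('(?<=2000|3000|4000)windows','2000windows')

Output: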

<re.Match object; span=(4, 11), match='windows'>
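
The fixed-width rule can be checked without crashing the script by catching re.error. This is a minimal sketch; try_compile is a hypothetical helper written for this illustration.

```python
import re

def try_compile(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print("invalid pattern:", exc)
        return False

fixed = try_compile(r"(?<=2000|3000|4000)windows")  # every alternative is 4 wide
variable = try_compile(r"(?<=2000|NT|985)windows")  # widths 4, 2 and 3
print(fixed, variable)  # True False
```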

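(6) Operator precedence in regular expressions

Regular expressions are evaluated from left to right and follow an order of precedence, much like arithmetic expressions.
Operators of equal precedence are evaluated from left to right; operators of different precedence are evaluated from highest to lowest. The table below lists the regular expression operators from highest to lowest precedence.

Operator	Description
(), (?:), (?=), []	Parentheses and brackets
*, +, ?, {n}, {n,m}	Quantifiers
^, $, \metacharacter, any character	Anchors and sequences (i.e. position and order)
|	Alternation ("or")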
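
The effect of these precedence rules can be observed directly. The sketch below uses made-up input strings: a quantifier applies only to the item immediately before it unless parentheses widen its scope, and alternation splits the whole pattern unless it is grouped.

```python
import re

# Quantifiers bind tighter than concatenation: "ab*" is "a" followed by "b*"
quantifier_char = re.match(r"ab*", "ababab").group()
quantifier_group = re.match(r"(ab)*", "ababab").group()

# Alternation has the lowest precedence: "^ab|cd$" means "(^ab)|(cd$)"
loose = re.search(r"^ab|cd$", "abXYZ") is not None
grouped = re.search(r"^(ab|cd)$", "abXYZ") is not None

print(quantifier_char)   # ab
print(quantifier_group)  # ababab
print(loose, grouped)    # True False
```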

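(III) Basic functions of the re module

(1) The re.search method

re.search scans the whole string and returns the first successful match, whereas re.match only matches at the start of the string. This is the key difference between the two.
The function signature is:

re.search(pattern, string, flags=0)

pattern: the regular expression
string: the string to search
flags: flags that control how the pattern is matched

If the match succeeds, a match object is returned; otherwise None is returned.
The match object methods group(num) and groups() retrieve the matched text.

Match object method	Description
group(num=0)	Returns the string matched by the whole expression (num=0) or by group num. group() also accepts several group numbers at once, in which case it returns a tuple of the corresponding group values.
groups()	Returns a tuple containing the strings of all groups, from group 1 up to the last group

For example: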

#!/home/python/bin/bin/python3
import re
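# Search for the first substring that matches the pattern
rs = re.search(r'\d\.\S+','1.Sam 2.Smith 3.John 4.Black')
# Print the result
print(rs)
print(rs.group(0))
print("--------------------------")
# Search with a pattern that contains three groups
rs = re.search(r'(\d\.\w+\s+)(\d\.\w+\s+)(.*)','1.Sam 2.Smith 3.John 4.Black')
print(rs)
# Show the results as a tuple with one element per group
print(rs.groups())
# Show the whole match as a single string
print(rs.group(0))
# Show the match for the first group
print(rs.group(1))
# Show the match for the second group
print(rs.group(2))
# Show the match for the third group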
print(rs.group(3))

Output:

<re.Match object; span=(0, 5), match='1.Sam'>
1.Sam
--------------------------
<re.Match object; span=(0, 28), match='1.Sam 2.Smith 3.John 4.Black'>
('1.Sam ', '2.Smith ', '3.John 4.Black')
1.Sam 2.Smith 3.John 4.Black
1.Sam 
2.Smith 
3.John 4.Black

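Difference between re.match and re.search

re.match only matches at the start of the string: if the start of the string does not fit the pattern, the match fails and the function returns None. re.search scans through the whole string until it finds a match.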
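
The difference is easy to see on a string whose first character does not fit the pattern. A short sketch with a made-up input string:

```python
import re

text = "order 66 shipped"

by_match = re.match(r"\d+", text)    # the string starts with "o", so this fails
by_search = re.search(r"\d+", text)  # scans forward and finds "66"
anchored = re.search(r"^\d+", text)  # "^" restricts search to the start, like match

print(by_match)           # None
print(by_search.group())  # 66
print(by_search.span())   # (6, 8)
print(anchored)           # None
```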

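(2) findall

findall searches the target string for every substring that matches the pattern and returns the results in a list. If nothing matches, it returns an empty list.

Unlike match and search, which stop after one match, findall matches every non-overlapping substring that satisfies the pattern and puts each result in the list.

Its signature is:

re.findall(pattern, string, flags=0)

For example: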

#!/home/python/bin/bin/python3
import re
s='<param name="File" value="${catalina.home}/logs/store/application.log" />'
rs=re.findall(r'\b.*?=',s)
print(rs)
rs=re.findall(r'\bparam\b',s)
print(rs)

Output:

['param name=', 'File" value=']
['param']

As shown, every substring that satisfies the pattern is placed in the list.
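
One detail worth knowing is that the shape of the findall result depends on the capturing groups in the pattern: with no group it returns whole matches, with one group it returns that group's text, and with several groups it returns tuples. A sketch with a shortened, made-up input string:

```python
import re

s = '<param name="File" value="app.log" />'

no_group = re.findall(r'\w+="[^"]*"', s)
one_group = re.findall(r'(\w+)="[^"]*"', s)
two_groups = re.findall(r'(\w+)="([^"]*)"', s)

print(no_group)    # ['name="File"', 'value="app.log"']
print(one_group)   # ['name', 'value']
print(two_groups)  # [('name', 'File'), ('value', 'app.log')]
```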

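(3) finditer

Like findall, finditer finds all substrings that match the regular expression, but it returns them as an iterator of match objects.
Signature:

re.finditer(pattern, string, flags=0)

Parameters:

Parameter	Description
pattern	The regular expression to match
string	The string to search
flags	Flags that control how the pattern is matched, e.g. case sensitivity or multi-line matching

For example: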

import re
# Extract every word from the string as an iterator of match objects
rs=re.finditer(r'\b\w+?\b','<param name="File" value="${catalina.home}/logs/store/application.log')
# Print each match object in the iterator
for item in rs:
    print(item)

Output:

<re.Match object; span=(1, 6), match='param'>
<re.Match object; span=(7, 11), match='name'>
<re.Match object; span=(13, 17), match='File'>
<re.Match object; span=(19, 24), match='value'>
<re.Match object; span=(28, 36), match='catalina'>
<re.Match object; span=(37, 41), match='home'>
<re.Match object; span=(43, 47), match='logs'>
<re.Match object; span=(48, 53), match='store'>
<re.Match object; span=(54, 65), match='application'>
<re.Match object; span=(66, 69), match='log'>
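
Since finditer yields match objects rather than plain strings, the position of each match is available as well. A minimal sketch on a shortened, made-up input:

```python
import re

text = "logs/store/application.log"

words = []
for m in re.finditer(r"\w+", text):
    # Each item is a match object, so start() and end() give its position
    words.append((m.group(), m.start(), m.end()))

print(words)
# [('logs', 0, 4), ('store', 5, 10), ('application', 11, 22), ('log', 23, 26)]
```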

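(4) split

split cuts the string at every substring that matches the pattern and returns the pieces as a list. It is used as follows:

re.split(pattern, string, maxsplit=0, flags=0)

Parameters:

Parameter	Description
pattern	The regular expression to match
string	The string to split
maxsplit	Maximum number of splits; maxsplit=1 splits once. The default 0 means no limit
flags	Flags that control how the pattern is matched, e.g. case sensitivity or multi-line matching

For example: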

import re
s="Long,Long,ago,there,is,a, very,      good,,position"
# Split the string on one or more commas followed by any non-word characters
rs=re.split(r",+\W*",s)
print(rs)

Output:

['Long', 'Long', 'ago', 'there', 'is', 'a', 'very', 'good', 'position']
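
The maxsplit parameter from the table limits how many cuts are made, and a capturing group in the pattern keeps the separators in the result. A short sketch with made-up inputs:

```python
import re

s = "Long,Long,ago,there,is,a, very"

# Stop after two splits; the remainder is returned unsplit
limited = re.split(r",+\W*", s, maxsplit=2)
print(limited)  # ['Long', 'Long', 'ago,there,is,a, very']

# With a capturing group in the pattern, the separators are kept in the list
kept = re.split(r"(,)\s*", "a, b,c")
print(kept)     # ['a', ',', 'b', ',', 'c']
```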

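(5) sub: search and replace

Python's re module provides re.sub for replacing the matches in a string.
Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern: the regular expression
repl: a string or a function that supplies the replacement for each match
string: the original string in which matches are to be replaced
count: the maximum number of replacements; the default 0 replaces all matches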

#!/home/python/bin/bin/python3
import re
phone = "2018-109-109 # This is a foreign phone number"
# Strip the comment
num = re.sub('#.*$',"",phone)
print("The phone number is:{}".format(num))
# Remove every non-digit character, such as "-"
num=re.sub(r'\D',"",num)
print("The phone number is:{}".format(num))

Output:

The phone number is:2018-109-109 
The phone number is:2018109109

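re.sub also accepts a function as the repl argument (the second parameter). In that case, sub passes the match object for each match of pattern to that function and uses the function's return value as the replacement. For example: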

#!/home/python/bin/bin/python3
import re
# Double the number in the match and return it as a string
def double_number(matched):
    value = int(matched.group('value'))
    return str(value*2)
number = '12abcU890KK972L1'
# Find each run of digits and replace it with twice its value
rs = re.sub(r'(?P<value>\d+)',double_number, number)
print(rs)

Output:

24abcU1780KK1944L2
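
When repl is a string, it may refer to captured groups with \1, \2 and so on, and the count parameter limits the number of replacements. A minimal sketch with a made-up date string:

```python
import re

date = "2018-10-09 and 2019-01-02"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# \1, \2, \3 in the replacement refer to the captured groups
swapped = re.sub(pattern, r"\3/\2/\1", date)
print(swapped)     # 09/10/2018 and 02/01/2019

# count=1 replaces only the first match
first_only = re.sub(pattern, r"\3/\2/\1", date, count=1)
print(first_only)  # 09/10/2018 and 2019-01-02
```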

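(IV) Greedy and non-greedy modes

(1) Introduction

Greedy mode
The regular expression engine is greedy by default: it matches as many characters as possible
Non-greedy mode
- The opposite of greedy mode: it matches as few characters as possible
- Appending ? to a quantifier ("*", "?", "+", "{m,n}") turns it from greedy into non-greedy

(2) Examples

In greedy mode: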

#!/home/python/bin/bin/python3
import re
rs = re.findall(r"hello\d*","hello12345")
print(rs)
rs = re.findall(r"hello\d+","hello12345")
print(rs)
rs = re.findall(r"hello\d?","hello12345")
print(rs)
rs = re.findall(r"hello\d{2,}","hello12345")
print(rs)
rs = re.findall(r"hello\d{1,3}","hello12345")
print(rs)

Output:

['hello12345']
['hello12345']
['hello1']
['hello12345']
['hello123']

In non-greedy mode:

#!/home/python/bin/bin/python3
import re
# Appending ? to a quantifier (*, ?, +, {m,n}) makes it non-greedy
rs = re.findall(r"hello\d*?","hello12345")
print(rs)
rs = re.findall(r"hello\d+?","hello12345")
print(rs)
rs = re.findall(r"hello\d??","hello12345")
print(rs)
rs = re.findall(r"hello\d{2,}?","hello12345")
print(rs)
rs = re.findall(r"hello\d{1,3}?","hello12345")
print(rs)

Output:

['hello']
['hello1']
['hello']
['hello12']
['hello1']

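As the output shows, non-greedy mode matches as few characters as possible. Here, because nothing follows the quantifier in the pattern, each one stops at its lower bound.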
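
The difference matters most when something follows the quantifier in the pattern. A common illustration, sketched here with a made-up HTML fragment, is extracting tags: the greedy form runs to the last ">" in the string, while the non-greedy form stops at the first one.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)   # ".+" extends to the last ">"
lazy = re.findall(r"<.+?>", html)    # ".+?" stops at the nearest ">"

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```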
