Simple Metacharacters
Metacharacters
Metacharacters are what make regular expressions more powerful than normal string methods.
They allow you to create regular expressions to represent concepts like “one or more repetitions of a vowel”.
The existence of metacharacters poses a problem if you want to create a regular expression (or regex) that matches a literal metacharacter, such as “$”. You can do this by escaping the metacharacter, that is, by putting a backslash in front of it.
However, this can cause problems, since backslashes also have an escaping function in normal Python strings. This can mean putting three or four backslashes in a row to do all the escaping.
To avoid this, you can use a raw string, which is a normal string with an “r” in front of it. We saw usage of raw strings in the previous lesson.
The r prefix marks a raw string in which backslashes are not treated as escapes, so “\” is read as a literal character.
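As a small sketch of the escaping problem described above, the snippet below compares a normal string with a raw string for matching a literal “$”. The sample text and variable names are made up for illustration.

```python
import re

text = "Total: $5"

# Without a raw string, the backslash itself must be escaped for Python.
plain_pattern = "\\$5"
# With a raw string, the backslash reaches the regex engine unchanged.
raw_pattern = r"\$5"
same_pattern = plain_pattern == raw_pattern

escaped_match = re.search(raw_pattern, text)
# An unescaped $ is an end-of-string anchor, so "$5" cannot match here.
unescaped_match = re.search(r"$5", text)

print(same_pattern)           # True
print(escaped_match.group())  # $5
print(unescaped_match)        # None
```

Both spellings produce the same pattern; the raw string is simply easier to read.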
- The first metacharacter we will look at is . (dot).
This matches any single character other than a newline.
import re
pattern = r"gr.y"
if re.match(pattern, "grey"):
    print("Match 1")
if re.match(pattern, "gray"):
    print("Match 2")
if re.match(pattern, "blue"):
    print("Match 3")
Only Match 1 and Match 2 are printed, which shows that “.” stands for any single character.
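To illustrate the “other than a newline” part, here is a minimal sketch using the standard re.DOTALL flag, which the text above does not cover. The test strings are invented for the example.

```python
import re

pattern = r"gr.y"

# By default, "." does not match a newline character.
newline_default = re.match(pattern, "gr\ny")
# With the re.DOTALL flag, "." matches a newline as well.
newline_dotall = re.match(pattern, "gr\ny", re.DOTALL)
# "." stands for exactly one character, so "gry" is too short.
too_short = re.match(pattern, "gry")

print(newline_default)             # None
print(newline_dotall is not None)  # True
print(too_short)                   # None
```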
- The next two metacharacters are ^ and $.
These match the start and end of a string, respectively.
pattern = r"^gr.y$"
if re.match(pattern, "grey"):
    print("Match 1")
if re.match(pattern, "gray"):
    print("Match 2")
if re.match(pattern, "agrsy"):
    print("Match 3")
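“^”: matches the start position, so whatever follows ^ must appear at the beginning of the string.
“$”: matches the end position of the input string.
Note: re.match only matches at the start of the string.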
The pattern “^gr.y$” means that the string should start with gr, continue with any single character except a newline, and end with y.
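The effect of the anchors is easiest to see with re.search, which scans the whole string. The following is a rough sketch; re.fullmatch and the sample words are additions for comparison, not part of the lesson above.

```python
import re

# Without anchors, search finds "grsy" in the middle of the string.
unanchored = re.search(r"gr.y", "agrsy")
# With ^ and $, the whole string has to fit the pattern.
anchored = re.search(r"^gr.y$", "agrsy")
# re.match only pins the start, so trailing text is allowed.
prefix_only = re.match(r"gr.y", "greyhound")
# re.fullmatch behaves as if the pattern were anchored at both ends.
whole = re.fullmatch(r"gr.y", "greyhound")

print(unanchored.group())   # grsy
print(anchored)             # None
print(prefix_only.group())  # grey
print(whole)                # None
```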
Character Classes
Character classes provide a way to match only one of a specific set of characters.
A character class is created by putting the characters it matches inside square brackets.
pattern = r"[aeiou]"
if re.search(pattern, "grey"):
    print("Match 1")
if re.search(pattern, "qwertyuiop"):
    print("Match 2")
if re.search(pattern, "rhythm myths"):  # contains none of a, e, i, o, u, so search returns None
    print("Match 3")
The pattern [aeiou] in the search function matches all strings that contain any one of the characters defined.
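To see which characters the class actually picks out, one option is re.findall, which returns every non-overlapping match. This is only an illustrative sketch reusing the strings from the example above.

```python
import re

pattern = r"[aeiou]"

# Each match is a single character from the set.
vowels_found = re.findall(pattern, "qwertyuiop")
no_vowels = re.findall(pattern, "rhythm myths")
first_vowel = re.search(pattern, "grey").group()

print(vowels_found)  # ['e', 'u', 'i', 'o']
print(no_vowels)     # []
print(first_vowel)   # e
```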
- Character classes can also match ranges of characters.
Some examples:
The class [a-z] matches any lowercase alphabetic character.
The class [G-P] matches any uppercase character from G to P.
The class [0-9] matches any digit.
Multiple ranges can be included in one class. For example, [A-Za-z] matches a letter of any case.
pattern = r"[A-Z][A-Z][0-9]"  # three characters: uppercase, uppercase, digit
if re.search(pattern, "LS8"):
    print("Match 1")
if re.search(pattern, "E3"):  # only two characters
    print("Match 2")
if re.search(pattern, "1ab"):
    print("Match 3")
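A short sketch of ranges in practice follows. The input strings and the hexadecimal-style class [0-9a-f] are assumptions chosen for illustration.

```python
import re

# Two uppercase letters followed by a digit, found anywhere in the text.
codes = re.findall(r"[A-Z][A-Z][0-9]", "LS8 E3 AB1 xy2")
# Several ranges in one class: digits plus the letters a to f.
hex_chars = re.findall(r"[0-9a-f]", "x1fz9")
# [A-Za-z] accepts a letter of either case.
mixed_case = re.fullmatch(r"[A-Za-z][A-Za-z][A-Za-z]", "aBc")

print(codes)                   # ['LS8', 'AB1']
print(hex_chars)               # ['1', 'f', '9']
print(mixed_case is not None)  # True
```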
- Place a ^ at the start of a character class to invert it.
This causes it to match any character other than the ones included.
Other metacharacters, such as $ and ., have no special meaning within character classes.
The metacharacter ^ has no special meaning unless it is the first character in a class.
“^”: at the start of a character class, it inverts the class.
pattern = r"[^A-Z]"  # any character that is not an uppercase letter
if re.search(pattern, "this is all quiet"):  # prints
    print("Match 1")
if re.search(pattern, "AbcdEfG123"):  # contains characters that are not uppercase letters
    print("Match 2")
if re.search(pattern, "THISISALL"):  # does not print
    print("Match 3")
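Note: The pattern [^A-Z] matches any character that is not an uppercase letter, so a string made up entirely of uppercase letters does not match.
Note that the ^ must be inside the brackets, as the first character, to invert the character class.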
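The rules above can be checked with a few one-liners. This is a minimal sketch; the phone-number string is invented, and re.sub is used only as a convenient way to show what an inverted class matches.

```python
import re

# An inverted class is handy for stripping everything except digits.
digits_only = re.sub(r"[^0-9]", "", "tel: 555-0199")
# When ^ is not the first character, it is an ordinary caret.
literal_caret = re.findall(r"[a^]", "a^b")
# Inside a class, . and $ are ordinary characters too.
literal_dot_dollar = re.findall(r"[.$]", "a.b$c")

print(digits_only)         # 5550199
print(literal_caret)       # ['a', '^']
print(literal_dot_dollar)  # ['.', '$']
```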
More Metacharacters
- Some more metacharacters are *, +, ?, { and }.
These specify numbers of repetitions.
The metacharacter * means “zero or more repetitions of the previous thing”. It tries to match as many repetitions as possible. The “previous thing” can be a single character, a class, or a group of characters in parentheses.
pattern = r"egg(spam)*"  # "spam" may appear zero or more times
if re.match(pattern, "egg"):
    print("Match 1")
if re.match(pattern, "eggspamspamegg"):
    print("Match 2")
if re.match(pattern, "spam"):
    print("Match 3")
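The string must start with egg.
*: matches the preceding subexpression zero or more times.
The content inside () is a subexpression. The parentheses themselves match nothing and restrict nothing; they only make the enclosed content count as a single expression.
The example above matches strings that start with "egg", followed by zero or more "spam"s.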
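The “as many repetitions as possible” behaviour can be inspected by looking at the matched text with .group(). The sketch below reuses the egg/spam pattern and adds a digit class as an extra, assumed example.

```python
import re

# * is greedy: it consumes both "spam"s, then stops before the final "egg".
greedy = re.match(r"egg(spam)*", "eggspamspamegg").group()
# Zero repetitions are fine as well.
zero_reps = re.match(r"egg(spam)*", "egg").group()
# The "previous thing" can be a character class.
digit_run = re.match(r"[0-9]*", "123abc").group()
# With no digits at the start, * still succeeds with an empty match.
empty_run = re.match(r"[0-9]*", "abc").group()

print(greedy)           # eggspamspam
print(zero_reps)        # egg
print(digit_run)        # 123
print(repr(empty_run))  # ''
```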
- The metacharacter + is very similar to *, except it means “one or more repetitions”, as opposed to “zero or more repetitions”.
+: matches the preceding subexpression one or more times.
```python
pattern = r"g+"
if re.match(pattern, "g"):
    print("Match 1")
if re.match(pattern, "ggggg"):
    print("Match 2")
if re.match(pattern, "abc"):
    print("Match 3")
```
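The difference between + and * shows up when there are no repetitions at all. A minimal sketch, with input strings chosen for illustration:

```python
import re

# + needs at least one "g", so there is no match at the start of "abc".
plus_none = re.match(r"g+", "abc")
# * accepts zero repetitions and matches the empty string instead.
star_empty = re.match(r"g*", "abc").group()
# Like *, + is greedy and takes the whole run of "g"s.
plus_run = re.match(r"g+", "gggggh").group()

print(plus_none)         # None
print(repr(star_empty))  # ''
print(plus_run)          # ggggg
```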
- The metacharacter **?** means "zero or one repetitions".
```python
pattern = r"ice(-)?cream"
if re.match(pattern, "ice-cream"):
    print("Match 1")
if re.match(pattern, "icecream"):
    print("Match 2")
if re.match(pattern, "sausages"):
    print("Match 3")
if re.match(pattern, "ice--cream"):
    print("Match 4")
```
?: matches the preceding subexpression zero or one time. In this example, “-” appears once or not at all, so only Match 1 and Match 2 are printed.
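The ? metacharacter also applies to a single character without parentheses. As an illustrative sketch, the hypothetical pattern colou?r accepts both spellings of the word:

```python
import re

pattern = r"colou?r"

# The "u" may appear once or not at all, but not twice.
american = re.fullmatch(pattern, "color")
british = re.fullmatch(pattern, "colour")
doubled = re.fullmatch(pattern, "colouur")

print(american is not None)  # True
print(british is not None)   # True
print(doubled)               # None
```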
- Curly Braces
Curly braces can be used to represent the number of repetitions between two numbers.
The regex {x,y} means “between x and y repetitions of something”.
Hence {0,1} is the same thing as ?.
If the first number is missing, it is taken to be zero. If the second number is missing, it is taken to be infinity.
```python
pattern = r"9{1,3}$"
if re.match(pattern, "9"):
    print("Match 1")
if re.match(pattern, "999"):
    print("Match 2")
if re.match(pattern, "9999"):
    print("Match 3")
```
“9{1,3}$” matches a string made up of one to three 9s.
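The missing-number cases can be sketched as follows. The exact-count form {3} is not mentioned above and is included here as an extra; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# First number missing: zero to three 9s, so the empty string fits.
up_to_three = re.fullmatch(r"9{,3}", "")
# Second number missing: two or more 9s, with no upper limit.
at_least_two = re.fullmatch(r"9{2,}", "99999")
too_few = re.fullmatch(r"9{2,}", "9")
# A single number means exactly that many repetitions.
exactly_three = re.fullmatch(r"9{3}", "9999")

print(up_to_three is not None)   # True
print(at_least_two is not None)  # True
print(too_few)                   # None
print(exactly_three)             # None
```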