# Groups
A group can be created by surrounding part of a regular expression with parentheses.
This means that a group can be given as an argument to metacharacters such as * and ?.
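As a small illustration of a metacharacter applied to a whole group, the sketch below uses a made-up pattern, egg(spam)*, in which * repeats the entire group rather than a single character. The pattern and the test strings are assumptions chosen for the example, not part of the original notes.

```python
import re

# Hypothetical pattern: "egg" followed by zero or more repetitions of "spam".
pattern = r"egg(spam)*"

for text in ["egg", "eggspamspamspamegg", "spam"]:
    # re.match only looks for a match at the start of the string.
    if re.match(pattern, text):
        print(f"{text!r}: match")
    else:
        print(f"{text!r}: no match")
```

The first two strings match because they begin with "egg"; the last one does not, since the group is optional but "egg" is not.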
The content of each group (the part of the match enclosed by its parentheses) can be accessed using the match object's group method.
A call of group(0) or group() returns the whole match.
A call of group(n), where n is greater than 0, returns the nth group from the left.
The method groups() returns all groups from 1 upward as a tuple.
```python
import re

pattern = r"a(bc)(de)(f(g)h)i"
match = re.match(pattern, "abcdefghijklmnop")
if match:
    print(match.group())   # the whole match: abcdefghi
    print(match.group(0))  # same as above, the whole match
    print(match.group(1))  # group 1: bc
    print(match.group(2))  # group 2: de
    print(match.group(3))  # group 3: fgh
    print(match.group(4))  # group 4: g
    print(match.groups())  # ('bc', 'de', 'fgh', 'g')
```
- There are several kinds of special groups.<br />
Two useful ones are **named groups** and **non-capturing groups**.<br />
**Named groups** have the format **`(?P<name>...)`**, where **name** is the name of the group, and **...** is the content. They behave exactly the same as normal groups, except that they can be accessed by **group(name)** in addition to their number.<br />
**Non-capturing groups** have the format **(?:...)**. They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.
```python
import re

pattern = r"(?P<first>abc)(?:def)(ghi)"
match = re.match(pattern, "abcdefghi")
if match:
    print(match.group("first"))  # abc
    print(match.groups())        # ('abc', 'ghi'); (?:def) is not captured
```
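To make the point about numbering concrete, here is a minimal sketch that adds an optional prefix to an existing pattern in two ways. The patterns and the sample string are hypothetical and exist only for this comparison.

```python
import re

text = "mailto:info@example"

# Adding the prefix as a capturing group shifts every later group number.
capturing = r"(mailto:)?(\w+)@(\w+)"
# Adding it as a non-capturing group leaves the original numbering intact.
non_capturing = r"(?:mailto:)?(\w+)@(\w+)"

print(re.match(capturing, text).group(1))      # now the prefix, not the user
print(re.match(non_capturing, text).group(1))  # still the user part
print(re.match(non_capturing, text).groups())
```

Code that already reads group(1) and group(2) keeps working with the non-capturing version, which is the usual reason to prefer it when extending a pattern.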
- Another important metacharacter is |.
This means “or”, so red|blue matches either “red” or “blue”.
```python
import re

pattern = r"gr(a|e)y"

match = re.match(pattern, "gray")
if match:
    print("Match 1")

match = re.match(pattern, "grey")
if match:
    print("Match 2")

# "griy" contains neither alternative, so nothing is printed here.
match = re.match(pattern, "griy")
if match:
    print("Match 3")
```
<a name="e990100b"></a>
# Special Sequences
- There are various **special sequences** you can use in regular expressions. They are written as a backslash followed by another character.<br />
One useful special sequence is a backslash and a number between 1 and 99, e.g., \1 or \17. This matches the same text that the group with that number matched.<br />
In other words, parentheses define a group; a \1 that follows must match the same content as the first parenthesized group, and \17 the same content as the 17th. Note that such a backreference only makes sense together with ().
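```python
import re

pattern = r"(.+) \1"

# The same text appears on both sides of the space, so this matches.
match = re.match(pattern, "word word")
if match:
    print("Match 1")

match = re.match(pattern, "?! ?!")
if match:
    print("Match 2")

# "abc" and "def" differ, so \1 cannot match and nothing is printed.
match = re.match(pattern, "abc def")
if match:
    print("Match 3")
```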
- More useful special sequences are \d, \s, and \w.
These match digits, whitespace, and word characters respectively.
In ASCII mode they are equivalent to [0-9], [ \t\n\r\f\v], and [a-zA-Z0-9_].
In Unicode mode they match certain other characters, as well. For instance, \w matches letters with accents.
Versions of these special sequences with upper-case letters (\D, \S, and \W) mean the opposite of the lower-case versions. For instance, \D matches anything that isn’t a digit.
```python
import re

pattern = r"(\D+\d)"  # one or more non-digits followed by a digit

match = re.match(pattern, "Hi 999!")
if match:
    print("Match 1")

# Starts with a digit, so \D+ cannot match at the beginning.
match = re.match(pattern, "1, 23, 456!")
if match:
    print("Match 2")

# Contains no digit at all, so \d never matches.
match = re.match(pattern, " ! $?")
if match:
    print("Match 3")
```
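The difference between ASCII mode and Unicode mode for \w can be seen in a short sketch. It assumes Python 3, where str patterns use Unicode matching by default and the re.ASCII flag switches to ASCII-only matching; the sample word is an arbitrary choice.

```python
import re

word = "caf\u00e9"  # "café": the last letter carries an accent

# Unicode mode (the default for str patterns): \w also matches "é".
print(re.match(r"\w+", word).group())

# ASCII mode: \w is limited to [a-zA-Z0-9_], so the match stops before "é".
print(re.match(r"\w+", word, re.ASCII).group())
```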
- Additional special sequences are **\A**, **\Z**, and **\b**.<br />
The sequences **\A** and **\Z** match the beginning and end of a string, respectively.<br />
The sequence **\b** matches the empty string between **\w** and **\W** characters, or **\w** characters and the beginning or end of the string. Informally, it represents the boundary between words.<br />
The sequence **\B** matches the empty string anywhere else.
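```python
import re

# Matches "cat" only when it is not adjacent to word characters
# (letters, digits, or underscore); spaces and punctuation are fine.
pattern = r"\b(cat)\b"

match = re.search(pattern, "The cat sat!")
if match:
    print("Match 1")

# ">" and "<" are not word characters, so the boundaries hold.
match = re.search(pattern, "We s>cat<tered?")
if match:
    print("Match 2")

# Here "cat" sits inside a longer word, so nothing is printed.
match = re.search(pattern, "We scattered.")
if match:
    print("Match 3")
```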
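The remaining sequences, \A, \Z, and \B, can be tried out in the same way. The following is a minimal sketch with an arbitrary sample string; it is meant only to show where each sequence can and cannot match.

```python
import re

text = "cat scattered cat"

# \A anchors to the very start of the string, \Z to the very end.
print(bool(re.search(r"\Acat", text)))   # "cat" at the start
print(bool(re.search(r"cat\Z", text)))   # "cat" at the end
print(bool(re.search(r"\Ascat", text)))  # "scat" is not at the start

# \B requires a position that is not a word boundary,
# so \Bcat\B only finds "cat" buried inside a longer word.
inner = re.search(r"\Bcat\B", text)
print(inner.span() if inner else None)
```

The span reported for the last search points into "scattered", since the two standalone occurrences of "cat" are surrounded by word boundaries.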
# Email Extraction
- To demonstrate a sample usage of regular expressions, let's create a program to extract email addresses from a string.
Suppose we have a text that contains an email address:
str = "Please contact info@sololearn.com for assistance"
Our goal is to extract the substring "info@sololearn.com".
A basic email address consists of a word and may include dots or dashes. This is followed by the @ sign and the domain name (the name, a dot, and the domain name suffix).
This is the basis for building our regular expression.
pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
[\w\.-]+ matches one or more word characters, dots, or dashes.
The regex above says that the string should contain a word (with dots and dashes allowed), followed by the @ sign, then another similar word, then a dot and another word.
Our regex contains three groups:
1 - first part of the email address.
2 - domain name without the suffix.
3 - the domain suffix.
```python
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
str = "Please contact info@sololearn.com for assistance"

match = re.search(pattern, str)
if match:
    print(match.group())  # info@sololearn.com
```
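The three groups of the pattern can also be read one by one. This is a small sketch that reuses the pattern and sample sentence from this section and simply prints each group.

```python
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
match = re.search(pattern, "Please contact info@sololearn.com for assistance")
if match:
    print(match.group(1))  # first part of the address
    print(match.group(2))  # domain name without the suffix
    print(match.group(3))  # domain suffix, including the dot
```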
In case the string contains multiple email addresses, we could use the **re.findall** function instead of **re.search** to extract all of them.
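One detail is worth sketching here: when a pattern contains groups, re.findall returns a tuple of the groups for each match rather than the full matched text. The example below assumes a made-up sentence with two addresses and shows two possible ways to get whole addresses back; re.finditer is a swapped-in alternative, not something the original text prescribes.

```python
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
# Hypothetical sample text with two addresses.
text = "Contact info@sololearn.com or support@example.org today"

# With groups in the pattern, findall yields one tuple of groups per match.
parts = re.findall(pattern, text)
print(parts)

# Joining each tuple around "@" rebuilds the full address.
emails = [f"{user}@{domain}{suffix}" for user, domain, suffix in parts]
print(emails)

# finditer gives match objects, so group() returns the whole match directly.
print([m.group() for m in re.finditer(pattern, text)])
```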
<a name="12f88548"></a>
# Phone Number Validator
You are given a number input, and need to check if it is a valid phone number.<br />
A valid phone number has exactly 8 digits and starts with **1**, **8** or **9**.<br />
Output "Valid" if the number is valid and "Invalid" if it is not.
**Sample Input**<br />
81239870
**Sample Output**<br />
Valid
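```python
import re

# First digit must be 1, 8, or 9, followed by exactly seven more digits.
# ^ and $ anchor the pattern so that no extra characters are accepted.
pattern = r"^[189]\d{7}$"

number = input()
if re.match(pattern, number):
    print("Valid")
else:
    print("Invalid")
```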
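The same rule can also be checked with re.fullmatch, which succeeds only when the entire string fits the pattern, so no anchors are needed. The sketch below wraps the check in a helper, is_valid_phone, whose name is an assumption made for this example, and runs it on a few hand-picked inputs instead of reading from input().

```python
import re

# Hypothetical helper name; the rule itself comes from the task statement.
def is_valid_phone(number):
    # [189] is the first digit, \d{7} the remaining seven;
    # fullmatch succeeds only if the whole string fits the pattern.
    return re.fullmatch(r"[189]\d{7}", number) is not None

for candidate in ["81239870", "71239870", "8123987", "812398700"]:
    print(candidate, "Valid" if is_valid_phone(candidate) else "Invalid")
```

Only the first candidate is reported as valid: the second starts with the wrong digit, the third is too short, and the fourth is too long.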