rust regex

https://rust-lang.github.io/regex/regex/index.html

python regex

https://docs.python.org/zh-cn/3.9/library/re.html

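A regular expression can contain both ordinary and special characters. Most ordinary characters, such as 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so last matches the string 'last'.
(In the rest of this section, regular expressions are written in this special style, usually without quotes, and strings to be matched are written 'in single quotes'.)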

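Some characters, such as '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
Repetition qualifiers (*, +, ?, {m,n}, etc.) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.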
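A minimal sketch of the nesting rule using Python's re module (variable names are illustrative): a quantifier applied directly to an already-quantified item is rejected at compile time, while wrapping the inner repetition in a non-capturing group is accepted.

```python
import re

# A second quantifier placed directly after {6} is rejected ("multiple repeat").
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False

# Wrapping the inner repetition in a non-capturing group makes it legal.
pattern = re.compile(r"(?:a{6})*")
twelve = pattern.fullmatch("a" * 12)   # two blocks of six: matches
seven = pattern.fullmatch("a" * 7)     # not a multiple of six: no match
```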
The special characters are:
.
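(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.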
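A short illustration of the dot with and without DOTALL (the sample text is made up for this sketch):

```python
import re

text = "line1\nline2"

# By default the dot stops at the newline.
default_match = re.match(r".+", text).group()             # 'line1'

# With re.DOTALL the dot also consumes the newline.
dotall_match = re.match(r".+", text, re.DOTALL).group()   # 'line1\nline2'
```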
^
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
$
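Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.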
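The anchor behaviour described for ^ and $ can be checked with a small sketch; the strings are the ones used in the text, plus one made-up two-line sample for ^:

```python
import re

text = "foo1\nfoo2\n"

plain_end = re.search(r"foo.$", text).group()                # 'foo2'
multi_end = re.search(r"foo.$", text, re.MULTILINE).group()  # 'foo1'

# A bare $ in 'foo\n' matches twice: before the newline and at the very end.
empty_positions = [m.start() for m in re.finditer(r"$", "foo\n")]  # [3, 4]

# ^ matches only at the start, unless MULTILINE is set.
starts_default = re.findall(r"^\w+", "one\ntwo")                 # ['one']
starts_multi = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)     # ['one', 'two']
```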

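*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.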
+
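Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
?
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.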
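A quick sketch of the three basic qualifiers on the ab examples from the text, using fullmatch so the whole string has to match:

```python
import re

star_a = re.fullmatch(r"ab*", "a")        # zero 'b's is fine
star_abbb = re.fullmatch(r"ab*", "abbb")  # any number of 'b's
plus_a = re.fullmatch(r"ab+", "a")        # needs at least one 'b': None
plus_ab = re.fullmatch(r"ab+", "ab")
opt_ab = re.fullmatch(r"ab?", "ab")
opt_abb = re.fullmatch(r"ab?", "abb")     # at most one 'b': None
```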
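*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.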
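The greedy versus non-greedy difference on the '<a> b <c>' example, as a small sketch:

```python
import re

text = "<a> b <c>"

greedy = re.match(r"<.*>", text).group()    # '<a> b <c>'
minimal = re.match(r"<.*?>", text).group()  # '<a>'
```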
{m}
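Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.
{m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted, or the modifier would be confused with the previously described form.
{m,n}?
Non-greedy version of the previous qualifier; it matches as few repetitions as possible. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.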
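The counted forms can be tried on the same strings the text mentions; this is only a sketch with illustrative variable names:

```python
import re

greedy_count = re.match(r"a{3,5}", "aaaaaa").group()    # 'aaaaa'
minimal_count = re.match(r"a{3,5}?", "aaaaaa").group()  # 'aaa'

# Open upper bound: at least four 'a's, then a 'b'.
four_ok = re.fullmatch(r"a{4,}b", "aaaab")
many_ok = re.fullmatch(r"a{4,}b", "a" * 1000 + "b")
three_fail = re.fullmatch(r"a{4,}b", "aaab")
```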
\
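Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.
If you're not using a raw string (r'raw') to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn't recognized by Python's parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.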
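A small sketch of why raw strings are recommended; the sample sentence is invented for illustration:

```python
import re

# The same pattern written with doubled backslashes and as a raw string.
escaped = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same_pattern = escaped == raw                          # True

numbers = re.findall(raw, "pi is 3.14, e is 2.71")    # ['3.14', '2.71']

# "\b" in an ordinary literal is a single backspace character,
# while r"\b" keeps both characters and reaches re as a word boundary.
backspace_len = len("\b")    # 1
boundary_len = len(r"\b")    # 2
```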
[]
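Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

Ranges of characters can be indicated by giving two characters and separating them by a '-'. For example, [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it is placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.

Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.

Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it is not the first character in the set.
To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match any of the bracket characters.
Support for nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning is raised for now in ambiguous cases. That includes sets starting with a literal '[' or containing the literal character sequences '--', '&&', '~~', and '||'. To avoid a warning, escape them with a backslash.

Changed in version 3.7: FutureWarning is raised if a character set contains constructs that will change semantically in the future.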
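The set rules above can be sketched as follows; the patterns come from the text and the input strings are made up for illustration:

```python
import re

minutes = re.findall(r"[0-5][0-9]", "07 59 60 99")   # ['07', '59']
literals = re.findall(r"[(+*)]", "(a+b)*c")          # ['(', '+', ')', '*']
not_five = re.findall(r"[^5]", "1525")               # ['1', '2']
hyphen = re.findall(r"[a\-z]", "a-z b")              # ['a', '-', 'z']

# Two ways to include ']' in a set: escape it, or put it first.
brackets_a = re.findall(r"[()[\]{}]", "f(x)[0]{}")
brackets_b = re.findall(r"[]()[{}]", "f(x)[0]{}")
```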
|
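A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].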
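A brief sketch of the left-to-right, non-greedy behaviour of '|' (sample strings are illustrative):

```python
import re

# The first branch that matches wins, even if a later one would match more.
first_wins = re.match(r"a|ab", "ab").group()   # 'a'
reordered = re.match(r"ab|a", "ab").group()    # 'ab'

# Two ways to match a literal '|'.
pipes = re.findall(r"\|", "a|b|c")             # ['|', '|']
in_set = re.findall(r"[|]", "a|b|c")           # ['|', '|']
```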
(…)
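(Group.) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].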
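A minimal sketch of capturing a group, reusing it with \1, and matching literal parentheses; the inputs are invented for this example:

```python
import re

# (\w+) captures a word; \1 requires the same text to appear again.
repeat = re.search(r"(\w+) \1", "hello hello world")
whole = repeat.group(0)    # 'hello hello'
word = repeat.group(1)     # 'hello'

# Escaped parentheses match the literal characters.
calls = re.findall(r"\(\d+\)", "f(1) g(22)")   # ['(1)', '(22)']
```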
(?…)
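This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines the meaning and further syntax of the construct. Extensions usually do not create a new group; (?P<name>…) is the only exception to this rule. The currently supported extensions follow.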
(?aiLmsux)
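(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags for the entire regular expression: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose). (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.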
(?:…)
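A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.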
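The difference between capturing and non-capturing parentheses shows up in what groups() returns; a small sketch:

```python
import re

# (?:ab)+ groups for repetition only, so just (c) is captured.
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()   # ('c',)

# (ab)+ captures too; a repeated group keeps its last iteration.
capturing = re.match(r"(ab)+(c)", "ababc").groups()         # ('ab', 'c')
```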
(?aiLmsux-imsx:…)
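(Zero or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x', optionally followed by '-' followed by one or more letters from 'i', 'm', 's', 'x'.) The letters set or remove the corresponding flags for that part of the expression: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose). (The flags are described in Module Contents.)
The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they cannot be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides the matching mode in the enclosing group. In Unicode patterns, (?a:…) switches to ASCII-only matching and (?u:…) switches to Unicode matching (the default). In byte patterns, (?L:…) switches to locale dependent matching and (?a:…) switches to ASCII-only matching (the default). This override is only in effect inside the inline group; the original matching mode is restored outside of it.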
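A sketch of global and scoped inline flags, assuming a Python version that supports the scoped (?flags:…) form (3.6 or later); the test strings are made up:

```python
import re

# (?i) at the start applies to the whole pattern.
whole = re.findall(r"(?i)python", "Python PYTHON python")
# ['Python', 'PYTHON', 'python']

# (?i:...) applies only inside the group: 'py' is case-insensitive, 'thon' is not.
scoped = re.findall(r"(?i:py)thon", "Python PYthon PYTHON")
# ['Python', 'PYthon']

# (?-i:...) removes a flag for part of the pattern.
removed = re.findall(r"(?i)a(?-i:b)", "ab Ab aB AB")
# ['ab', 'Ab']

# (?a:...) switches \w to ASCII-only matching inside the group.
ascii_only = re.findall(r"(?a:\w+)", "caf\u00e9")
# ['caf']
```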
The special characters are:
"." Matches any character except a newline.
"^" Matches the start of the string.
"$" Matches the end of the string or just before the newline at the end of the string.
"*" Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means that it will match as many repetitions as possible.
"+" Matches 1 or more (greedy) repetitions of the preceding RE.
"?" Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n} Matches from m to n repetitions of the preceding RE.
{m,n}? Non-greedy version of the above.
"\" Either escapes special characters or signals a special sequence.
[] Indicates a set of characters. A "^" as the first character indicates a complementing set.
"|" A|B, creates an RE that will match either A or B.
(…) Matches the RE inside the parentheses. The contents can be retrieved or matched later in the string.
(?aiLmsux) The letters set the corresponding flags defined below.
(?:…) Non-grouping version of regular parentheses.
(?P<name>…) The substring matched by the group is accessible by name.
(?P=name) Matches the text matched earlier by the group named name.
(?#…) A comment; ignored.
(?=…) Matches if … matches next, but doesn't consume the string.
(?!…) Matches if … doesn't match next.
(?<=…) Matches if preceded by … (must be fixed length).
(?<!…) Matches if not preceded by … (must be fixed length).
(?(id/name)yes|no) Matches yes pattern if the group with id/name matched, the (optional) no pattern otherwise.
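Several of the extension forms in this summary can be tried together in a short sketch; the group names (year, month, q) and the sample strings are assumptions made for illustration:

```python
import re

# Named groups, read back by name.
date = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-05")
year = date.group("year")       # '2024'
month = date.group("month")     # '05'

# (?P=q) must repeat whatever the group named q matched (the opening quote).
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()   # "'hi'"

# Lookahead and lookbehind test the context without consuming it.
py_stems = re.findall(r"\w+(?=\.py)", "a.py b.txt c.py")    # ['a', 'c']
prices = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")       # ['15']

# Conditional: a closing '>' is required only if an opening '<' was matched.
tag = r"(<)?\w+(?(1)>)"
balanced = re.fullmatch(tag, "<tag>")
bare = re.fullmatch(tag, "tag")
unbalanced = re.fullmatch(tag, "<tag")
```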

The special sequences consist of "\" and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character.
\number Matches the contents of the group of the same number.
\A Matches only at the start of the string.
\Z Matches only at the end of the string.
\b Matches the empty string, but only at the start or end of a word.
\B Matches the empty string, but not at the start or end of a word.
\d Matches any decimal digit; equivalent to the set [0-9] in bytes patterns or string patterns with the ASCII flag. In string patterns without the ASCII flag, it will match the whole range of Unicode digits.
\D Matches any non-digit character; equivalent to [^\d].
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v] in bytes patterns or string patterns with the ASCII flag. In string patterns without the ASCII flag, it will match the whole range of Unicode whitespace characters.
\S Matches any non-whitespace character; equivalent to [^\s].
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_] in bytes patterns or string patterns with the ASCII flag. In string patterns without the ASCII flag, it will match the range of Unicode alphanumeric characters (letters plus digits plus underscore). With LOCALE, it will match the set [0-9_] plus characters defined as letters for the current locale.
\W Matches the complement of \w.
\\ Matches a literal backslash.
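A few of these special sequences in action, as a sketch with made-up inputs; the Unicode digit used is U+0663 (ARABIC-INDIC DIGIT THREE):

```python
import re

# \b only matches at word boundaries, so 'cat' inside 'concat' is skipped.
words = re.findall(r"\bcat\b", "cat concat cat's")       # ['cat', 'cat']

# \d matches Unicode decimal digits unless the ASCII flag is set.
unicode_digit = re.fullmatch(r"\d", "\u0663")
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII)    # None

# \s covers spaces, tabs and newlines.
fields = re.split(r"\s+", "a b\tc\nd")                   # ['a', 'b', 'c', 'd']

# \w includes the underscore but not the hyphen.
tokens = re.findall(r"\w+", "snake_case-42")             # ['snake_case', '42']

# \\ in the pattern matches one literal backslash.
slashes = re.findall(r"\\", r"C:\dir\file")
```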

Rust Study Notes
