translator: http://www.jobbole.com/members/q3014225652/ reviewer: http://www.jobbole.com/members/hanxiaomax/

via: https://www.tutorialdocs.com/article/regex-trap.html

Hidden Traps in Regular Expressions

正则表达式的隐藏陷阱,你都了解么?

A few days ago, a monitoring system for an online project reported an exception suddenly. After checking the usage of the related resources, we found that the CPU utilization rate was nearly 100%. Then we exported the stack information for the problem with the thread Dump tool that comes with Java.

几天前,一个在线项目的监控系统突然报告了一个异常。在检查了相关资源的使用率后,我们发现 CPU 利用率接近 100%。然后,我们用 Java 附带的线程转储工具导出了这个异常相关的堆栈信息。

正则表达式的陷进你都了解么? - 图1

We can see that all the stacks point to a method called validateUrl, which gets more than 100 error messages on the stack. By troubleshooting the code, we know that the main function of the method is to verify whether the URL is legal.

我们发现,所有堆栈信息都指向一个名为 “validateUrl” 的方法,它在堆栈上有超过 100 个错误消息。通过检查代码,我们发现该方法的主要功能是验证 URL 的合法性。

So how a regex can lead to a high CPU utilization. In order to reproduce the problem, we extract the key code and make a simple unit test.

一个正则表达式是如何导致如此高的 CPU 利用率的呢?为了重现这个错误,我们提取了关键代码,并进行了简单的单元测试。

  1. public static void main(String[] args) {
  2. String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$";
  3. String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
  4. if (bugUrl.matches(badRegex)) {
  5. System.out.println("match!!");
  6. } else {
  7. System.out.println("no match!!");
  8. }
  9. }

When we run the example above, through the resource monitor we can see that a process called java has a CPU utilization that has soared to 91.4%.

当我们运行上面的示例时,资源监视器显示,一个名为 Java 的进程 CPU 利用率已经飙升到 91.4%。

正则表达式的陷进你都了解么? - 图2

Now we almost can make a judge that the regex is the reason that leads to a high CPU utilization!

现在我们几乎可以判断,正则表达式是导致 CPU 利用率飙升的原因。

So, let’s focus on the regex:

所以,让我们聚焦于正则表达式:

  1. ^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

The regex looks fine and can be divided into three parts:

这个正则表达式看起来并没有什么异常。它可以分为三部分:

It matches the http and https protocols in the first part, matches the www. character in the second part, and matches other characters in the third part. I stared at the regex for a long time and didn’t find any big problem.

第一部分用于匹配 httphttps 协议。 第二部分用于匹配 www. 字符。第三部分用于匹配剩余字符。我盯着这个正则表达式看了很久,也没发现什么大问题。

In fact, the key reason for the high CPU usage here is that the engine implementation used by Java regex is the NFA, which performs backtracking when character matching is performed. Once the backtracking occurs, the time it takes will become very long. It may be a few minutes or even a few hours. The amount of time depends on the number and complexity of backtracking.

事实上,Java 用来处理正则表达式所用的 NFA 引擎是引起高 CPU 利用率的关键。当进行字符匹配时, NFA 会使用一种称为“回朔法”(backtracking)的方法。一旦发生回溯,所花费的时间将变得非常长。可能是几分钟,也可能长达数小时。所需时间的长短取决于发生回溯的次数和回溯的复杂度。

By the way, maybe some still hasn’t been very clear about what backtracking is. It doesn’t matter, let’s start with the principles of regex.

也许有些人还不太清楚回溯是什么。没关系,让我们从正则表达式的原理开始。

Regex Engine

正则表达式引擎

Regex is a set of convenient match symbols. To achieve such a complex and powerful matching syntax, we must have a set of algorithms, and the implementation of the algorithm is called the regex engine. Simply put, there are two ways to implement a regex engine: the DFA (Deterministic Final Automata) and the NFA (Non deterministic Finite Automaton).

正则表达式是一组便于匹配的符号。为了实现如此复杂且强大的匹配语法,我们必须有一组算法,算法的实现称为正则表达式引擎。简而言之,正则表达式的实现引擎有两种:一种是 DFA (有穷确定自动机 Deterministic Final Automata),另一种是 NFA (有穷非确定自动机 Non deterministic Finite Automaton).

These two Automatons are different, and we’re not going to go deep into their principles. Simply put, the time complexity of the DFA is linear, which is more stable but has limited functionality. The time complexity of NFA is relatively unstable, so sometimes it’s very good and sometimes it’s not, depending on the regex you write. But the advantage of NFA is that its functionality is even more powerful, so languages ​​such as Java, .NET, Perl, Python, Ruby, and PHP use NFA to implement their regex.

这是两种不同的自动机。在这里,我们不会深入讨论它们的原理。简单地说,DFA 的时间复杂度是线性的。它更稳定,但功能有限。NFA 的时间复杂度相对不稳定。 根据正则表达式的不同,时间有时长,有时短。NFA 的优点是它的功能更强大,所以被 Java、.NET、Perl、Python、Ruby 和 PHP 用来处理正则表达式。

How does the NFA match? We use the following characters and expressions as examples.

NFA 是怎样进行匹配的呢?我们用下面的字符串和表达式作为例子。

  1. text="Today is a nice day."
  2. regex="day"

Remember that NFA match is based on regex. That is, NFA will read one character of the regex and match it with the target string. If the match succeeds, it will turn to the next character of the regex, otherwise it will continue to compare with the next character of the target string.

记住,NFA 匹配是基于正则表达式的。也就是说,NFA 将依次读取正则表达式的匹配符,并将其与目标字符串进行匹配。如果匹配成功,它将转到正则表达式的下一个匹配符。否则,它将继续与目标字符串的下一个字符进行比较。

Let’s go through the examples above step by step.

让我们一步一步地来看一下上面的例子。

  • First, take the first match character of the regex: d. Then compare it with the first character of the string, which is T. It doesn’t match, so turn to the next one. The second character is o, and it doesn’t match either. So move on to the next one, which is d now. It matches. Then read the second character of the regular: a.

  • 首先,提取正则表达式的第一个匹配符:d。然后,将它与字符串的第一个字符 T 进行比较。不匹配,所以转到下一个。第二个字符是 o,也不匹配。继续转到下一个,也就是 d。匹配成功。于是,读取正则表达式的第二个字符: a

  • The second match character of the regex: a. And it will be compared with the fourth character of the string a. It matches again. Then go on to read the third character of the regex y.

  • 正则表达式的第二个匹配符是:a。将它与字符串的第四个字符 a 进行比较。又匹配了。于是继续读取正则表达式的第三个字符 y

  • The third match character of the regex is y. Let’s continue to match it with the fifth character of the string, and it matches. Then try to read the next character of the regex and find that there is none, so the match ends.

  • 正则表达式的第三个匹配符是 y。让我们继续与字符串的第五个字符比较。匹配成功。接着,尝试读取正则表达式的下一个字符,发现没有字符了,因此匹配结束。

The above is the matching process of the NFA, and the actual matching process is much more complicated. However, the principle of matching is the same.

以上是 NFA 的匹配过程。实际的匹配过程要复杂得多。不过,匹配的原理都是一样的。

Backtracking of NFA

NFA 回溯法

Now that you’ve learned how NFA performs string matching, let’s talk about the focus of the article: Backtracking. In order to explain the backtracking better, we’ll use the following example.

现在,你已经了解了 NFA 是如何进行字符串匹配的。下面,让我们来看一下本文的重点:回溯法。我们将使用下面的例子,以便更好的解释回朔法。

  1. text="abbc"
  2. regex="ab{1,3}c"

This is a relatively simple example. The regex starts with a and ends with c, and between them there is a string of 1-3 b characters. The match process of NFA is like this:

这是一个比较简单的例子。正则表达式以 a 开始,以 c 结束。它们之间有以 1-3 个 b 组成的字符串。NFA 的匹配过程如下:

  • First, take the first match character of the regex, which is a, and compare it with the first character of the string a. It matches, so move to the second character of the regex.

  • 首先,读取正则表达式的第一个匹配符 a,并将其与字符串的第一个字符 a 进行比较。两者匹配,所以,移动到正则表达式的第二个字符。

  • Take the second match character of the regex, which isb{1,3},and compare it with the second character of the string b. It matches again. But since b{1,3} represents 1-3 b strings and the greedy nature of the NFA (that is, to match as much as possible), it won’t read the next character of the regex at this time but still compare b{1,3} with the third character of the string, which is b too. And it matches too. Then it will continue using b{1,3} to be compared with the fourth character of the string c, and find that it doesn’t match. Backtracking occurs at this point.

  • 读取正则表达式的第二个匹配符 b{1,3},将它与字符串的第二个字符 b 进行比较。它们又匹配了。b{1,3} 代表 1-3 个 b,基于 NFA 的贪婪特性(即,尽可能地进行匹配),此时它不会读取正则表达式的下一个匹配符,而是仍然使用b{1,3} 与字符串的第三个字符 b 进行比较。它们匹配了。于是继续用 b{1,3} 与字符串的第四个字符 c 进行比较。它们不匹配。 回溯 就出现在这里

  • How does backtracking work? After the backtracking, the fourth character (that is c) of the string which has been read will be spit out and the pointer will return to the third character of the string. After that, it will read the next character c of the regex, and compare it with the next character c of the current pointer, and it matches. Then read next, but it’s over.

  • 回溯是如何进行的?回溯后,字符串中已被读取的第四个字符 c 将被放弃。指针将返回到字符串的第三个字符。接着,正则表达式的下一个匹配符 c 会被用来与待匹配字符串当前指针的下一个字符 c 进行对比。两者是匹配的。这时,字符串最后一个字符已经被读取。匹配结束。

Let’s go back and have a look at that regex which is used to validate a URL:

让我们回过头来看看用于验证 URL 的正则表达式。

  1. ^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

The URL where the problem occurred is:

出现问题的 URL 如下:

  1. http://www.fapiao.com/dzfp-web/pdf/download?request=6e7JGm38jfjghVrv4ILd-kEn64HcUX4qL4a4qJ4-CHLmqVnenXC692m74H5oxkjgdsYazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf

We divide the regex into three parts:

  • Part 1: The verification protocol.^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://).
  • Part 2: Verify the domain. (([A-Za-z0-9-~]+).)+.
  • Part 3: Verify the parameters. ([A-Za-z0-9-~\\/])+$.

我们将正则表达式分为三个部分:

  • 第1部分:验证协议。 ^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)
  • 第2部分:验证域。(([A-Za-z0-9-~]+).)+
  • 第3部分:验证参数。([A-Za-z0-9-~\\/])+$

It can be found that there is no problem with the part of the regex verification protocol http://, but when verifying www.fapiao.com, it uses the way of xxxx. to verify. So the matching process is like this:

  • Match to www.
  • Matches to fapiao.
  • Match to com/dzfp-web/pdf/download?request=6e7JGm38jf....., you will see that because of the greedy nature, the program will always try to read the subsequent strings to match, and finally it find that there is no dot, so it starts characters backtracking one by one.

可以发现,验证 http:// 协议这部分的正则表达式没有什么问题。但是,当用 xxxx. 验证 www.fabiao.com 时,匹配过程如下:

  • 匹配 www
  • 匹配 fapiao
  • 匹配 com/dzfp-web/pdf/download?request=6e7JGm38jf.....。由于“贪婪”的性质,程序会一直试图读取下一个字符进行匹配,直至上述一长串字符串读取完毕,发现找不到点号。此时,程序开始一个字符一个字符进行回溯。

This is the first problem within the regex.

这是正则表达式中的第一个问题。

Another problem is in the third part of the regex. It can be found that the URL with problem has underscore (_) and percent sign (%), but the regex corresponding to the third part doesn’t have. So only after matching for a long string of characters, it can be found that it doesn’t match and then go backtracking.

另一个问题出现在正则表达式的第三部分。可以看到,有问题的 URL 具有下划线(_)和百分号(),但是与第三部分对应的正则表达式中则没有。因此,只有在匹配完一长串字符之后,程序才发现两者不匹配,然后进行回溯。

This is the second problem within this regex.

这是这个正则表达式中的第二个问题。

Solution

解决方案

You’ve learned that backtracking is the cause of the problem. So the solution of the problem is reducing backtracking. In fact, you will find that if you add the underscore and the percent sign into the third part, the program will become normal.

你已经知道回溯是导致问题的原因。所以,解决问题的方法就是减少回溯。事实上,你会发现,如果把下划线和百分比符号添加到第三部分,程序就会变得正常。

  1. public static void main(String[] args) {
  2. String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~_%\\\\/])+$";
  3. String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
  4. if (bugUrl.matches(badRegex)) {
  5. System.out.println("match!!");
  6. } else {
  7. System.out.println("no match!!");
  8. }
  9. }

Run the above program and it will print out the match!!.

运行上面的程序,它会打印出“ match!! ”。

What if there are other URLs that contain messy characters in the future? Change it again? Certainly it’s not realistic!

如果未来其他的 URL 中含有别的混乱字符怎么办? 再次修正代码? 当然不现实!

In fact, there are three modes in regex: Greedy mode, Reluctant mode, and Possessive mode.

事实上,正则表达式有三种模式:贪婪模式勉强模式独占模式

If you add a sign ? in the regex, the Greedy mode will become Reluctant mode, that is, it will match as little as possible. However, backtracking will still occur in the Reluctant mode. For example:

如果你在正则表达式中添加一个 ? 标志,贪婪模式将变成勉强模式。此时,它将尽可能少地匹配。然而,勉强模式下回溯仍可能出现。例如:

  1. text="abbc"
  2. regex="ab{1,3}?c"

The first character of the regex: a, matches the first character of the string a. And the second operator of the regex, which is b{1,3}?, matches the second character b of the string. Because of the principle of minimum matching, the third operator c of the regex doesn’t match the third character b of the string. So it goes backtracking and compares the second operator of the regex b{1,3}? with the third character b of the string, and now the match is successful. Then the third character of the regex c matches the fourth character c of the string. Ends.

正则表达式的第一个字符 a 与字符串的第一个字符 a 相匹配。正则表达式的第二个运算符 b{1,3}? 匹配了字符串的第二个字符 b 。由于最小匹配的原则,正则表达式将读取第三个运算符 c,并与字符串第三个字符 b 进行比较。两者不匹配。因此,程序进行回溯并将正则表达式的第二个运算符 b{1,3}? 与字符串的第三个字符 b 进行比较。现在匹配成功了。之后,正则表达式的第三个匹配符 c 与字符串的第四个字符 c 正相匹配。匹配结束。

If you add a sign + instead, the original Greedy mode will become Exclusive mode, that is, it will match as much as possible, but won’t go backtracking.

如果添加 +标志,则原来的贪婪模式将变成独占模式。也就是说,它将匹配尽可能多的字符,但不会回溯。

Therefore, if you want to solve the problem completely, it must be guaranteed functionality while ensuring no backtracking. I add a plus sign to the second part of the regex that verifies the URL above:

因此,如果你想将这个问题完全解决。你必须保证表达式能正确的行使它的功能,同时确保没有回溯发生。我在上述验证 URL 的正则表达式的第二部分增添了一个加号:

  1. ^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)
  2. (([A-Za-z0-9-~]+).)++ --->>> added + here)(“+”添加在这里)
  3. ([A-Za-z0-9-~_%\\\/])+$

Now there is no problem running the program.

现在,程序运行没有问题了。

Finally, I recommend a website that can check if there is a problem with the regex you write and the corresponding string match.

最后,我推荐一个网站。它可以检查你写的正则表达式以及相应的匹配字符串是否存在问题。

Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript

For example, the URL that has problem in this article will be prompted after using the site check: catastrophic backgracking.

例如,本文中存在问题的 URL 在使用上述网站检测后,会弹出如下提示:灾难性的回溯。

正则表达式的陷进你都了解么? - 图3

When you click on “regex debugger” in the bottom left corner, it will tell you how many steps have been checked, and will list all the steps and indicate where the backtracking occurred.

当你单击左下角的 “regex debugger” 时,它将告诉你已经进行了多少步匹配,列出所有的匹配步骤,并指出发生回溯的地方。

正则表达式的陷进你都了解么? - 图4

The regex in this article automatically stops after a 110,000-step attempt. It shows that the regex does have problems and needs to be improved.

本文中的正则表达式在 110,000 次尝试之后自动停止。这表明,正则表达式存在一定的问题并需要改进。

But when I test it with the modified regex as below:

但是,当我用如下修改后的正则表达式测试时:

  1. ^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\\\/])+$

It’s prompted that it takes only 58 steps to complete the check.

网站提示,仅用了 58 步就完成了匹配。

正则表达式的陷进你都了解么? - 图5

The difference of one character causes the huge performance gap.

一个字符的差异导致了巨大的性能差距。

Something to Say

一些补充

It’s amazing how a small regex can make the CPU die. It also gives us a wake-up call when encountering regex, it should be paid attention to greedy mode and backtracking problems.

一个小小的正则表达式也能神奇的让 CPU 卡死。这给我们提了一个醒。当遇到正则表达式的时候,一定要注意“贪婪模式”以及回溯问题。