Regular Expressions - Pattern and Matcher - Huang Java Development Study

Pattern and Matcher

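Building a regular expression object:

Import the java.util.regex package.
Compile the regular expression with the static Pattern.compile() method,
which produces a Pattern object from the String form of the regular expression.
Pass the string you want to search to the Pattern object's matcher() method, which produces a Matcher object.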

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
    public static void main(String[] args){
        String[] str={"abcabcabcdefabc","abc+","(abc)+","(abc){2,}"};
        System.out.println("Input:\""+str[0]+"\"");
        for(String s : str){
            System.out.println("Regular expression:\""+s+"\"");
            Pattern p=Pattern.compile(s);
            Matcher m = p.matcher(str[0]);
            while(m.find()){
                System.out.println("Match \""+m.group()+"\" at positions "+m.start()+"-"+(m.end()-1));
            }
            System.out.println();
        }
    }
}
/* Output:
Input:"abcabcabcdefabc"
Regular expression:"abcabcabcdefabc"
Match "abcabcabcdefabc" at positions 0-14

Regular expression:"abc+"
Match "abc" at positions 0-2
Match "abc" at positions 3-5
Match "abc" at positions 6-8
Match "abc" at positions 12-14

Regular expression:"(abc)+"
Match "abcabcabc" at positions 0-8
Match "abc" at positions 12-14

Regular expression:"(abc){2,}"
Match "abcabcabc" at positions 0-8

*/

The Pattern class also provides a static method:

static boolean matches(String regex, CharSequence input)
// Checks whether regex matches the entire input, which is passed as a CharSequence.
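
As a minimal sketch of the static method above (the class name MatchesDemo, the helper isAllDigits() and the sample strings are illustrative assumptions, not part of the original text), Pattern.matches() succeeds only when the regex accounts for the whole input:

```java
import java.util.regex.Pattern;

public class MatchesDemo {
    // Hypothetical helper: true only if the whole input is one or more digits.
    static boolean isAllDigits(CharSequence input) {
        return Pattern.matches("\\d+", input);
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345"));   // the entire input matches
        System.out.println(isAllDigits("123abc"));  // only a prefix matches
    }
}
```

Each call to Pattern.matches() compiles the regex again, so when the same expression is used many times it is usually better to compile a Pattern once and reuse it.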

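A compiled Pattern object also provides the split() method, which splits the input around matches of the regex and returns the pieces as a String array.
Calling Pattern.matcher() with a string argument produces a Matcher object. The methods on Matcher let you test for different kinds of matches: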

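boolean matches() // true if the entire input string matches the pattern
boolean lookingAt() // true if the beginning of the input (not necessarily all of it) matches the pattern
boolean find() // finds the next match, searching forward through the input
boolean find(int start) // resets the matcher and searches starting at index start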

find()

Matcher.find() can be used to discover multiple pattern matches in a CharSequence.

import java.util.regex.*;
public class Test {
    public static void main(String[] args){
        Matcher m = Pattern.compile("\\w+").matcher("Evening is full of the linnet's wings");
        while(m.find())
            System.out.print(m.group()+" ");
        System.out.println();
        int i=0;
        while (m.find(i)){
            System.out.print(m.group()+" ");
            i++;
        }
    }
}
/* Output:
Evening is full of the linnet s wings 
Evening vening ening ning ing ng g is is s full full ull ll l of of f the the he e linnet linnet innet nnet net et t s s wings wings ings ngs gs s 
*/
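/* Explanation:
The pattern \\w+ splits the input into words. find() works like an iterator, moving forward through the input string. The second version of find() takes an integer argument that gives the character position at which the search starts. As the output shows, this version resets the search position to the value of its argument on every call.
*/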

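Groups

A group is a part of a regular expression delimited by parentheses, and it can be referred to by its group number. Group 0 is the entire expression, group 1 is the first parenthesized group, and so on.
So the expression A(B(C))D has three groups: group 0 is ABCD, group 1 is BC, and group 2 is C.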
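
The numbering can be checked with a short sketch. The class name GroupNumbering, the helper groupsOf() and the input "xxABCDxx" are assumptions made for illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {
    // Returns groups 0..groupCount() of the first match of regex in input,
    // or an empty array when there is no match.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find())
            return new String[0];
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++)
            groups[i] = m.group(i);
        return groups;
    }

    public static void main(String[] args) {
        String[] g = groupsOf("A(B(C))D", "xxABCDxx");
        for (int i = 0; i < g.length; i++)
            System.out.println("group " + i + ": " + g[i]);
    }
}
```

Groups are numbered by the position of their opening parenthesis, counting from left to right, which is why the outer (B(C)) is group 1 and the inner (C) is group 2.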

The Matcher object provides a set of methods for getting information about groups:

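public int groupCount() // Returns the number of groups in this matcher's pattern; group 0 is not included in the count.
public String group() // Returns group 0 (the entire match) from the previous match operation (for example, find()).
public String group(int i) // Returns the given group from the previous match operation. If the match succeeded but the specified group did not match any part of the input, returns null.
public int start(int group) // Returns the start index of the given group found in the previous match operation.
public int end(int group) // Returns the index of the last character of the given group found in the previous match operation, plus one.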
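
The sketch below exercises group(int), start(int) and end(int) on an optional group. The class GroupInfo, the helper describe() and the sample inputs are hypothetical. When a match succeeds but a group takes no part in it, group(i) returns null and start(i) and end(i) return -1:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupInfo {
    // Describes one group of the first match: its text and its [start, end) range.
    static String describe(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find())
            return "no match";
        return m.group(group) + " [" + m.start(group) + "," + m.end(group) + ")";
    }

    public static void main(String[] args) {
        // Group 2 is optional, so a successful match may leave it unmatched.
        String regex = "(\\d+)(px)?";
        System.out.println(describe(regex, "width=120px", 1));
        System.out.println(describe(regex, "width=120px", 2));
        System.out.println(describe(regex, "width=120", 2));
    }
}
```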

import java.util.regex.*;
public class Groups {
    static public final String POEM=
            "Twas brillig, and the slithy toves\n"+
            "Did gyre and gimble in the wabe.\n"+
            "All mimsy were the borogoves,\n"+
            "And the mome raths outgrabe.\n\n"+
            "Beware the Jabberwock, my son,\n"+
            "The jaws that bite, the claws that catch,\n"+
            "Beware the Jubjub bird, and shun\n"+
            "The frumious Bandersnatch.";
    public static void main(String[] args){
        Matcher m = Pattern.compile("(?m)(\\S+)\\s+((\\S+)\\s+(\\S+))$").matcher(POEM);
        while(m.find()){
            for (int j =0;j<=m.groupCount();j++)
                System.out.print("["+m.group(j)+"]");
            System.out.println();
        }
    }
}
/* Output:
[the slithy toves][the][slithy toves][slithy][toves]
[in the wabe.][in][the wabe.][the][wabe.]
[were the borogoves,][were][the borogoves,][the][borogoves,]
[mome raths outgrabe.][mome][raths outgrabe.][raths][outgrabe.]
[Jabberwock, my son,][Jabberwock,][my son,][my][son,]
[claws that catch,][claws][that catch,][that][catch,]
[bird, and shun][bird,][and shun][and][shun]
[The frumious Bandersnatch.][The][frumious Bandersnatch.][frumious][Bandersnatch.]
*/
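/* Explanation:
    The regular expression is built from runs of one or more non-whitespace characters (\S+) followed by one or more whitespace characters (\s+). Its goal is to capture the last three words of each line, so the expression ends with $. By default, however, $ matches only the end of the entire input sequence, so the regular expression must be told explicitly to pay attention to the line terminators in the input. The (?m) pattern flag at the start of the expression does this.
*/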

start() and end()

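After a successful match operation, start() returns the index of the start of the previous match, and end() returns the index of the last matched character plus one.
Calling start() or end() after a failed match operation throws an IllegalStateException.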
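
A minimal sketch of both cases follows. The class StartAfterFailure and its two helper methods are illustrative names, not part of the original example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartAfterFailure {
    // Returns the start index of the first match, or -1 if there is none.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    // Returns true if start() throws IllegalStateException after a failed find().
    static boolean startThrowsAfterFailedFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find())
            return false; // the match succeeded, so start() is legal here
        try {
            m.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstStart("ever", "Never surrender"));
        System.out.println(firstStart("xyz", "Never surrender"));
        System.out.println(startThrowsAfterFailedFind("xyz", "Never surrender"));
    }
}
```

Checking the boolean result of find(), lookingAt() or matches() before calling start() or end(), as firstStart() does, avoids the exception.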

// This example also demonstrates matches() and lookingAt()
import java.util.regex.*;
public class StartEnd {
    public static String input=
            "As long as there is injustice, whenever a\n"+
            "Targathian baby cries out, wherever a distress\n"+
            "signal sounds among the stars ... we'll be there.\n"+
            "This fine ship, and this fine crew ...\n"+
            "Never give up! Never surrender!";
    private static class Display{
        private boolean regexPrinted = false ;
        private String regex;
        Display(String regex){ this.regex=regex;    }
        void display(String message){
            if(!regexPrinted){
                System.out.println(regex);
                regexPrinted=true;
            }
            System.out.println(message);
        }
    }
    static void examine(String s,String regex){
        Display d = new Display(regex);
        Matcher m = Pattern.compile(regex).matcher(s);
        while (m.find()){
            d.display("find() '"+m.group()+"' start = "+m.start()+" end = "+m.end());
        }
        if (m.lookingAt()) // No reset() necessary
            d.display("lookingAt() start = "+m.start()+" end = "+m.end());
        if (m.matches()) // No reset() necessary
            d.display("matches() start = "+m.start()+" end = "+m.end());
    }
    public static void main(String[] args){
        for (String in : input.split("\n")){
            System.out.println("input: "+ in);
            for (String regex : new String[]{"\\w*ere\\w*","\\w*ever","T\\w+","Never.*?!"})
                examine(in,regex);
        }
    }
}
/* Output:
input: As long as there is injustice, whenever a
\w*ere\w*
find() 'there' start = 11 end = 16
\w*ever
find() 'whenever' start = 31 end = 39
input: Targathian baby cries out, wherever a distress
\w*ere\w*
find() 'wherever' start = 27 end = 35
\w*ever
find() 'wherever' start = 27 end = 35
T\w+
find() 'Targathian' start = 0 end = 10
lookingAt() start = 0 end = 10
input: signal sounds among the stars ... we'll be there.
\w*ere\w*
find() 'there' start = 43 end = 48
input: This fine ship, and this fine crew ...
T\w+
find() 'This' start = 0 end = 4
lookingAt() start = 0 end = 4
input: Never give up! Never surrender!
\w*ever
find() 'Never' start = 0 end = 5
find() 'Never' start = 15 end = 20
lookingAt() start = 0 end = 5
Never.*?!
find() 'Never give up!' start = 0 end = 14
find() 'Never surrender!' start = 15 end = 31
lookingAt() start = 0 end = 14
matches() start = 0 end = 31
*/

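Note: find() can locate the regular expression anywhere in the input, whereas lookingAt() and matches() succeed only if the regular expression starts matching at the very beginning of the input. matches() succeeds only if the entire input matches the regular expression, while lookingAt() succeeds as long as the first part of the input matches.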

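Pattern flags

Pattern.compile() has another version that accepts a flags argument to adjust matching behavior:

static Pattern compile(String regex, int flags)

The flags value is built from the following constants of the Pattern class; several flags can be combined with the bitwise OR operator (|):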

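Compile flag	Effect
Pattern.CANON_EQ	Two characters are considered to match if and only if their full canonical decompositions match. With this flag, for example, the expression a\u030A matches the string å (\u00E5). By default, matching does not take canonical equivalence into account.
Pattern.CASE_INSENSITIVE (?i)	By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched. This flag lets the pattern match without regard to case. Unicode-aware case-insensitive matching can be enabled by specifying UNICODE_CASE together with this flag.
Pattern.COMMENTS (?x)	In this mode, whitespace in the pattern is ignored, and embedded comments starting with # are ignored until the end of the line. Comments mode can also be enabled with the embedded flag expression (?x).
Pattern.DOTALL (?s)	In dotall mode, the expression . matches any character, including a line terminator. By default, . does not match line terminators.
Pattern.MULTILINE (?m)	In multiline mode, ^ and $ match at the beginning and end of each line, respectively. ^ also matches the beginning of the input, and $ also matches the end of the input. By default, these expressions match only at the beginning and end of the entire input.
Pattern.UNICODE_CASE (?u)	When this flag is specified together with CASE_INSENSITIVE, case-insensitive matching is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched.
Pattern.UNIX_LINES (?d)	In this mode, only the \n line terminator is recognized in the behavior of ., ^ and $.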

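Note: most of these flags can also be used directly inside the regular expression. Insert the characters shown in parentheses in the table above at the point in the expression where you want the flag to take effect.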
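
As a small sketch of embedded flags (the class EmbeddedFlags, the helper findAll() and the three-line input are assumptions for illustration), (?i) and (?m) at the front of the expression take the place of the CASE_INSENSITIVE and MULTILINE constants:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmbeddedFlags {
    // Collects every match of regex in input, in the order found.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find())
            found.add(m.group());
        return found;
    }

    public static void main(String[] args) {
        String input = "java one\nJava two\nJAVA three";
        // (?i) and (?m) stand in for CASE_INSENSITIVE | MULTILINE.
        System.out.println(findAll("(?i)(?m)^java", input));
        // Without the flags, ^ matches only at the start of the whole input.
        System.out.println(findAll("^java", input));
    }
}
```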

import java.util.regex.*;
public class ReFlag {
    public static void main(String[] args){
        Pattern p=Pattern.compile("^java",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = p.matcher("java has regex\nJava has regex\n"+
                "JAVA has pretty good regular expressions\n"+
                "Regular expressions are in Java");
        while(m.find())
            System.out.println(m.group());
    }
}
/* Output:
java
Java
JAVA
*/

split()

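split() divides an input string into an array of String objects, with the boundaries determined by matches of the regular expression:

String[] split(CharSequence input)
String[] split(CharSequence input, int limit) // limits the number of pieces the input is split into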

import java.util.Arrays;
import java.util.regex.*;
public class SplitDemo {
    public static void main(String[] args){
        String input="This!!unusual use!!of exclamation!!points";
        System.out.println(Arrays.toString(Pattern.compile("!!").split(input)));
        // Only do the first three
        System.out.println(Arrays.toString(Pattern.compile("!!").split(input,3)));
    }
}
/* Output:
[This, unusual use, of exclamation, points]
[This, unusual use, of exclamation!!points]
*/

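Replace operations

Regular expressions are especially handy for replacing text, and several methods are available for it:

replaceFirst(String replacement) // Replaces the first matching part of the input with replacement
replaceAll(String replacement) // Replaces every matching part of the input with replacement
appendReplacement(StringBuffer sbuf, String replacement)
// Performs step-by-step replacement. It lets you call other methods to generate or process the replacement (replaceFirst() and replaceAll() accept only a fixed string), so you can split the target into groups programmatically and build more powerful replacements.
appendTail(StringBuffer sbuf) // After one or more calls to appendReplacement(), copies the remainder of the input string into sbuf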

import java.util.regex.*;
public class TheReplacements {
    public static void main(String[] args){
        String s="Here's a block of text to use as input to\n" +
                "the regular expression matcher.Note that we'll\n" +
                "first extract the block of text by looking for\n" +
                "the special delimiters, then process the\n" +
                "extracted block.\n";
        Matcher mInput = Pattern.compile("/\\*!(.*)!\\*/",Pattern.DOTALL).matcher(s);
        if (mInput.find())
            s=mInput.group(1); // Captured by parentheses
        s=s.replaceAll(" {2,}"," "); // Replace 2 or more spaces with a single space
        // Replace 1 or more spaces at the beginning of each line with no spaces. Must enable MULTILINE mode.
        s=s.replaceAll("(?m)^ +","");
        System.out.println(s);
        s= s.replaceFirst("[aeiou]","(VOWEL1)");
        StringBuffer sbuf =new StringBuffer();
        Pattern p =Pattern.compile("[aeiou]");
        Matcher m=p.matcher(s);
        // Process the find information as you perform the replacements:
        while (m.find())
            m.appendReplacement(sbuf,m.group().toUpperCase());
        //  Put in the remainder of the text
        m.appendTail(sbuf);
        System.out.println(sbuf);
    }
}
/* Output:
Here's a block of text to use as input to
the regular expression matcher.Note that we'll
first extract the block of text by looking for
the special delimiters, then process the
extracted block.
H(VOWEL1)rE's A blOck Of tExt tO UsE As InpUt tO
thE rEgUlAr ExprEssIOn mAtchEr.NOtE thAt wE'll
fIrst ExtrAct thE blOck Of tExt by lOOkIng fOr
thE spEcIAl dElImItErs, thEn prOcEss thE
ExtrActEd blOck.
*/
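/* Explanation:
The two replaceAll() calls near the start use the method built into String, which is more convenient here. Because each replacement is performed only once, calling String's replaceAll() directly costs nothing extra compared with precompiling a Pattern.
The replacement passed to replaceFirst() and replaceAll() is just a plain string, so these two methods cannot help if you want to process each replacement in some special way. For that, use appendReplacement(), which lets you manipulate the replacement string while the replacement is being performed.
In this example, sbuf is created to hold the final result, group() selects the matched text, and each vowel found by the regular expression is converted to upper case. Normally you step through all the replacements and then call appendTail(). To simulate replaceFirst() (or to replace only n times), perform just that many replacements and then call appendTail() to copy the unprocessed remainder into sbuf.
*/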
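
The "replace n times" idea from the explanation can be sketched as follows. The class ReplaceFirstN and the helper upperCaseFirstN() are hypothetical names, and Matcher.quoteReplacement() is added here only as a precaution so that any $ or \ in the matched text is treated literally:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceFirstN {
    // Upper-cases at most n matches of regex and copies the rest unchanged.
    static String upperCaseFirstN(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sbuf = new StringBuffer();
        int replaced = 0;
        while (replaced < n && m.find()) {
            // quoteReplacement() keeps '$' and '\' in the match literal.
            m.appendReplacement(sbuf, Matcher.quoteReplacement(m.group().toUpperCase()));
            replaced++;
        }
        m.appendTail(sbuf); // copy everything after the last replacement
        return sbuf.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseFirstN("[aeiou]", "extracted block", 1));
        System.out.println(upperCaseFirstN("[aeiou]", "extracted block", 3));
    }
}
```

With n set to 1 the method behaves like replaceFirst() with a computed replacement; a large n makes it behave like replaceAll().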

reset()

The reset() method applies an existing Matcher object to a new character sequence:

import java.util.regex.*;
public class Resetting {
    public static void main(String[] args){
        Matcher m = Pattern.compile("[frb][aiu][gx]").matcher("fix the rug with bags");
        while (m.find())
            System.out.print(m.group()+" ");
        System.out.println();
        m.reset("fix the rig with bags");
        while (m.find())
            System.out.print(m.group()+" ");
    }
}
/* Output:
fix rug bag 
fix rig bag
*/

Calling reset() with no arguments sets the Matcher back to the beginning of the current character sequence.
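
A minimal sketch of the no-argument form, assuming an illustrative class ResetNoArg and helper firstSecondThenAfterReset():

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResetNoArg {
    // Returns the first match, the second match, and the match found
    // immediately after calling reset() with no arguments.
    static String[] firstSecondThenAfterReset(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        String[] result = new String[3];
        if (m.find()) result[0] = m.group(); // first match
        if (m.find()) result[1] = m.group(); // second match
        m.reset();                           // back to the start of the same input
        if (m.find()) result[2] = m.group(); // first match again
        return result;
    }

    public static void main(String[] args) {
        String[] r = firstSecondThenAfterReset("[frb][aiu][gx]", "fix the rug with bags");
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```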