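Saturday, May 21, 2011

  Regular Expression Matching

Regular expressions are very useful in practice. The best site for learning Java is http://download.oracle.com/javase/tutorial/essential/regex/test_harness.html

Below is an example. Export it as a Runnable JAR, then run java -jar XXX.jar to try out all kinds of regexes.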

 
import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
 
public class RegexTestHarness {
 
    public static void main(String[] args){
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
 
            Pattern pattern = 
            Pattern.compile(console.readLine("%nEnter your regex: "));
 
            Matcher matcher = 
            pattern.matcher(console.readLine("Enter input string to search: "));
 
            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text \"%s\" starting at " +
                   "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}
Positional relationships:
 [figure: cells]

Character-matching expressions:

Character Classes
[abc]     a, b, or c (simple class)
[^abc]     Any character except a, b, or c (negation)
[a-zA-Z]     a through z, or A through Z, inclusive (range)
[a-d[m-p]]     a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]     d, e, or f (intersection)
[a-z&&[^bc]]     a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]     a through z, and not m through p: [a-lq-z] (subtraction)
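
The union, intersection, and subtraction rows above can be checked with a few lines of Java. This is only a minimal sketch: the class name CharClassDemo and the helper matches are made up for illustration, and only Pattern.matches comes from java.util.regex.

```java
import java.util.regex.Pattern;

class CharClassDemo {

    // Returns true if the whole input matches the given character class.
    static boolean matches(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-d[m-p]]", "n"));    // union: true
        System.out.println(matches("[a-z&&[def]]", "e"));  // intersection: true
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // subtraction: false
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // subtraction: true
    }
}
```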

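Whitespace metacharacters:


\s   matches a whitespace character, such as a space, tab, or newline
\n   matches a newline (end-of-line) character
\r   matches a carriage return
\t   matches a tab
\f   matches a form feed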

Predefined characters

Predefined Character Classes
.     Any character (may or may not match line terminators)
\d     A digit: [0-9]
\D     A non-digit: [^0-9]
\s     A whitespace character: [ \t\n\x0B\f\r]
\S     A non-whitespace character: [^\s]
\w     A word character: [a-zA-Z_0-9]
\W     A non-word character: [^\w]
 
  • \d matches all digits
  • \s matches whitespace characters
  • \w matches word characters
Alternatively, a capital letter means the opposite:
  • \D matches non-digits
  • \S matches non-whitespace characters
  • \W matches non-word characters
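
One way to see what each predefined class covers is to replace every match with a marker character. The sketch below does that; the class name PredefinedClassDemo, the helper mask, and the sample input are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {

    // Replaces every match of regex in input with '#', making matches visible.
    static String mask(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("#");
    }

    public static void main(String[] args) {
        String input = "ab 12_c";
        System.out.println(mask("\\d", input)); // ab ##_c
        System.out.println(mask("\\s", input)); // ab#12_c
        System.out.println(mask("\\w", input)); // ## ####
        System.out.println(mask("\\D", input)); // ###12##
    }
}
```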

 

Matching strategies:

Quantifiers
Greedy     Reluctant   Possessive   Meaning
X?         X??         X?+          X, once or not at all
X*         X*?         X*+          X, zero or more times
X+         X+?         X++          X, one or more times
X{n}       X{n}?       X{n}+        X, exactly n times
X{n,}      X{n,}?      X{n,}+       X, at least n times
X{n,m}     X{n,m}?     X{n,m}+      X, at least n but not more than m times
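
The difference between the three columns shows up most clearly when the same input is searched with .*foo, .*?foo, and .*+foo. The following is a small sketch of that comparison; the class name QuantifierDemo and the helper findAll are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // Collects every match of regex in input, in the order find() returns them.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off just enough for "foo".
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: .*? takes as little as possible, so two matches are found.
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes everything and never backs off, so "foo" fails.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```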

Capturing groups:

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g". The portion of the input string that matches the capturing group will be saved in memory for later recall via backreferences (as discussed below in the section Backreferences).

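There are two forms of capturing group.
One is the ordinary capturing group (hereafter simply "capturing group" where no ambiguity arises), with the syntax (expression);
the other is the named capturing group, with the syntax (?<name>expression) or (?'name'expression); the two forms are equivalent. (The quoted form is .NET syntax; java.util.regex accepts only (?<name>expression), from Java 7 on.)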
  
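1. Numbering rules
If groups are not explicitly named, that is, no named capturing groups are used, all capturing groups must be accessed by number.
When there are only ordinary capturing groups, they are numbered from left to right in the order in which their opening parenthesis "(" appears.
(\d{4})-(\d{2}-(\d\d))
1     1 2      3    32
The regex above matches dates in yyyy-MM-dd format. The two spellings \d{2} and \d\d are used so that the groups can be told apart in the table below.
There is also a default group numbered 0, which represents the match of the regex as a whole.
Matching the regex above against the string: 2008-12-31
Result:
Number  Name  Capturing group          Matched text
0             (\d{4})-(\d{2}-(\d\d))   2008-12-31
1             (\d{4})                  2008
2             (\d{2}-(\d\d))           12-31
3             (\d\d)                   31
If a group is explicitly named, that is, it is a named capturing group, the captured content can be referenced by the group's name.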
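
The numbered-group table above can be reproduced in Java with Matcher.group(int) and Matcher.groupCount(). This is a minimal sketch; the class name GroupNumberDemo and the helper groups are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {

    // Returns groups 0..groupCount() of the first match, or null if no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("(\\d{4})-(\\d{2}-(\\d\\d))", "2008-12-31");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        // 0: 2008-12-31
        // 1: 2008
        // 2: 12-31
        // 3: 31
    }
}
```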
 

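However, if a regex uses both ordinary and named capturing groups, take care with the group numbers. The rule described here, which is the .NET one, numbers the ordinary capturing groups first and the named capturing groups afterwards. (java.util.regex instead numbers all groups, named or not, from left to right by opening parenthesis.)

(\d{4})-(?<date>\d{2}-(\d\d))
1     1 3             2    23
Matching the regex above against the string: 2008-12-31
Result:
Number  Name  Capturing group                 Matched text
0             (\d{4})-(?<date>\d{2}-(\d\d))   2008-12-31
1             (\d{4})                         2008
2             (\d\d)                          31
3       date  (?<date>\d{2}-(\d\d))           12-31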
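
For trying the same pattern in Java, the sketch below may help. It assumes Java 7 or later, where (?<name>...) and Matcher.group(String) are available. Note that java.util.regex numbers named groups from left to right together with ordinary ones, so the group numbers it reports differ from the table above: the named group is number 2 and the inner (\d\d) is number 3. The class name NamedGroupDemo and its helper methods are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {

    static final Pattern DATE =
            Pattern.compile("(\\d{4})-(?<date>\\d{2}-(\\d\\d))");

    // Returns the text captured by the named group "date", or null if no match.
    static String byName(String input) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group("date") : null;
    }

    // Returns the text captured by group number n, or null if no match.
    static String byNumber(String input, int n) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        System.out.println(byName("2008-12-31"));      // 12-31
        System.out.println(byNumber("2008-12-31", 1)); // 2008
        System.out.println(byNumber("2008-12-31", 2)); // 12-31 (the named group)
        System.out.println(byNumber("2008-12-31", 3)); // 31
    }
}
```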
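2. Referencing capturing groups
Capturing groups are generally referenced in the following ways:
a) within the regex, referring to the content captured by an earlier group, which is called a backreference;
b) within the regex, in the conditional expression (?(expression)true|false) (a construct that java.util.regex does not support);
c) outside the regex, in program code, referring to the content captured by a group.
Backreferences
For an ordinary capturing group, the backreference syntax is \k<num>, usually abbreviated to \num, where num is a decimal number, namely the group's number. (Java accepts only the \num form.)
For a named capturing group, the backreference syntax is \k<name> or \k'name'. (Java accepts only \k<name>.)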
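
A short Java sketch of both backreference styles follows. It checks whether a two-digit pair is immediately repeated, once with a numbered backreference (\1) and once with a named one (\k<pair>, Java 7 or later). The class name BackreferenceDemo, the method names, and the group name pair are assumptions for illustration.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {

    // Numbered backreference: \1 must repeat exactly what group 1 captured.
    static boolean repeatsPair(String input) {
        return Pattern.matches("(\\d\\d)\\1", input);
    }

    // Named backreference: \k<pair> repeats what the group named "pair" captured.
    static boolean repeatsPairNamed(String input) {
        return Pattern.matches("(?<pair>\\d\\d)\\k<pair>", input);
    }

    public static void main(String[] args) {
        System.out.println(repeatsPair("1212"));      // true
        System.out.println(repeatsPair("1234"));      // false
        System.out.println(repeatsPairNamed("1212")); // true
        System.out.println(repeatsPairNamed("1234")); // false
    }
}
```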
  

 

Boundary Matchers
^      The beginning of a line
$      The end of a line
\b      A word boundary
\B      A non-word boundary
\A      The beginning of the input
\G      The end of the previous match
\Z      The end of the input but for the final terminator, if any
\z      The end of the input
To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b
 
 
 
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
 
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.
To match the expression on a non-word boundary, use \B instead:
 
Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.
 
Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
To require the match to occur only at the end of the previous match, use \G:
 
Enter your regex: dog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
I found the text "dog" starting at index 4 and ending at index 7.
 
Enter your regex: \Gdog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
Here the second example finds only one match, because the second occurrence of "dog" does not start at the end of the previous match.
Java Pattern Class Flags

 

Constant                      Equivalent Embedded Flag Expression
Pattern.CANON_EQ              None
Pattern.CASE_INSENSITIVE      (?i)
Pattern.COMMENTS              (?x)
Pattern.MULTILINE             (?m)
Pattern.DOTALL                (?s)
Pattern.LITERAL               None
Pattern.UNICODE_CASE          (?u)
Pattern.UNIX_LINES            (?d)
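
The two columns of the table are interchangeable: a flag can be passed to Pattern.compile as a constant or written inside the regex as an embedded expression. The sketch below compares the two for case-insensitive matching; the class name FlagDemo and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class FlagDemo {

    // Flag passed as a constant to Pattern.compile.
    static boolean findsWithConstant(String input) {
        return Pattern.compile("dog", Pattern.CASE_INSENSITIVE)
                .matcher(input).find();
    }

    // The same flag written as an embedded flag expression.
    static boolean findsWithEmbedded(String input) {
        return Pattern.compile("(?i)dog").matcher(input).find();
    }

    // No flag: matching is case-sensitive.
    static boolean findsWithoutFlag(String input) {
        return Pattern.compile("dog").matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "The DOG barks";
        System.out.println(findsWithConstant(input)); // true
        System.out.println(findsWithEmbedded(input)); // true
        System.out.println(findsWithoutFlag(input));  // false
    }
}
```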
Using the matches(String,CharSequence) Method

  The Pattern class defines a convenient matches method that allows you to quickly check whether a given input string matches a pattern. As with all public static methods, you should invoke matches by its class name, such as Pattern.matches("\\d","1");. In this example, the method returns true, because the digit "1" matches the regular expression \d. Note that matches requires the entire input to match, so Pattern.matches("\\d","a1") returns false.
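
Because Pattern.matches tests the whole input, it behaves differently from the find loop used in the test harness. A minimal sketch of the contrast, with the class name MatchesDemo and both helper names invented for the example:

```java
import java.util.regex.Pattern;

class MatchesDemo {

    // True only if the entire input matches the regex.
    static boolean wholeInput(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches anywhere inside the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("\\d", "1"));  // true
        System.out.println(wholeInput("\\d", "a1")); // false
        System.out.println(anywhere("\\d", "a1"));   // true
    }
}
```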

Using the split(String) Method

 

import java.util.regex.Pattern;
import java.util.regex.Matcher;
 
public class SplitDemo {
 
    private static final String REGEX = ":";
    private static final String INPUT = "one:two:three:four:five";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        String[] items = p.split(INPUT);
        for(String s : items) {
            System.out.println(s);
        }
    }
}
OUTPUT:
 
one
two
three
four
five

Another demo:

import java.util.regex.Pattern;
import java.util.regex.Matcher;
 
public class SplitDemo2 {
 
    private static final String REGEX = "\\d";
    private static final String INPUT = "one9two4three7four1five";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        String[] items = p.split(INPUT);
        for(String s : items) {
            System.out.println(s);
        }
    }
}
OUTPUT:
 
one
two
three
four
five
Index Methods
Index methods provide useful index values that show precisely where the match was found in the input string:
  • public int start(): Returns the start index of the previous match.
  • public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
  • public int end(): Returns the offset after the last character matched.
  • public int end(int group): Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
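
The four index methods listed above can be exercised together. The sketch below returns the start and end offsets of a chosen group; group 0 gives the same values as the no-argument start() and end(). The class name IndexMethodDemo, the helper span, and the sample input are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexMethodDemo {

    // Returns {start, end} of the given group in the first match,
    // or null if the pattern is not found.
    static int[] span(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    public static void main(String[] args) {
        String regex = "(\\d{4})-(\\d{2})";
        String input = "on 2008-12-31";
        System.out.println(Arrays.toString(span(regex, input, 0))); // [3, 10]
        System.out.println(Arrays.toString(span(regex, input, 1))); // [3, 7]
        System.out.println(Arrays.toString(span(regex, input, 2))); // [8, 10]
    }
}
```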
Study Methods
Study methods review the input string and return a boolean indicating whether or not the pattern is found.
Replacement Methods
Replacement methods are useful methods for replacing text in an input string.
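
As a rough illustration of both families, the sketch below uses two study methods of Matcher (lookingAt and matches) and two replacement methods (replaceAll and replaceFirst). The class name StudyReplaceDemo, the helper names, and the sample strings are made up for this example.

```java
import java.util.regex.Pattern;

class StudyReplaceDemo {

    // Study method lookingAt: true if the regex matches at the start of input.
    static boolean matchesPrefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // Study method matches: true only if the regex matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Replacement method replaceAll: replaces every match.
    static String replaceEvery(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    // Replacement method replaceFirst: replaces only the first match.
    static String replaceOnce(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        System.out.println(matchesPrefix("foo", "fooooo")); // true
        System.out.println(matchesAll("foo", "fooooo"));    // false

        String input = "The dog says meow. All dogs say meow.";
        System.out.println(replaceEvery("dog", input, "cat"));
        // The cat says meow. All cats say meow.
        System.out.println(replaceOnce("dog", input, "cat"));
        // The cat says meow. All dogs say meow.
    }
}
```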
