Tuesday 13 December 2011

Regular Expression Tutorial



Definition:
A regular expression is a set of characters that describes a particular sequence of characters, or pattern.

Introduction:
It is used to specify a particular pattern to search for in a stream of text.
A simple word search is easy, and every text editor provides one.

But with regular expressions you can do more complex searches, such as searching for words that end with a vowel, a digit, etc.

There are 3 important parts to a regular expression:

1. Anchors
    Used to specify the position of the pattern with respect to a line of text

2. Character Sets
    Match any one character, out of a set of one or more characters, at a single position

3. Modifiers
    Specify how many times the previous character set is repeated

For example: "^#*"

Anchor        -> "^" : indicates the beginning of a line
Character Set -> "#" : this is just the character "#"
Modifier      -> "*" : indicates that the previous character set can appear
                       any number of times, including zero


The anchor characters are "^" and "$".

^ indicates the start of a line
$ indicates the end of a line

e.g. ^B will match all lines which start with a capital B
       B$ will match all lines which end with a capital B

^ must appear at the start of the regular expression and $ must appear at the end; otherwise they don't act as anchors.
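
The two anchors can be tried out in any language that has a regular expression library. Here is a minimal sketch using Python's standard re module; the sample lines and variable names are made up for illustration only.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["Bob", "Bat cab", "cab B", "aBc"]

# ^B : keep the lines that start with a capital B
starts_with_b = [line for line in lines if re.search(r"^B", line)]

# B$ : keep the lines that end with a capital B
ends_with_b = [line for line in lines if re.search(r"B$", line)]

print(starts_with_b)
print(ends_with_b)
```

Note that "aBc" is in neither list: it contains a B, but not at the start or at the end of the line.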

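Use . to match any single character
e.g. ^.$
This pattern will match a line that contains exactly one character, whatever that character is.

Use [...] to specify a set or range of characters
e.g. ^[0123456789]$ or ^[0-9]$
Both patterns match a line that contains exactly one digit.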
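
As a rough sketch, the same two patterns can be checked with Python's re module. The helper name line_matches and the sample inputs are assumptions made for this example.

```python
import re

# ^.$ : a line made of exactly one character (any character except a newline)
single_char = re.compile(r"^.$")

# ^[0-9]$ : a line made of exactly one digit
single_digit = re.compile(r"^[0-9]$")

def line_matches(pattern, line):
    """Return True if the compiled pattern is found in the line."""
    return pattern.search(line) is not None

print(line_matches(single_char, "x"))    # one character
print(line_matches(single_char, "xy"))   # two characters
print(line_matches(single_digit, "7"))   # one digit
print(line_matches(single_digit, "42"))  # two digits
```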

Now a slightly more complex regular expression.
Suppose you want to search for a word that

    1. starts with a capital letter "S",
    2. is the first word on a line,
    3. has a lower case letter as its second letter,
    4. is exactly three letters long, and
    5. has a vowel as its third letter.


Then the regular expression would be: "^S[a-z][aeiou] "
The trailing space is part of the pattern; it is what limits the word to three letters.
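
The five conditions above can be sketched in Python as follows. The function name first_word_matches and the test sentences are hypothetical; only the pattern itself comes from the text.

```python
import re

# ^S[a-z][aeiou] followed by a space:
#   ^        the word is first on the line
#   S        it starts with a capital S
#   [a-z]    the second letter is lower case
#   [aeiou]  the third letter is a vowel
#   " "      the space ends the word after three letters
pattern = re.compile(r"^S[a-z][aeiou] ")

def first_word_matches(line):
    """Return True if the line starts with a word of the described shape."""
    return pattern.search(line) is not None

print(first_word_matches("Sue went home"))    # S, u, e, then a space
print(first_word_matches("Sam went home"))    # third letter is not a vowel
print(first_word_matches("Said hello"))       # word is longer than three letters
print(first_word_matches("I saw Sue there"))  # word is not first on the line
```

One limitation of this pattern: a line that consists of the word alone, with no space after it, is not matched.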


For Repeating Character Sets use *
The modifier part of a regular expression indicates how many times you expect to see the previous character set.

For example, "1*" matches zero or more 1s.

Another example: "^#*"
It will match each and every line, because the regular expression says that the character "#" can appear zero or more times from the beginning of the line.
That means every line will be selected irrespective of whether it has a "#" character in it or not (as * indicates # can appear zero or more times).
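
The "zero or more" behaviour of * is easy to see in a small experiment. This is a sketch in Python; the helper names leading_ones and starts_with_hashes are invented for the example.

```python
import re

def leading_ones(text):
    """Return the run of "1" characters at the start of text (possibly empty)."""
    # 1* always succeeds, because an empty run of 1s is a valid match.
    return re.match(r"1*", text).group(0)

def starts_with_hashes(line):
    """Return True if ^#* matches, which it does for every line."""
    return re.search(r"^#*", line) is not None

print(repr(leading_ones("111abc")))        # the three leading 1s
print(repr(leading_ones("abc")))           # an empty match, not a failure
print(starts_with_hashes("## a comment"))
print(starts_with_hashes("no hash here"))
```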

For matching a specific number of repeats use {  }

You cannot specify a maximum number of repeats for a character set with *.
But there is a provision to do so with { }. You can specify both the minimum and the maximum number of repeats.

You have to put a "\" before each brace when using { }.
There is one more difference here worthy of note.
Normally "\" acts as an escape character that removes a special meaning, but here, in the context of { }, it gives the braces their special meaning.

For example: ^A\{4,8\}B
It will match any line that starts with 4, 5, 6, 7, or 8 capital A's followed by a B.
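
The backslashes before the braces belong to the grep-style syntax used in this post. Python's re module follows the Perl-style syntax, where the braces are written without backslashes, so the same idea becomes ^A{4,8}B there. A minimal sketch, with a made-up helper name:

```python
import re

# Python spelling of the pattern above: braces without backslashes.
four_to_eight_a = re.compile(r"^A{4,8}B")

def starts_with_a_run(line):
    """Return True if the line starts with 4 to 8 capital A's followed by B."""
    return four_to_eight_a.search(line) is not None

print(starts_with_a_run("AAAB"))        # 3 A's: too few
print(starts_with_a_run("AAAAB"))       # 4 A's
print(starts_with_a_run("AAAAAAAAB"))   # 8 A's
print(starts_with_a_run("AAAAAAAAAB"))  # 9 A's: too many
```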

For Matching Words use \< \>
Searching for a word is a little tricky.
A search for "the" will also select a line containing "other", as it has "the" in it.
You can include a space before and after the word to make it match only the whole word.
For example " the "

But the problem here is that it won't match the word with a capital T, or with punctuation next to it.
So here come \< and \> to our rescue.
We will have to use something like "\<[tT]he\>".
The characters \< and \> act like ^ and $, but in the context of a word.
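
Not every regex flavour has \< and \>. Python's re module, for example, does not support them; the closest equivalent there is \b, which matches at either edge of a word. The sketch below uses \b as a stand-in, and the helper name contains_the is an assumption for this example.

```python
import re

# \b[tT]he\b : "the" or "The" as a whole word.
# \b is Python's word-boundary marker, used here in place of \< and \>.
word_the = re.compile(r"\b[tT]he\b")

def contains_the(line):
    """Return True if the line contains "the" or "The" as a separate word."""
    return word_the.search(line) is not None

print(contains_the("The end."))            # capital T, at the start of the line
print(contains_the("Is it the, or not?"))  # followed by punctuation
print(contains_the("other"))               # only part of a longer word
print(contains_the("then"))                # only part of a longer word
```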
 


That's all for now. I hope this is helpful to you all.


