Roumazeilles.net

Regular Expressions for the YGrep Search Engine

 

Search regular expressions

The regular expression routines of the YGrep Search Engine support a full range of Unix regular expressions as defined in ed(1) and grep(1). They are also very similar to the regular expressions provided by the Emacs and MicroEmacs text editors and by the Borland GREP.COM utility.

Specification

| A vertical bar between expressions forces matches onto either the first expression OR the second expression. Up to 10 of these can be combined.
& An ampersand between expressions forces matches onto both the first expression AND the second expression. Up to 10 of these can be combined.
^ A circumflex as the first character of the pattern forces matches to beginning of lines.
$ A dollar as the last character of the pattern forces matches to end of lines.
. A period anywhere in the string matches any single character.
* An expression followed by an asterisk matches zero or more occurrences of that expression.
+ An expression followed by a plus sign matches one or more occurrences of that expression.
- An expression followed by a minus sign optionally matches that expression (matches zero or one occurrence of that expression).
{} An expression followed by a number N in curly braces matches the expression N times.
{} An expression followed by two numbers M and N separated by a comma in curly braces matches the expression M to N times. See the examples.
[] A string enclosed in square brackets matches any character in that string, but no others. If the first character of the string is a circumflex the expression matches any character except the characters in the string. A range of characters may be specified by two characters separated by a -. These are known as character classes.
\< A backslash followed by an opening < matches the beginning of a word.
\> A backslash followed by a closing > matches the end of a word.
\( A backslash followed by an opening ( marks the beginning of a tagged sub-expression (see Search-and-Replace regular expressions; it has no effect on search-only expressions). No more than 9 sub-expressions are allowed by the YGrep Search Engine.
\) A backslash followed by a closing ) marks the end of a tagged sub-expression (see Search-and-Replace regular expressions; it has no effect on search-only expressions). No more than 9 sub-expressions are allowed by the YGrep Search Engine.
\b A backslash followed by a letter 'b' matches the backspace character (ASCII code 8).
\n A backslash followed by a letter 'n' matches the newline character (ASCII code 10).
\f A backslash followed by a letter 'f' matches the form-feed character (ASCII code 12).
\r A backslash followed by a letter 'r' matches the carriage return character (ASCII code 13).
\t A backslash followed by a letter 't' matches the horizontal tab character (ASCII code 9).
\x00 A backslash followed by a letter 'x' and a hexadecimal code matches the character with that hexadecimal ASCII code.
\ A backslash followed by any other character quotes that character. This allows a search for a character that is usually a regular expression specifier.

Important note to the users of the C language:

C programmers should remember that a backslash is a special character in C strings and must be “doubled” to be inserted in a C string constant. The external user types the backslash as described here. For example, in a C source program the expression "\<hello\>" should be written as the C string "\\<hello\\>". This error is very common (I make it myself at regular intervals) and may leave you scratching your head in front of a bizarre bug.
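As a hedged illustration of the doubling rule, the short Python sketch below builds the C string literal for a pattern as the external user would type it. The helper name to_c_literal is hypothetical and is not part of the YGrep Search Engine; it only doubles backslashes and escapes double quotes.

```python
def to_c_literal(pattern):
    """Return the C source literal for a pattern typed by the external user.

    Hypothetical helper for illustration only: it doubles every backslash
    and escapes double quotes, then wraps the result in quotes.
    """
    escaped = pattern.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# The raw string below holds the pattern exactly as the user types it.
c_literal = to_c_literal(r"\<hello\>")
print(c_literal)
```

Running it prints the literal with every backslash doubled, which is the form to paste into C source code.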

If an enclosure must contain either the dash ('-') or the closing bracket (']'), these characters must appear at the beginning of the enclosure list, as in "[]-]". Note that ']' must appear before '-'.
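The same placement convention can be tried out with Python's re module, used here only as a stand-in (it is not the YGrep engine, and the two may differ in other respects). This minimal sketch searches a sample string for the enclosure "[]-]":

```python
import re

# Python's re module is only a stand-in here; it is not the YGrep engine.
# "]" comes first and "-" comes last, so both are taken literally.
enclosure = r"[]-]"

found = re.findall(enclosure, "a]b-c")
print(found)
```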

Examples

foo|bar|toto matches one of the words: foo, or bar, or toto
foo&bar&toto matches all of the words: foo, and bar, and toto
^Windows matches all lines starting with Windows
Grep$ matches all lines ending with Grep
\$ matches a dollar sign
H..p matches Help, Hoop, Harp, etc.
H.*p matches Help, Hoop, Harp, etc. but also fragments beginning with H and finishing with p like Heeellllp, Holy crop, Halt and stop, etc.
^W.n matches all lines starting with Win, Wan, Won, etc.
fo* matches f, fo, foo, etc.
fo+ matches fo, foo, etc.
fo- matches f, fo, but not foo, fooo, etc.
[xyz] matches x, y and z
a[^xyz]c matches abc, arc and aXc but not axc
([0-9]) matches (0), (1), (2), (3), (4), (5), (6), (7), (8) and (9)
([0-9]*) matches (), (0), (123), (2512), etc.
\<[Aa].*\> matches any non-empty word beginning with either a or A like A, Ab, Abc, a, abC, etc.
a{2} matches two (2) characters a
a{2,4} matches 2, 3 or 4 characters a
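Several of the examples above can be checked in any engine with similar syntax. The sketch below uses Python's re module, which is a different engine from YGrep; the translations it relies on are assumptions: the optional operator "-" is written "?" in Python, and literal parentheses must be escaped in Python (the opposite of the YGrep convention). The & operator has no direct Python equivalent and is left out.

```python
import re

# Python's re module stands in for YGrep here; it is a different engine.
# Assumed translations: YGrep "fo-" becomes "fo?", and the literal
# parentheses of "([0-9]*)" must be escaped as "\(" and "\)" in Python.
examples = {
    r"H..p": ["Help", "Hoop", "Harp"],
    r"fo*": ["f", "fo", "foo"],
    r"fo+": ["fo", "foo"],
    r"fo?": ["f", "fo"],
    r"a[^xyz]c": ["abc", "arc", "aXc"],
    r"\([0-9]*\)": ["()", "(0)", "(123)", "(2512)"],
    r"a{2,4}": ["aa", "aaa", "aaaa"],
}


def all_match(pattern, words):
    """True when every word is matched in full by the pattern."""
    return all(re.fullmatch(pattern, word) for word in words)


results = {pattern: all_match(pattern, words)
           for pattern, words in examples.items()}
for pattern, ok in results.items():
    print(pattern, ok)
```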

About the order of precedence of the | and & operators:

You should be aware of the precedence that is implied by the Boolean operators inside a regular expression. The expression is "explored" from the beginning (from the left). The precedence is easy to remember: a YGrep Search Engine regular expression containing Boolean operators is implicitly parenthesized toward the end (to the right). For example:

Expr1 | Expr2 & Expr3 & Expr4 | Expr5

is equivalent to the grouped theoretical expression:

Expr1 | ( Expr2 & ( Expr3 & ( Expr4 | ( Expr5 ))))

Consequently, the exact order of the sub-expressions can be quite important to reach the exact objective of the programmer.
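The right-to-left grouping rule can be sketched in a few lines of Python. The function names group_right and evaluate_right are hypothetical, and the code only models the rule as described in this section; it says nothing about how the YGrep Search Engine is implemented. The second part shows why the order matters: with this rule, A & B | C is read as A & ( B | C ), not as ( A & B ) | C.

```python
def group_right(operands, operators):
    """Render the implicit right-nested grouping as a string."""
    if len(operands) != len(operators) + 1:
        raise ValueError("need exactly one operator between operands")
    grouped = operands[-1]
    # Walk from the right-most operator back to the first one.
    for operand, op in zip(reversed(operands[:-1]), reversed(operators)):
        grouped = f"{operand} {op} ( {grouped} )"
    return grouped


def evaluate_right(values, operators):
    """Evaluate truth values with the same right-nested grouping."""
    result = values[-1]
    for value, op in zip(reversed(values[:-1]), reversed(operators)):
        result = (value or result) if op == "|" else (value and result)
    return result


names = ["Expr1", "Expr2", "Expr3", "Expr4", "Expr5"]
ops = ["|", "&", "&", "|"]
print(group_right(names, ops))

# A & B | C with A = False, B = False, C = True
right_nested = evaluate_right([False, False, True], ["&", "|"])
conventional = (False and False) or True  # "&" binding tighter than "|"
print(right_nested, conventional)
```

The two results differ for this input, which is the situation the paragraph above warns about.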

What is matched with the & operator:

The matched string (as returned by the TagStart[0] and TagEnd[0] of the RGrep() RGREPINFO returned value) is a little special when you use the AND-operator (&) in a regular expression. The matched string is the shortest string containing all of the strings matched by the individual searches.

For example, foo&bar will match on the following line:

Test1 bar schwoop foo test2

The matched string (as returned by the TagStart[0] and TagEnd[0] of the RGrep() RGREPINFO returned value) is "bar schwoop foo".

Although this says nothing about how it is implemented, in this simple example the expression foo&bar is equivalent to bar.*foo|foo.*bar, and both match the same string.
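A minimal Python sketch of this behavior follows. It assumes that the span runs from the earliest start to the latest end of the first match of each sub-pattern; that is one reading of the description above, not a statement about the actual RGrep() implementation. The function name and_span is hypothetical, and Python's re module is used only as a stand-in engine.

```python
import re


def and_span(patterns, line):
    """Return the smallest slice of line covering one match per pattern.

    Assumption: the first match of each pattern is used. Returns None
    when any pattern fails to match, since & requires all of them.
    """
    matches = [re.search(pattern, line) for pattern in patterns]
    if not all(matches):
        return None
    start = min(m.start() for m in matches)
    end = max(m.end() for m in matches)
    return line[start:end]


line = "Test1 bar schwoop foo test2"
span = and_span(["foo", "bar"], line)
alternative = re.search(r"bar.*foo|foo.*bar", line).group(0)
print(span)
print(alternative)
```

On this line both approaches return the same string, as the text states for this simple case.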


Search-and-Replace regular expressions

The YGrep Search Engine regular expression substitution routines support a small set of expressions to define how the substitution will be performed.

Specification

& An ampersand in the substituted string forces insertion of the full matched pattern.
\number A backslash followed by a number (between 1 and 9) forces the insertion of the tag matched with the equivalent number in the pattern.
\0 A backslash followed by a 0 forces the insertion of the full matched pattern (like &).
\& A backslash followed by an ampersand inserts a literal & character (removing its matched-pattern meaning).

Examples

Search pattern: Windows
Replace pattern: MS-&
Substitution: replaces all occurrences of the simple word Windows with MS-Windows

Search pattern: \(dows\)\([Ww]in\)
Replace pattern: \2\1
Substitution: reorders the pattern dowsWin into the normal Windows, regardless of the letter case of the W at the beginning of the word
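The substitution rules can be emulated in Python to experiment with replace patterns. This is a sketch under stated assumptions: Python's re module stands in for the YGrep engine, so tagged sub-expressions are written ( ) instead of \( \); the function names expand and substitute are hypothetical; and a backslash followed by any character other than a digit is assumed to insert that character literally, which the specification above only states for \&.

```python
import re


def expand(template, match):
    """Expand a YGrep-style replace pattern for one match."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt in "0123456789":
                # \0 is the full match, \1 to \9 are the tags.
                out.append(match.group(int(nxt)) or "")
            else:
                # Assumed: any other escaped character is literal (e.g. \&).
                out.append(nxt)
            i += 2
        elif ch == "&":
            out.append(match.group(0))  # full matched pattern
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def substitute(pattern, template, text):
    """Replace every match of pattern in text using the template."""
    return re.sub(pattern, lambda m: expand(template, m), text)


print(substitute("Windows", "MS-&", "I use Windows"))
print(substitute(r"(dows)([Ww]in)", r"\2\1", "dowsWin and dowswin"))
```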


http://www.roumazeilles.net/

Copyright (c) 1999-2008 - Yves Roumazeilles (all rights reserved)

Latest update: 30-oct-08
