Regular Expression Syntax
Literals
All characters are taken literally except the following:
".", "|", "*", "?", "+", "(", ")", "{", "}", "[", "]", "^", "$" and "\".
These characters have special meaning and must be preceded by a "\" to be taken literally.
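As a rough illustration of escaping, the sketch below uses Python's re module, which is an assumed stand-in and not necessarily the engine this document describes. It shows the same text matched with and without a backslash before "+".

```python
import re

# Unescaped, "+" is a repeat operator: "1+1" means one or more "1"
# followed by another "1", so it does not match the text "1+1=2".
unescaped = re.search(r"1+1", "1+1=2")

# Escaped, "\+" is a literal plus sign.
escaped = re.search(r"1\+1", "1+1=2")

# re.escape() inserts the backslashes for arbitrary literal text.
auto_escaped = re.escape("a.b")

print(unescaped, escaped.group(0), auto_escaped)
```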
Wildcards
The dot "." matches any single character, including the line-break characters [CR] and [LF].
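A small sketch of this behaviour, assuming Python's re module as a stand-in engine. In Python the dot skips [LF] by default, so the re.DOTALL flag is needed to reproduce the rule described here.

```python
import re

text = "a\r\nb"

# With re.DOTALL, "." also matches line breaks, as this document describes.
with_line_breaks = re.fullmatch(r"a..b", text, flags=re.DOTALL)

# Python's default is narrower: "." matches "\r" but not "\n".
python_default = re.fullmatch(r"a..b", text)

print(with_line_breaks is not None, python_default is not None)
```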
Repeats
An expression followed by "*" can be repeated any number of times including zero.
An expression followed by "+" can be repeated any number of times excluding zero.
An expression followed by "?" is optional: it may occur once or not at all.
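The three repeat operators can be compared side by side. This is a minimal sketch using Python's re module as an assumed stand-in; the helper name whole_match is hypothetical.

```python
import re

def whole_match(pattern, text):
    """Return True if the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

candidates = ("a", "ab", "abbb")

# "b*": zero or more "b" after the "a".
star = [s for s in candidates if whole_match(r"ab*", s)]

# "b+": at least one "b" after the "a".
plus = [s for s in candidates if whole_match(r"ab+", s)]

# "b?": at most one "b" after the "a".
optional = [s for s in candidates if whole_match(r"ab?", s)]

print(star, plus, optional)
```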
The bounds "{" "}" may be used to specify the number of repetitions:
"{N}" means that the expression must be repeated N times,
"{N,M}" means that the expression must be repeated N to M times.
Subexpressions and parentheses
Parentheses "(" ")" mark subexpressions, which are numbered from 1 in left-to-right order.
Subexpression zero is the whole match of the expression.
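The numbering can be seen in a short sketch, again assuming Python's re module as a stand-in engine. Its group() method takes the subexpression number.

```python
import re

m = re.search(r"(\d+)-(\d+)", "pages 12-34")

whole = m.group(0)   # subexpression zero: the whole match
first = m.group(1)   # first parenthesized subexpression
second = m.group(2)  # second parenthesized subexpression

print(whole, first, second)
```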
Alternatives
Alternative expressions are separated by "|" or put on separate lines in the expression.
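A brief sketch of alternatives, assuming Python's re module as a stand-in. That module only understands the "|" form, so the example also shows one possible way to turn alternatives written on separate lines into a single expression.

```python
import re

# Alternatives separated by "|".
found = re.findall(r"cat|dog", "cat, dog, bird")

# Alternatives written one per line, joined with "|" for an engine
# that does not accept the multi-line form directly (an assumption
# about the target engine, not part of the syntax described here).
multi_line = "cat\ndog"
joined = "|".join(multi_line.splitlines())

print(found, joined)
```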
Line anchors
The empty string at the beginning of a line is matched by the "^" character.
The empty string at the end of a line is matched by the "$" character.
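The line anchors can be tried out as follows. The sketch assumes Python's re module, where the re.MULTILINE flag is required for "^" and "$" to match at every line rather than only at the ends of the text.

```python
import re

text = "one\ntwo\nthree"

# Words that begin a line with "t".
line_starts = re.findall(r"^t\w*", text, flags=re.MULTILINE)

# Words that end a line with "e".
line_ends = re.findall(r"\w*e$", text, flags=re.MULTILINE)

print(line_starts, line_ends)
```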
Text anchors
"\`" matches the start of the whole text.
"\A" matches the start of the whole text.
"\'" matches the end of the whole text.
"\z" matches the end of the whole text.
"\Z" matches the end of the whole text, or the position just before any newline characters at the end.
Character sets
The character set enclosed in brackets "[" "]" matches any symbol it contains,
for example "[abc]" matches either "a", "b" or "c".
A set that starts with "^" matches any character that is not a member of the set,
for example "[^abc]" matches any character except "a", "b" and "c".
Character ranges can be specified as "[a-d]", which matches any symbol between "a" and "d" inclusive.
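The three forms of set can be compared on the same input. This is a minimal sketch that assumes Python's re module as a stand-in engine.

```python
import re

text = "abcdef"

in_set = re.findall(r"[abc]", text)    # members of the set
negated = re.findall(r"[^abc]", text)  # everything outside the set
in_range = re.findall(r"[a-d]", text)  # "a" through "d"

print(in_set, negated, in_range)
```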
Character classes are denoted by "[:class:]" within a set declaration.
Commonly used character classes are:
[:alnum:] | Alphanumeric character. |
[:alpha:] | Alphabetical character a-z and A-Z. |
[:blank:] | Blank character, either a space or a tab. |
[:cntrl:] | Control character. |
[:digit:] | Digit 0-9. |
[:graph:] | Graphical character. |
[:lower:] | Lower case character a-z. |
[:print:] | Printable character. |
[:punct:] | Punctuation character. |
[:space:] | Whitespace character. |
[:upper:] | Upper case character A-Z. |
[:xdigit:] | Hexadecimal digit character, 0-9, a-f and A-F. |
[:word:] | Word character - all alphanumeric characters plus the underscore. |
[:Unicode:] | Character whose code is greater than 255; applies to Unicode characters only. |
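Not every engine accepts the "[:class:]" notation. Python's re module, used here as an assumed stand-in, does not, so the sketch below expands a few of the classes into the explicit ASCII ranges listed in the table. The names POSIX_CLASSES and expand_classes are hypothetical, and only a subset of the classes is covered.

```python
import re

# ASCII expansions for some of the classes in the table above.
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
    "word": "a-zA-Z0-9_",
}

def expand_classes(pattern):
    """Replace each [:name:] with its ASCII range.

    Raises KeyError for a class that is not in POSIX_CLASSES.
    """
    return re.sub(
        r"\[:(\w+):\]",
        lambda m: POSIX_CLASSES[m.group(1)],
        pattern,
    )

# A letter or underscore followed by letters, digits or underscores.
pattern = expand_classes("[[:alpha:]_][[:alnum:]_]*")
print(pattern)
```

The expanded pattern can then be passed to re.fullmatch or re.search as usual.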
Character codes
The characters may be matched by octal code "\0NNN" or hexadecimal code "\xHH",
enclosed in braces "{" "}" if necessary: "\0{NNN}" "\x{HH}".
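To find the code for a given character, something like the following sketch may help. It is written in Python, and the helper names hex_escape and octal_escape are hypothetical; they only build the braced notations shown above as plain strings. The final line checks the "\xHH" form with Python's re module, which accepts that form but not the braced ones.

```python
import re

def hex_escape(ch):
    """Build the braced hexadecimal form for one character (hypothetical helper)."""
    return "\\x{%X}" % ord(ch)

def octal_escape(ch):
    """Build the braced octal form for one character (hypothetical helper)."""
    return "\\0{%o}" % ord(ch)

print(hex_escape("A"), octal_escape("A"))

# "A" is hexadecimal 41 and "B" is hexadecimal 42.
m = re.fullmatch(r"\x41\x42", "AB")
```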
Word operators
"\<" matches the null string at the start of a word.
"\>" matches the null string at the end of a word.
"\b" matches the null string at either the start or the end of a word.
"\B" matches the null string within a word.
The beginning of the text is a potential start of a word and the end of the text is a potential end of a word.
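The difference between "\b" and "\B" shows up when the same letters occur as a whole word, at the start of a word and inside a word. The sketch assumes Python's re module as a stand-in; it supports "\b" and "\B" but not "\<" and "\>".

```python
import re

text = "cat concat catalog"

def starts(pattern):
    """Return the start offset of every match in text (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text)]

whole_word = starts(r"\bcat\b")  # only the standalone word
word_start = starts(r"\bcat")    # "cat" and the beginning of "catalog"
inside = starts(r"\Bcat")        # the "cat" inside "concat"

print(whole_word, word_start, inside)
```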
Back references
The text matched by a subexpression can be reused later in the expression through the labels "\1" to "\9", where the digit is the subexpression number.
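A common use of back references is finding a word that is accidentally repeated. This is a minimal sketch, assuming Python's re module as a stand-in engine.

```python
import re

# "(\w+)" captures a word; "\1" requires the same text again.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")

# The back reference must repeat the captured text exactly.
no_repeat = re.search(r"\b(\w+) \1\b", "it is fine")

print(doubled.group(0), doubled.group(1), no_repeat)
```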
Miscellaneous escape sequences
\w | Equivalent to [[:word:]]. |
\W | Equivalent to [^[:word:]]. |
\s | Equivalent to [[:space:]]. |
\S | Equivalent to [^[:space:]]. |
\d | Equivalent to [[:digit:]]. |
\D | Equivalent to [^[:digit:]]. |
\l | Equivalent to [[:lower:]]. |
\L | Equivalent to [^[:lower:]]. |
\u | Equivalent to [[:upper:]]. |
\U | Equivalent to [^[:upper:]]. |
\C | Any single character, equivalent to ".". |
\X | Match any Unicode combining character sequence, for example "a\x{0301}" (a letter a with an acute). |
\Q | The begin quote operator: everything that follows is treated as a literal character until a \E end quote operator is found. |
\E | The end quote operator, terminates a sequence started with \Q. |
\a | Bell character 0x07. |
\f | Form feed character 0x0C. |
\n | Newline character 0x0A. |
\r | Carriage return character 0x0D. |
\t | Tab character 0x09. |
\v | Vertical tab character 0x0B. |
\e | ASCII Escape character 0x1B. |
\0dd | An octal character code, where dd is one or more octal digits. |
\xXX | A hexadecimal character code, where XX is one or more hexadecimal digits. |
\x{XX} | A hexadecimal character code, where XX is one or more hexadecimal digits, optionally a Unicode character. |
\cZ | An ASCII escape sequence control-Z, where Z is any ASCII character greater than or equal to the character code for '@'. |
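A few of these escapes can be tried on a sample line. The sketch assumes Python's re module as a stand-in; it supports "\d", "\D", "\w", "\s" and "\t" with the meanings given above, but not every escape in the table (for example "\l", "\u", "\C" and "\Q" are not available there).

```python
import re

line = "id_7\t42 items"

digits = re.findall(r"\d+", line)      # runs of digits
non_digits = re.findall(r"\D+", line)  # runs of anything else
words = re.findall(r"\w+", line)       # runs of word characters
fields = re.split(r"\s+", line)        # split on whitespace
has_tab = re.search(r"\t", line) is not None

print(digits, non_digits, words, fields, has_tab)
```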