Regular Expressions Syntax

Literals

All characters are taken literally except the following:
".", "|", "*", "?", "+", "(", ")", "{", "}", "[", "]", "^", "$" and "\".
These characters have special meaning and must be preceded by a "\" to be taken literally.
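
As an illustration only, the sketch below uses Python's "re" module, which is a different engine from the one documented here but treats the characters listed above as special in the same way. The sample text is made up; "re.escape" adds the "\" prefixes automatically.

```python
import re

# Made-up text that contains several of the special characters.
literal = "price (USD): $4.50+tax"

# re.escape puts a "\" before every character that could be special.
pattern = re.escape(literal)
matched = re.search(pattern, "Total price (USD): $4.50+tax due")

# Without escaping, "." is a wildcard, so "a.c" also matches "abc".
unescaped_hit = re.fullmatch("a.c", "abc")

# With escaping, "a\.c" matches only the literal text "a.c".
escaped_hit = re.fullmatch(re.escape("a.c"), "abc")
```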

Wildcards

The dot "." matches any single character, including the new line characters [CR] and [LF].
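
A sketch of this behaviour in Python's "re" module, used here only for illustration. Note the assumption: in Python the dot skips the line feed by default, so the "re.DOTALL" flag is needed to reproduce the rule stated above.

```python
import re

text = "first\r\nsecond"

# Default Python behaviour: "." does not match the line feed,
# so the pattern cannot cross the line break.
default_match = re.search("first.*second", text)

# With re.DOTALL the dot also matches [CR] and [LF].
dotall_match = re.search("first.*second", text, re.DOTALL)
```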

Repeats

An expression followed by "*" can be repeated any number of times including zero.
An expression followed by "+" can be repeated any number of times excluding zero.
An expression followed by "?" is optional: it can occur zero times or once.
The bounds "{" "}" may be used to specify the number of repetitions: "{N}" means that the expression must be repeated exactly N times, "{N,M}" means that the expression must be repeated N to M times.
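
The repeat operators can be tried out with a small sketch in Python's "re" module (an illustrative stand-in for the engine described here). The helper name "matches" is hypothetical; it reports whether a pattern matches an entire string.

```python
import re

def matches(pattern, text):
    """Return True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# "b*": zero or more "b"
star = [matches("ab*", s) for s in ("a", "ab", "abbb")]
# "b+": one or more "b"
plus = [matches("ab+", s) for s in ("a", "ab", "abbb")]
# "b?": zero or one "b"
optional = [matches("ab?", s) for s in ("a", "ab", "abb")]
# "b{2}": exactly two "b"
exact = [matches("ab{2}", s) for s in ("ab", "abb", "abbb")]
# "b{1,2}": one to two "b"
bounded = [matches("ab{1,2}", s) for s in ("a", "ab", "abb", "abbb")]
```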

Subexpressions and parentheses

Parentheses "(" ")" are used to mark subexpressions, which are numbered starting from 1, from left to right. Subexpression zero is the whole match of the expression.
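
The numbering can be seen in the following sketch, written with Python's "re" module as an illustrative stand-in. With nested parentheses, Python numbers the subexpressions by the position of the opening "("; it is an assumption that the engine described here does the same.

```python
import re

# Four subexpressions; the first one encloses the second and third.
m = re.search(r"(([0-9]+)-([0-9]+)) ([a-z]+)", "range 10-20 items")

whole = m.group(0)    # subexpression zero: the whole match
outer = m.group(1)    # first "(": the complete range
low = m.group(2)      # second "(": lower bound
high = m.group(3)     # third "(": upper bound
unit = m.group(4)     # fourth "(": the trailing word
```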

Alternatives

Alternative expressions are separated by "|" or put on separate lines in the expression.
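
A brief sketch with Python's "re" module, used only for illustration. Python has no notion of alternatives on separate lines, so the sketch joins the lines with "|" to get the equivalent single-line expression; the sample words are made up.

```python
import re

# Alternatives written on one line with "|".
single_line = "cat|dog|bird"

# The same alternatives written on separate lines, joined with "|"
# because Python's re module only understands the "|" form.
multi_line = "cat\ndog\nbird"
joined = "|".join(multi_line.splitlines())

found = re.findall(joined, "a dog chased a cat")
```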

Line anchors

The empty string at the beginning of a line is matched by the "^" character.
The empty string at the end of a line is matched by the "$" character.
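
The line anchors can be sketched in Python's "re" module (an illustrative stand-in). One assumption to note: Python needs the "re.MULTILINE" flag for "^" and "$" to match at every line rather than only at the ends of the whole text.

```python
import re

text = "alpha\nbeta\ngamma"

# First word of every line.
starts = re.findall(r"^[a-z]+", text, re.MULTILINE)

# Last character of every line.
ends = re.findall(r"[a-z]$", text, re.MULTILINE)

# "^" matches an empty string, so substituting it inserts text
# at the beginning of each line.
quoted = re.sub(r"^", "> ", text, flags=re.MULTILINE)
```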

Text anchors

"\`" matches the start of the whole text.
"\A" matches the start of the whole text.
"\'" matches the end of the whole text.
"\z" matches the end of the whole text.
"\Z" matches the end of the whole text, or the position before any new line characters at the end.

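The difference between the text anchors and the line anchors can be sketched in Python's "re" module, used only for illustration. Two assumptions: Python's own "\Z" matches only at the very end of the text (the behaviour listed for "\z" above), and the lenient "\Z" described above is emulated here with a lookahead.

```python
import re

text = "one\ntwo\n"

# "\A" matches only at the start of the whole text, even in
# multiline mode, so only the first line is found.
at_start = re.findall(r"\A[a-z]+", text, re.MULTILINE)

# Python's "\Z" is strict: the trailing new line prevents a match.
strict_end = re.search(r"two\Z", text)

# Emulated lenient end anchor: new line characters may follow.
lenient_end = re.search(r"two(?=\n*\Z)", text)
```
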
Character sets

The character set enclosed in brackets "[" "]" matches any symbol it contains, for example "[abc]" matches either "a", "b" or "c".
Sets that start with "^" match any character that is not a member of the set, for example "[^abc]" matches any character except "a", "b" and "c".
Character ranges can be specified as "[a-d]", which matches any symbol between "a" and "d", inclusive.
Character classes are denoted by "[:class:]" within a set declaration, for example "[[:digit:]]".
Commonly used character classes are:

[:alnum:]	Alphanumeric character.
[:alpha:]	Alphabetical character a-z and A-Z.
[:blank:]	Blank character, either a space or a tab.
[:cntrl:]	Control character.
[:digit:]	Digit 0-9.
[:graph:]	Graphical character.
[:lower:]	Lower case character a-z.
[:print:]	Printable character.
[:punct:]	Punctuation character.
[:space:]	Whitespace character.
[:upper:]	Upper case character A-Z.
[:xdigit:]	Hexadecimal digit character, 0-9, a-f and A-F.
[:word:]	Word character - all alphanumeric characters plus the underscore.
[:Unicode:]	Character whose code is greater than 255 (applies to Unicode characters only).
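
Sets, negated sets and ranges can be sketched in Python's "re" module, used only for illustration. Python does not support the "[:class:]" notation, so the sketch includes a hypothetical helper, "expand_classes", that rewrites a few class names into explicit ASCII ranges; the table of ranges is an assumption based on the descriptions above.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")      # any listed character
non_digits = re.findall(r"[^0-9]", "a1b22c")    # anything not in the set
a_to_d = re.findall(r"[a-d]", "abcdefg")        # a range

# Assumed ASCII equivalents for some of the classes listed above.
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
    "word": "a-zA-Z0-9_",
}

def expand_classes(pattern):
    """Replace each "[:name:]" in a pattern with its ASCII range."""
    return re.sub(r"\[:([A-Za-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

hex_pattern = expand_classes("[[:xdigit:]]+")
hex_runs = re.findall(hex_pattern, "ff 00 zz 7B")
```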

Character codes

Characters may be matched by octal code "\0NNN" or hexadecimal code "\xHH", with the digits enclosed in braces "{" "}" if necessary: "\0{NNN}", "\x{HH}".
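
A small sketch in Python's "re" module, used only for illustration. Python accepts the two-digit "\xHH" form, but its octal and braced forms differ from the ones above, so the hypothetical helper "code_pattern" builds the pattern from a numeric character code instead.

```python
import re

# "\x41" is the hexadecimal code of "A".
hex_hit = re.findall(r"\x41", "ABA")

def code_pattern(code):
    """Build a pattern that matches the character with this code."""
    return re.escape(chr(code))

# 0o101 is the octal notation for the same character, "A".
octal_hit = re.findall(code_pattern(0o101), "ABA")

# A code above 255: U+20AC, the euro sign.
wide_hit = re.findall(code_pattern(0x20AC), "5 \u20ac or 6 \u20ac")
```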

Word operators

"\<" matches the null string at the start of a word.
"\>" matches the null string at the end of a word.
"\b" matches the null string at either the start or the end of a word.
"\B" matches the null string within a word.
The beginning of the text is a potential start of a word, and the end of the text is a potential end of a word.
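
The word operators can be sketched in Python's "re" module, used only for illustration. Python supports "\b" and "\B" but not "\<" and "\>", so the constants WORD_START and WORD_END below are assumed emulations built from "\b" and a lookaround.

```python
import re

text = "cat concat catalog"

# "\b" on both sides: only the standalone word.
whole_word = re.findall(r"\bcat\b", text)

# "\B" in front: "cat" must continue a word that is already under way.
inside = re.findall(r"\Bcat", text)

# Emulations of "\<" and "\>": a word boundary with a word
# character after it (start) or before it (end).
WORD_START = r"\b(?=\w)"
WORD_END = r"\b(?<=\w)"

starts_with_cat = re.findall(WORD_START + r"cat\w*", text)
ends_with_cat = re.findall(r"\w*cat" + WORD_END, text)
```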

Back references

The text matched by a subexpression can be used further in the expression through the labels "\1" to "\9", where the digit is the number of the subexpression.
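
A typical use of back references is finding a repeated word. The sketch below uses Python's "re" module as an illustrative stand-in; the sample sentence is made up.

```python
import re

sentence = "it was was the the best"

# "\1" must match exactly the text captured by subexpression 1.
doubled = re.findall(r"\b(\w+) \1\b", sentence)

# The same label can be used in the replacement to drop the duplicate.
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", sentence)

# A quoted string: the closing quote must equal the opening quote.
quoted = re.search(r"(['\"])(.*?)\1", "say 'hello' now")
```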

Miscellaneous escape sequences

\w	Equivalent to [[:word:]].
\W	Equivalent to [^[:word:]].
\s	Equivalent to [[:space:]].
\S	Equivalent to [^[:space:]].
\d	Equivalent to [[:digit:]].
\D	Equivalent to [^[:digit:]].
\l	Equivalent to [[:lower:]].
\L	Equivalent to [^[:lower:]].
\u	Equivalent to [[:upper:]].
\U	Equivalent to [^[:upper:]].
\C	Any single character, equivalent to ".".
\X	Match any Unicode combining character sequence, for example "a" followed by the combining acute accent U+0301 (a letter a with an acute).
\Q	The begin quote operator: everything that follows is treated as a literal character until a \E end quote operator is found.
\E	The end quote operator, terminates a sequence started with \Q.
\a	Bell character 0x07.
\f	Form feed character 0x0C.
\n	Newline character 0x0A.
\r	Carriage return character 0x0D.
\t	Tab character 0x09.
\v	Vertical tab character 0x0B.
\e	ASCII Escape character 0x1B.
\0dd	An octal character code, where dd is one or more octal digits.
\xXX	A hexadecimal character code, where XX is one or more hexadecimal digits.
\x{XX}	A hexadecimal character code, where XX is one or more hexadecimal digits, optionally a Unicode character.
\cZ	An ASCII escape sequence control-Z, where Z is any ASCII character greater than or equal to the character code for '@'.
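
A few of these sequences can be tried in Python's "re" module, used only for illustration. Python shares "\d", "\w", "\s" and "\t" with the table above, but several other entries (for instance "\Q" and "\E") are not available there, so the hypothetical helper "expand_quotes" emulates quoting with "re.escape".

```python
import re

line = "id_7\tvalue 42"

digits = re.findall(r"\d+", line)    # runs of digits
words = re.findall(r"\w+", line)     # runs of word characters
fields = re.split(r"\s+", line)      # split on whitespace
has_tab = re.search(r"\t", line) is not None

def expand_quotes(pattern):
    # Replace each quoted span (from a begin quote operator to the
    # end quote operator, or to the end of the pattern) with the
    # escaped literal text.
    return re.sub(r"\\Q(.*?)(?:\\E|\Z)",
                  lambda m: re.escape(m.group(1)),
                  pattern,
                  flags=re.DOTALL)

# "1+1" is quoted, so "+" is literal; "\d" stays an operator.
expanded = expand_quotes(r"\Q1+1\E=\d")
quoted_hit = re.fullmatch(expanded, "1+1=2")
```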