A regular expression (RegEx) in a programming language is a special text string used for describing a search pattern. It is extremely useful for extracting information from text such as code, files, logs, spreadsheets, or even documents.
The most common uses of regular expressions are:
A regular expression can be formed by using a mix of metacharacters and special sequences.
Metacharacters are characters that a RegEx engine interprets in a special way instead of matching them literally. Here's a list of metacharacters:
Metacharacter | Description |
---|---|
[] | Represents a set of characters. |
. | Represents any character except the newline character. |
^ | Matches at the beginning of the string. |
$ | Matches at the end of the string. |
* | Represents zero or more occurrences of the preceding pattern. |
+ | Represents one or more occurrences of the preceding pattern. |
{ } | Represents a specified number (or range) of occurrences of the preceding pattern. |
\| | Alternation: matches either the pattern on its left or the one on its right. |
( ) | Captures and groups sub-patterns. |
\ | Escapes a metacharacter or introduces a special sequence. |
Square brackets specify a set of characters you wish to match.
For example, [abc] matches if the string contains any of a, b, or c.
You can also specify a range of characters using - inside square brackets.
[a-e] is the same as [abcde]
[1-4] is the same as [1234]
[0-39] is the same as [01239]
You can complement (invert) the character set by placing the caret symbol ^ at the start of the square brackets.
[^abc] means any character except a or b or c.
[^0-9] means any non-digit character.
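As a quick check of the character-set rules above, here is a minimal sketch using Python's built-in `re` module. The sample strings and variable names are chosen only for illustration.

```python
import re

# [abc] matches each single character that is a, b, or c.
set_matches = re.findall(r"[abc]", "abc de ca")

# [0-39] is the range 0-3 plus a literal 9, i.e. the same as [01239].
range_matches = re.findall(r"[0-39]", "0123456789")

# [^0-9] is the complemented set: every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(set_matches)    # ['a', 'b', 'c', 'c', 'a']
print(range_matches)  # ['0', '1', '2', '3', '9']
print(non_digits)     # ['a', 'b']
```

`re.findall` returns every non-overlapping match, which makes it convenient for seeing exactly which characters a set accepts.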
A period matches any single character (except newline '\n').
Expression | String | Matched? |
---|---|---|
a...n | abn | No match |
a...n | alian | Match |
a...n | abysn | Match |
a...n | Alian | No match |
a...n | An abacus | No match |
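The strings above can be tried in Python. This sketch assumes `re.search`, which looks for the pattern anywhere in the string; the variable names are illustrative.

```python
import re

pattern = re.compile(r"a...n")  # a, then any three characters, then n

samples = ["abn", "alian", "abysn", "Alian", "An abacus"]
dot_results = {s: bool(pattern.search(s)) for s in samples}
print(dot_results)

# By default the period does not match a newline character.
newline_match = re.search(r"a.b", "a\nb")
print(newline_match)  # None
```

Matching is case-sensitive by default, which is why "Alian" fails: its only lowercase a is not followed by three characters and an n.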
The caret symbol ^ is used to check if a string starts with a certain character.
Expression | String | Matched? |
---|---|---|
^a | a | Match |
^a | abc | Match |
^a | bac | No match |
^ab | abc | Match |
^ab | acb | No match (starts with a, but not followed by b) |
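A small sketch of the caret anchor in Python follows. The helper `starts_with` is a hypothetical name introduced here for illustration, not part of the `re` module.

```python
import re

def starts_with(pattern, text):
    """Return True if the ^-anchored pattern matches at the start of text."""
    return re.search(pattern, text) is not None

caret_a = [starts_with(r"^a", s) for s in ["a", "abc", "bac"]]
caret_ab = [starts_with(r"^ab", s) for s in ["abc", "acb"]]

print(caret_a)   # [True, True, False]
print(caret_ab)  # [True, False]
```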
The dollar symbol $ is used to check if a string ends with a certain character.
Expression | String | Matched? |
---|---|---|
a$ | a | Match |
a$ | formula | Match |
a$ | cab | No match |
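The end-of-string anchor can be checked the same way. This is a minimal sketch with illustrative names; the last two lines show one Python-specific detail of `$`.

```python
import re

samples = ["a", "formula", "cab"]
dollar_results = {s: bool(re.search(r"a$", s)) for s in samples}
print(dollar_results)  # {'a': True, 'formula': True, 'cab': False}

# In Python, $ also matches just before a single trailing newline.
trailing_newline = bool(re.search(r"a$", "formula\n"))
print(trailing_newline)  # True
```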
The star symbol * matches zero or more occurrences of the pattern to its left.
Expression | String | Matched? |
---|---|---|
ma*n | mn | Match |
ma*n | man | Match |
ma*n | maaan | Match |
ma*n | main | No match (a is not followed by n) |
ma*n | woman | Match |
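To see the star quantifier in action, the sketch below runs the pattern against the same strings with Python's `re.search`. Variable names are illustrative.

```python
import re

samples = ["mn", "man", "maaan", "main", "woman"]
star_results = {s: bool(re.search(r"ma*n", s)) for s in samples}
print(star_results)

# The star is greedy: it consumes as many a's as it can.
longest = re.search(r"ma*n", "maaan").group()
print(longest)  # maaan
```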
The plus symbol + matches one or more occurrences of the pattern to its left.
Expression | String | Matched? |
---|---|---|
ma+n | mn | No match (no a character) |
ma+n | man | Match |
ma+n | maaan | Match |
ma+n | main | No match (a is not followed by n) |
ma+n | woman | Match |
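The only difference between + and * is whether zero occurrences are allowed. A short sketch, with illustrative names, that contrasts the two on the string "mn":

```python
import re

samples = ["mn", "man", "maaan", "main", "woman"]
plus_results = {s: bool(re.search(r"ma+n", s)) for s in samples}
print(plus_results)

# "mn" has no a between m and n: * accepts that, + does not.
star_on_mn = bool(re.search(r"ma*n", "mn"))
plus_on_mn = bool(re.search(r"ma+n", "mn"))
print(star_on_mn, plus_on_mn)  # True False
```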
Braces take the form {n,m}, which means at least n and at most m repetitions of the pattern to its left.
Expression | String | Matched? |
---|---|---|
a{2,3} | abc dat | No match |
a{2,3} | abc daat | 1 match (aa in daat) |
a{2,3} | aabc daaat | 2 matches (aa in aabc and aaa in daaat) |
a{2,3} | aabc daaaat | 2 matches (aa in aabc and aaa in daaaat) |
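Listing the matched substrings makes the repetition bounds easier to see. The following sketch uses `re.findall`; the dictionary name is illustrative.

```python
import re

texts = ["abc dat", "abc daat", "aabc daaat", "aabc daaaat"]
brace_matches = {t: re.findall(r"a{2,3}", t) for t in texts}

for text, found in brace_matches.items():
    print(repr(text), found)
# 'abc dat' []
# 'abc daat' ['aa']
# 'aabc daaat' ['aa', 'aaa']
# 'aabc daaaat' ['aa', 'aaa']
```

In the last string, the run of four a's yields one match of three; the single a left over is too short to match again.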
Vertical bar | is used for alternation (or operator).
Expression | String | Matched? |
---|---|---|
a\|b | cde | No match |
a\|b | ade | 1 match (the a in ade) |
a\|b | acdbea | 3 matches (a, b, and a in acdbea) |

Here, a|b matches any string that contains either a or b.
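A brief Python sketch of alternation follows. The second pattern, `cat|dog`, is an extra illustration not taken from the text above; it shows that each side of the bar can be a longer sub-pattern, not just one character.

```python
import re

texts = ["cde", "ade", "acdbea"]
alt_matches = {t: re.findall(r"a|b", t) for t in texts}
print(alt_matches)  # {'cde': [], 'ade': ['a'], 'acdbea': ['a', 'b', 'a']}

# Each alternative may be a whole sub-pattern.
word_matches = re.findall(r"cat|dog", "cat and dog")
print(word_matches)  # ['cat', 'dog']
```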
Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.
Expression | String | Matched? |
---|---|---|
(a\|b\|c)xz | ab xz | No match |
(a\|b\|c)xz | abxz | 1 match (bxz in abxz) |
(a\|b\|c)xz | axz cabxz | 2 matches (axz, and bxz in cabxz) |
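Grouping also captures what the group matched, which changes what some Python functions return. The sketch below uses a hypothetical helper, `full_matches`, to list whole matches, and then shows that `re.findall` returns only the captured group when the pattern contains one.

```python
import re

pattern = re.compile(r"(a|b|c)xz")

def full_matches(text):
    """Return the full text of every match (group 0), not just the group."""
    return [m.group(0) for m in pattern.finditer(text)]

texts = ["ab xz", "abxz", "axz cabxz"]
group_results = {t: full_matches(t) for t in texts}
print(group_results)
# {'ab xz': [], 'abxz': ['bxz'], 'axz cabxz': ['axz', 'bxz']}

# With one capturing group, findall returns the group's contents instead.
captured = pattern.findall("axz cabxz")
print(captured)  # ['a', 'b']
```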
Backslash \ is used to escape various characters, including all metacharacters.
For example, \$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way.
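If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is treated literally. Letters and digits are the exception: a backslash followed by one of them may form a special sequence instead.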
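The effect of escaping can be sketched in Python as follows. The sample string is made up for illustration; `re.escape` is a standard-library function that adds the backslashes for you.

```python
import re

text = "cost: $a"

# Escaped: \$ matches a literal dollar sign.
escaped = re.search(r"\$a", text)
print(escaped.group())  # $a

# Unescaped: $ means "end of string", so nothing can follow it.
unescaped = re.search(r"$a", text)
print(unescaped)  # None

# re.escape builds a pattern that matches the given text literally.
auto_escaped = re.escape("$a")
print(auto_escaped)  # \$a
```

Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting the backslash before the RegEx engine sees it.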
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:
Special sequence | Description |
---|---|
\A | Matches only at the beginning of the string. |
\d | Matches any decimal digit ([0-9] for ASCII text). |
\D | Matches any character that is not a decimal digit. |
\s | Matches any whitespace character. |
\S | Matches any non-whitespace character. |
\w | Matches any word character (letters, digits, and underscore). |
\W | Matches any non-word character. |
\Z | Matches only at the end of the string. |
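Expression | String | Matched? |
---|---|---|
\Athe | the sun | Match |
\Athe | In the sun | No match |

Expression | String | Matched? |
---|---|---|
\d | 12abc3 | 3 matches (1, 2, and 3) |
\d | Python | No match |

Expression | String | Matched? |
---|---|---|
\D | 1ab34"50 | 3 matches (a, b, and ") |
\D | 1345 | No match |

Expression | String | Matched? |
---|---|---|
\s | Python RegEx | 1 match |
\s | PythonRegEx | No match |

Expression | String | Matched? |
---|---|---|
\S | a b | 2 matches (a and b) |
\S | (whitespace only) | No match |

Expression | String | Matched? |
---|---|---|
\w | 12&": ;c | 3 matches (1, 2, and c) |
\w | %"> ! | No match |

Expression | String | Matched? |
---|---|---|
\W | 1a2%c | 1 match (%) |
\W | Python | No match |

Expression | String | Matched? |
---|---|---|
Python\Z | I like Python | 1 match |
Python\Z | I like Python. | No match |
Python\Z | Python is fun. | No match |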
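The special sequences can be exercised together in one short Python sketch. The names below are illustrative, and the behaviour shown is that of Python's `re` module on plain ASCII strings. Note that \Z is written after the text it anchors, since it marks the end of the string.

```python
import re

digits = re.findall(r"\d", "12abc3")          # ['1', '2', '3']
non_digits = re.findall(r"\D", '1ab34"50')    # ['a', 'b', '"']
spaces = re.findall(r"\s", "Python RegEx")    # [' ']
non_spaces = re.findall(r"\S", "a b")         # ['a', 'b']
word_chars = re.findall(r"\w", '12&": ;c')    # ['1', '2', 'c']
non_word = re.findall(r"\W", "1a2%c")         # ['%']

# \A anchors at the very start, \Z at the very end of the string.
starts = [bool(re.search(r"\Athe", s)) for s in ["the sun", "In the sun"]]
ends = [bool(re.search(r"Python\Z", s))
        for s in ["I like Python", "I like Python.", "Python is fun."]]

print(digits, non_digits, spaces)
print(non_spaces, word_chars, non_word)
print(starts)  # [True, False]
print(ends)    # [True, False, False]
```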