Regular Expressions in Python


A regular expression is a special text string that describes a search pattern. It is extremely useful for extracting information from text such as code, files, logs, spreadsheets, or even documents.


The most common uses of regular expressions are:

  • Searching a string (search and match)
  • Finding all occurrences of a pattern (findall)
  • Breaking a string into substrings (split)
  • Replacing part of a string (sub)
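
As a quick illustration of these four uses, here is a minimal sketch using Python's built-in re module. The sample text and variable names are invented for this example.

```python
import re

# Sample text invented for this illustration.
text = "cat bat rat"

# search: find the first place the pattern matches (match object or None).
first = re.search(r"[bcr]at", text)
print(first.group())

# findall: collect every non-overlapping match as a list of strings.
all_matches = re.findall(r"[bcr]at", text)
print(all_matches)

# split: break the string wherever the pattern matches.
pieces = re.split(r"\s", text)
print(pieces)

# sub: replace each match with another string.
replaced = re.sub(r"[bcr]at", "hat", text)
print(replaced)
```

Each function takes the pattern first and the string to examine second; sub additionally takes the replacement text in between.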

Special Symbols and MetaCharacters

A regular expression is formed from a mix of ordinary characters, metacharacters, and special sequences.

MetaCharacters

Metacharacters are characters that a RegEx engine interprets in a special way instead of matching them literally. Here's a list of metacharacters:

Metacharacter Description
[] Matches any one character from the set inside the brackets.
. Matches any single character except a newline.
^ Matches at the beginning of the string.
$ Matches at the end of the string.
* Matches zero or more occurrences of the preceding pattern.
+ Matches one or more occurrences of the preceding pattern.
{ } Matches a specified number of occurrences of the preceding pattern.
| Matches either the pattern on its left or the pattern on its right.
( ) Captures and groups sub-patterns.
\ Escapes a metacharacter or introduces a special sequence.

[ ] - Square brackets

Square brackets specify a set of characters you wish to match.

Here, [abc] will match if the string you are trying to match contains any of a, b, or c.

You can also specify a range of characters using - inside square brackets.

[a-e] is the same as [abcde]
[1-4] is the same as [1234]
[0-39] is the same as [01239]

You can complement (invert) the character set by placing a caret ^ immediately after the opening square bracket.

[^abc] means any character except a or b or c.
[^0-9] means any non-digit character.
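
The character sets above can be tried out with a short sketch like the following. The sample string is made up for illustration; re.findall returns every character that the set matches.

```python
import re

sample = "a1b2z9"  # invented sample string

in_set = re.findall(r"[abc]", sample)       # characters a, b or c
in_range = re.findall(r"[1-4]", sample)     # digits 1 through 4
mixed = re.findall(r"[0-39]", sample)       # digits 0 through 3, or 9
non_digits = re.findall(r"[^0-9]", sample)  # anything that is not a digit

print(in_set, in_range, mixed, non_digits)
```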

. - Period

A period matches any single character (except newline '\n').

Expression	String		Matched?
 a...n		abn		No match
		alian		Match
		abysn		Match
		Alian		No match
		An abacus	No match
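
The table above can be reproduced with a small sketch. It assumes re.search, which looks for the pattern anywhere in the string; the dictionary name results is chosen only for this example.

```python
import re

pattern = r"a...n"  # an "a", any three characters, then an "n"
strings = ["abn", "alian", "abysn", "Alian", "An abacus"]

# Map each test string to True (match found) or False (no match).
results = {s: re.search(pattern, s) is not None for s in strings}

for s, matched in results.items():
    print(f"{s!r}: {'Match' if matched else 'No match'}")
```

Note that "Alian" fails because matching is case-sensitive by default, so the uppercase A does not satisfy the lowercase a in the pattern.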

^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

Expression String Matched?

^a	a	Match
	abc	Match
	bac	No match
^ab	abc	Match
	acb	No match (starts with a, not followed by b)

$ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

Expression String Matched?

a$	a		Match
	formula		Match
	cab		No match
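
The two anchors, ^ for the start and $ for the end of a string, can be checked with a sketch like this one. It filters the strings from the tables using re.search; the list names are illustrative.

```python
import re

# Keep only the strings that begin with "a".
starts_with_a = [s for s in ["a", "abc", "bac"] if re.search(r"^a", s)]

# Keep only the strings that begin with "ab".
starts_with_ab = [s for s in ["abc", "acb"] if re.search(r"^ab", s)]

# Keep only the strings that end with "a".
ends_with_a = [s for s in ["a", "formula", "cab"] if re.search(r"a$", s)]

print(starts_with_a, starts_with_ab, ends_with_a)
```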

* - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

Expression String Matched?

ma*n	mn	Match
	man	Match
	maaan	Match
	main	No match (a is not followed by n)
	woman	Match

+ - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

Expression String Matched?

ma+n	mn	No match (no a character)
	man	Match
	maaan	Match
	main	No match (a is not followed by n)
	woman	Match
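
To see the difference between * and + side by side, the following sketch runs both patterns over the same words used in the tables. The list names are chosen for this example only.

```python
import re

words = ["mn", "man", "maaan", "main", "woman"]

# ma*n: "m", then zero or more "a", then "n".
star_matches = [w for w in words if re.search(r"ma*n", w)]

# ma+n: "m", then at least one "a", then "n".
plus_matches = [w for w in words if re.search(r"ma+n", w)]

print(star_matches)
print(plus_matches)
```

The only word that separates the two patterns is "mn", which has no a between m and n.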

{ } - Braces

The form {n,m} means at least n and at most m repetitions of the pattern to its left. A single number, {n}, means exactly n repetitions.

Expression String Matched?

a{2,3}	abc dat		No match
	abc daat	1 match (at daat)
	aabc daaat	2 matches (at aabc and daaat)
	aabc daaaat	2 matches (at aabc and daaaat)
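
The match counts in this table can be confirmed with re.findall, as in the sketch below. The dictionary name found is illustrative. The quantifier is greedy, so a run of four a's yields one match of three and leaves a single a that is too short to match.

```python
import re

strings = ["abc dat", "abc daat", "aabc daaat", "aabc daaaat"]

# Map each string to the list of runs of two or three consecutive "a"s.
found = {s: re.findall(r"a{2,3}", s) for s in strings}

for s, matches in found.items():
    print(f"{s!r}: {matches}")
```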

| - Alternation

The vertical bar | is used for alternation (the "or" operator).

Expression String Matched?

a|b	cde	No match
	ade	1 match (match at ade)
	acdbea	3 matches (at acdbea)

Here, a|b matches any string that contains either a or b.
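
A short sketch of alternation with re.findall, using the strings from the table (the variable names are illustrative):

```python
import re

none_found = re.findall(r"a|b", "cde")      # neither a nor b present
one_found = re.findall(r"a|b", "ade")       # a single a
three_found = re.findall(r"a|b", "acdbea")  # a, then b, then a

print(none_found, one_found, three_found)
```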

( ) - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.

Expression String Matched?

(a|b|c)xz	ab xz		No match
		abxz		1 match (match at abxz)
		axz cabxz	2 matches (axz and bxz in axz cabxz)
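
The grouping example can be sketched as follows. One detail worth knowing: when a pattern contains a capturing group, re.findall returns the group contents rather than the whole match, so this sketch uses re.finditer and reads both group(0) (the whole match) and group(1) (the first group). The helper name matches_for is hypothetical.

```python
import re

pattern = r"(a|b|c)xz"

def matches_for(s):
    """Return (whole match, captured group) pairs for every match in s."""
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, s)]

no_match = matches_for("ab xz")         # the space breaks the sequence
one_match = matches_for("abxz")         # only "bxz" fits the pattern
two_matches = matches_for("axz cabxz")  # "axz" and "bxz"

print(no_match, one_match, two_matches)
```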

\ - Backslash

Backslash \ is used to escape various characters including all metacharacters.

For example,
\$a matches if a string contains $ followed by a. Here, $ is not interpreted by a RegEx engine in a special way.

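If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is treated literally. Do not do this with letters or digits, because a backslash followed by a letter or digit can form a special sequence (such as \d) with a meaning of its own.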
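
The sketch below shows escaping in practice. The sample string is invented; re.escape is the standard-library helper that adds the backslashes for you when the text to match comes from elsewhere.

```python
import re

price = "cost: $a and $b"  # invented sample string

# Escaped: "$" is matched literally, followed by "a".
escaped_hit = re.search(r"\$a", price)

# Unescaped: "$" means end of string, so nothing can follow it.
unescaped_hit = re.search(r"$a", price)

# re.escape builds a pattern that matches the given text literally.
literal_pattern = re.escape("$a")
literal_hit = re.search(literal_pattern, price)

print(escaped_hit.group(), unescaped_hit, literal_pattern)
```

Writing patterns as raw strings (the r prefix) keeps Python itself from interpreting the backslash before the RegEx engine sees it.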

Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

Character Description
\A Matches only at the beginning of the string.
\d Matches any decimal digit.
\D Matches any character that is not a decimal digit.
\s Matches any whitespace character.
\S Matches any non-whitespace character.
\w Matches any word character (letter, digit, or underscore).
\W Matches any non-word character.
\Z Matches only at the end of the string.

☞ \A

- Matches if the specified characters are at the start of a string.

Expression String Matched?

\Athe	the sun		Match
	In the sun	No match
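
A minimal sketch of \A follows, including one way it differs from ^. The two-line sample string is invented for this example. With the re.MULTILINE flag, ^ also matches at the start of every line, while \A still matches only at the very start of the string.

```python
import re

at_start = re.search(r"\Athe", "the sun")          # match object
not_at_start = re.search(r"\Athe", "In the sun")   # None

two_lines = "the sun\nthe moon"
caret_hits = re.findall(r"^the", two_lines, re.MULTILINE)
a_hits = re.findall(r"\Athe", two_lines, re.MULTILINE)

print(at_start is not None, not_at_start is not None)
print(caret_hits, a_hits)
```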

☞ \d

- Matches any decimal digit. For ASCII text this is equivalent to [0-9].

Expression String Matched?

\d	12abc3		3 matches (at 12abc3)
	Python		No match

☞ \D

- Matches any character that is not a decimal digit. For ASCII text this is equivalent to [^0-9].

Expression String Matched?

\D	1ab34"50	3 matches (at 1ab34"50)
	1345		No match
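
The \d and \D tables can be checked with re.findall, as in this sketch (the variable names are illustrative):

```python
import re

digits = re.findall(r"\d", "12abc3")          # every digit
no_digits = re.findall(r"\d", "Python")       # no digits at all

non_digits = re.findall(r"\D", '1ab34"50')    # every non-digit character
all_digits = re.findall(r"\D", "1345")        # only digits, so nothing found

print(digits, no_digits, non_digits, all_digits)
```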

☞ \s

- Matches any whitespace character. For ASCII text this is equivalent to [ \t\n\r\f\v].

Expression String Matched?

\s	Python RegEx		1 match
	PythonRegEx		No match

☞ \S

- Matches any non-whitespace character. For ASCII text this is equivalent to [^ \t\n\r\f\v].

Expression String Matched?

\S	a b		2 matches (at a b)
	(spaces only)	No match
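
A short sketch of \s and \S using re.findall. The whitespace-only string here is assumed to be three spaces, which is an illustrative choice.

```python
import re

spaces = re.findall(r"\s", "Python RegEx")     # the single space
no_spaces = re.findall(r"\s", "PythonRegEx")   # no whitespace present

visible = re.findall(r"\S", "a b")             # the two letters
blank = re.findall(r"\S", "   ")               # whitespace only, nothing found

print(spaces, no_spaces, visible, blank)
```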

☞ \w

- Matches any word character: letters, digits, and the underscore _. For ASCII text this is equivalent to [a-zA-Z0-9_].

Expression String Matched?

\w	12&": ;c	3 matches (at 12&": ;c)
	%"> !		No match

☞ \W

- Matches any non-word character. For ASCII text this is equivalent to [^a-zA-Z0-9_].

Expression String Matched?

\W	1a2%c		1 match (at 1a2%c)
	Python		No match
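
The sketch below reproduces the \w and \W tables and adds one check that goes slightly beyond them. In Python 3, \w on ordinary (str) patterns also matches non-ASCII letters unless the re.ASCII flag is given. The word "caf\u00e9" is an invented sample.

```python
import re

word_chars = re.findall(r"\w", '12&": ;c')   # digits and the letter c
no_words = re.findall(r"\w", '%"> !')        # punctuation and spaces only

non_word = re.findall(r"\W", "1a2%c")        # only the percent sign
with_underscore = re.findall(r"\w", "a_b")   # underscore counts as a word character

# "\u00e9" is the letter e with an acute accent.
unicode_default = re.findall(r"\w", "caf\u00e9")
ascii_only = re.findall(r"\w", "caf\u00e9", re.ASCII)

print(word_chars, no_words, non_word, with_underscore)
print(unicode_default, ascii_only)
```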

☞ \Z

- Matches if the specified characters are at the end of a string.

Expression String Matched?

Python\Z	I like Python		1 match
		I like Python.		No match
		Python is fun.		No match
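
A minimal sketch of \Z, written with \Z placed after the characters it anchors. It also shows a small difference from $, which additionally matches just before a trailing newline. The dictionary name ends_with_python is illustrative.

```python
import re

strings = ["I like Python", "I like Python.", "Python is fun."]

# True only when "Python" is the very last thing in the string.
ends_with_python = {s: re.search(r"Python\Z", s) is not None for s in strings}

# With a trailing newline, $ still matches but \Z does not.
dollar_hit = re.search(r"Python$", "I like Python\n")
z_hit = re.search(r"Python\Z", "I like Python\n")

print(ends_with_python)
print(dollar_hit is not None, z_hit is not None)
```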

