Python Regular Expression

Updated on Jan 07, 2020

Regular expressions are widely used for pattern matching. Python has built-in support for regular expressions through the re module. To use regular expressions, you need to import the re module.

import re

Now you are ready to use regular expressions.

re.search() Method #

re.search() finds the first match for the pattern in the string.

Syntax: re.search(pattern, string, flags=0)

re.search() accepts a pattern and a string (plus optional flags) and returns a match object on success or None if no match is found. The match object has a group() method which returns the matching text.

You should specify the pattern as a raw string, i.e. prefix the string literal with r, like this:

r'this \n'

In a raw string, backslash escape sequences lose their special meaning to Python, so \n is not a newline character; it is just a backslash \ followed by an n. This lets the backslash reach the regular expression engine unchanged.
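To see the difference concretely, the short sketch below compares a normal string with a raw string. The variable names normal and raw are only illustrative.

```python
# A normal string turns \n into one newline character;
# a raw string keeps the backslash and the n as two characters.
normal = 'this \n'
raw = r'this \n'

print(len(normal))        # 6: t, h, i, s, space, newline
print(len(raw))           # 7: t, h, i, s, space, backslash, n
print(raw == 'this \\n')  # True: same as escaping the backslash by hand
```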

>>> import re
>>> s = "my number is 123"
>>> match = re.search(r'\d\d\d', s)
>>> match
<re.Match object; span=(13, 16), match='123'>
>>> match.group()
'123'

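Above we have used \d\d\d as the pattern. \d in a regular expression matches a single digit, so \d\d\d matches three consecutive digits such as 111, 222, or 786. It will not match 12, which has only two digits. Because re.search() scans the whole string, in 1444 it matches only the first three digits, 144, not the whole number.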
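A quick check of how \d\d\d behaves on shorter and longer runs of digits. re.search() scans for the pattern anywhere in the string, so a four-digit string still contains a three-digit match. The sketch also uses re.fullmatch(), which is not covered in this article, to require that the entire string fits the pattern.

```python
import re

# re.search() looks for three digits anywhere in the string.
short = re.search(r'\d\d\d', "12")       # only two digits, so no match
longer = re.search(r'\d\d\d', "1444")    # the first three digits match

# re.fullmatch() succeeds only if the whole string matches the pattern.
exact = re.fullmatch(r'\d\d\d', "1444")  # four digits, so no match

print(short)           # None
print(longer.group())  # 144
print(exact)           # None
```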

Basic patterns used in regular expression #

Symbol	Description
`.`	matches any single character except newline
`\w`	matches a single word character, i.e. a letter, digit or underscore (`_`)
`\W`	matches a single non-word character
`\d`	matches a single digit
`\D`	matches a single character that is not a digit
`\s`	matches a single whitespace character like `\n`, `\t` or a space
`\S`	matches a single non-whitespace character
`[abc]`	matches a single character in the set, i.e. either `a`, `b` or `c`
`[^abc]`	matches a single character other than `a`, `b` and `c`
`[a-z]`	matches a single character in the range `a` to `z`
`[a-zA-Z]`	matches a single character in the range `a`-`z` or `A`-`Z`
`[0-9]`	matches a single character in the range `0`-`9`
`^`	matches at the beginning of the string
`$`	matches at the end of the string (or just before a trailing newline)
`+`	matches one or more of the preceding element (greedy match)
`*`	matches zero or more of the preceding element (greedy match)
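The following sketch tries a few of the symbols from the table on one sample string. The sample text and variable names are made up for illustration.

```python
import re

s = "Order 66 shipped to Bob_7 at 10:45"

number = re.search(r'\d+', s).group()              # one or more digits
first_word = re.search(r'^\w+', s).group()         # word characters at the start
time = re.search(r'\d\d:\d\d$', s).group()         # digits around a colon at the end
name = re.search(r'[A-Z][a-z]+_[0-9]', s).group()  # sets and ranges combined

print(number)      # 66
print(first_word)  # Order
print(time)        # 10:45
print(name)        # Bob_7
```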

Let's take one more example:

import re
s = "tim email is tim@somehost.com"
match = re.search(r'[\w.-]+@[\w.-]+', s)

# the above regular expression will match an email address

if match:
    print(match.group())
else:
    print("match not found")

Here we have used the pattern [\w.-]+@[\w.-]+ to match an email address. This is a simplified pattern and does not validate every legal address. On success, re.search() returns a match object, and its group() method returns the matching text.

Group capturing #

Group capturing lets you extract parts of the matching text. You create groups using parentheses (). Suppose we want to extract the username and host name from the email address in the above example. To do this, we add () around the username and host name parts of the pattern like this:

match = re.search(r'([\w.-]+)@([\w.-]+)', s)

Note that the parentheses do not change what the pattern matches. If the match is successful, match.group(1) returns the text matched by the first pair of parentheses and match.group(2) returns the text matched by the second pair.

import re
s = "tim email is tim@somehost.com"
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    print(match.group()) ## tim@somehost.com (the whole match)
    print(match.group(1)) ## tim (the username, group 1)
    print(match.group(2)) ## somehost.com (the host, group 2)
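If you want all captured groups at once, the match object also provides a groups() method, which returns them as a tuple. This method is not covered above; the sketch below is just one convenient way to unpack the result.

```python
import re

s = "tim email is tim@somehost.com"
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    # groups() returns every captured group as a tuple, in order.
    username, host = match.groups()
    print(username)  # tim
    print(host)      # somehost.com
```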

findall() Function #

As you know by now, re.search() finds only the first match for the pattern. To find all the matches in a string, use findall().

Syntax: re.findall(pattern, string, flags=0)

It returns all non-overlapping matches as a list of strings, or an empty list if there are no matches.

import re
s = "Tim's phone numbers are 12345-41521 and 78963-85214"
match = re.findall(r'\d{5}', s)

if match:
    print(match)

Expected Output:

['12345', '41521', '78963', '85214']

You can also use group capturing with findall(). When the pattern contains more than one group, findall() returns a list of tuples, where each tuple holds the text matched by each group. The following example makes this clear.

import re
s = "Tim's phone numbers are 12345-41521 and 78963-85214"
match = re.findall(r'(\d{5})-(\d{5})', s)
print(match)

for i in match:
    print()
    print(i)
    print("First group", i[0])
    print("Second group", i[1])

Expected Output:

[('12345', '41521'), ('78963', '85214')]

('12345', '41521')
First group 12345
Second group 41521

('78963', '85214')
First group 78963
Second group 85214
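One detail worth checking: when the pattern contains exactly one group, findall() returns a plain list of strings holding only that group's text, not a list of tuples. A small sketch, where first_halves is an illustrative name:

```python
import re

s = "Tim's phone numbers are 12345-41521 and 78963-85214"

# Only the first five digits are wrapped in a group, so only they are returned.
first_halves = re.findall(r'(\d{5})-\d{5}', s)
print(first_halves)  # ['12345', '78963']
```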

Optional flags #

Both re.search() and re.findall() accept an optional parameter called flags. Flags modify the behavior of the pattern matching.

Flags	Description
`re.IGNORECASE`	Performs case-insensitive matching
`re.DOTALL`	Allows (`.`) to match newline; by default (`.`) matches any character except newline
`re.MULTILINE`	Allows `^` and `$` to match at the start and end of each line
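The sketch below shows each flag changing the result on the same two-line string. The sample text and variable names are made up for illustration.

```python
import re

text = "Python\npython rocks"

# re.IGNORECASE: match regardless of letter case.
plain = re.findall(r'python', text)
nocase = re.findall(r'python', text, re.IGNORECASE)

# re.DOTALL: let . match the newline between the two lines.
no_dotall = re.search(r'Python.python', text)
dotall = re.search(r'Python.python', text, re.DOTALL)

# re.MULTILINE: let ^ match at the start of every line.
one_line = re.findall(r'^\w+', text)
each_line = re.findall(r'^\w+', text, re.MULTILINE)

print(plain, nocase)        # ['python'] ['Python', 'python']
print(no_dotall)            # None
print(repr(dotall.group())) # 'Python\npython'
print(one_line, each_line)  # ['Python'] ['Python', 'python']
```

Flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.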

Using re.match() #

re.match() is very similar to re.search(). The difference is that it only looks for a match at the beginning of the string.

import re
s = "python tuts"
match = re.match(r'py', s)
if match:
    print(match.group())

You can accomplish the same thing by adding ^ to the start of a pattern with re.search() (as long as the re.MULTILINE flag is not set).

import re
s = "python tuts"
match = re.search(r'^py', s)
if match:
    print(match.group())
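The difference between the two functions shows up when the text you are looking for is not at the start of the string. A minimal sketch with illustrative variable names:

```python
import re

s = "python tuts"

at_start = re.match(r'tuts', s)      # "tuts" is not at the beginning
anywhere = re.search(r'tuts', s)     # search scans the whole string
anchored = re.search(r'^tuts', s)    # ^ makes search behave like match here

print(at_start)          # None
print(anywhere.group())  # tuts
print(anchored)          # None
```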

This covers the basics of the re module in Python.