Python Regular Expression
Updated on Jan 07, 2020
Regular expressions are widely used for pattern matching. Python has built-in support for regular expressions. To use them, you need to import the re module.
import re
Now you are ready to use regular expressions.
re.search() Method #
The re.search() function finds the first match for the pattern in the string.
Syntax: re.search(pattern, string, flags[optional])
The re.search() function accepts a pattern and a string and returns a match object on success, or None if no match is found. The match object has a group() method, which returns the matched text.
You should specify the pattern as a raw string, i.e. by prefixing the string literal with r, like this:
r'this \n'
In a raw string, Python does not process backslash escape sequences, so \n is not a newline character; it is just a backslash \ followed by an n. This lets the backslash reach the regular expression engine unchanged.
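The difference between a normal string and a raw string can be checked directly. The snippet below is a minimal sketch; the \b token (a word boundary in regular expressions) is not covered in this tutorial and is used here only as an assumption-labelled illustration of why raw strings matter for patterns.

```python
import re

# A normal string literal: Python turns \n into one newline character.
normal = 'this \n'
# A raw string literal: the backslash and the n stay as two characters.
raw = r'this \n'

print(len(normal))  # 6
print(len(raw))     # 7

# Illustration only: in a normal string, '\b' is the backspace character,
# while in a raw string it reaches re as the word-boundary token.
plain_match = re.search('\bcat\b', 'a cat sat')
raw_match = re.search(r'\bcat\b', 'a cat sat')

print(plain_match)        # None
print(raw_match.group())  # cat
```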
>>> import re
>>> s = "my number is 123"
>>> match = re.search(r'\d\d\d', s)
>>> match
<_sre.SRE_Match object; span=(13, 16), match='123'>
>>> match.group()
'123'
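Above we used \d\d\d as the pattern. In a regular expression, \d matches a single digit, so \d\d\d matches three consecutive digits such as 111, 222, or 786. It will not match 12, which has only two digits. In 1444, re.search() would match only the first three digits, 144.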
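The behaviour of \d\d\d on a few inputs can be sketched as follows. Because re.search() looks for the pattern anywhere in the string, a longer run of digits still produces a three-digit match. The use of re.fullmatch() to require that the whole string is exactly three digits goes beyond what this tutorial covers and is shown only for contrast.

```python
import re

pattern = r'\d\d\d'
results = {}

for text in ["111", "786", "12", "1444"]:
    found = re.search(pattern, text)     # looks anywhere in the string
    exact = re.fullmatch(pattern, text)  # must span the whole string
    results[text] = (found.group() if found else None, exact is not None)

for text, (found_text, is_exact) in results.items():
    print(text, found_text, is_exact)
# 111 111 True
# 786 786 True
# 12 None False
# 1444 144 False
```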
Basic patterns used in regular expression #
Symbol | Description |
---|---|
. | matches any character except newline |
\w | matches a single word character, i.e. a letter, digit, or underscore (_) |
\W | matches a single non-word character |
\d | matches a single digit |
\D | matches a single character that is not a digit |
\s | matches a single whitespace character such as \n, \t, or a space |
\S | matches a single non-whitespace character |
[abc] | matches a single character in the set, i.e. a, b, or c |
[^abc] | matches a single character other than a, b, and c |
[a-z] | matches a single character in the range a to z |
[a-zA-Z] | matches a single character in the range a-z or A-Z |
[0-9] | matches a single character in the range 0-9 |
^ | matches at the beginning of the string |
$ | matches at the end of the string |
+ | matches one or more of the preceding character (greedy match) |
* | matches zero or more of the preceding character (greedy match) |
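A quick way to get a feel for these symbols is to try each one against the same string. The sketch below uses a small helper named first(), which is hypothetical and not part of the re module; the sample string is also made up for illustration.

```python
import re

def first(pattern, text):
    """Hypothetical helper: return the first matched text, or None."""
    match = re.search(pattern, text)
    return match.group() if match else None

s = "Area_51 opens at 9"

patterns = [
    r'.',          # any character           -> 'A'
    r'\w+',        # run of word characters  -> 'Area_51'
    r'\d+',        # one or more digits      -> '51'
    r'\d*',        # zero or more digits     -> '' (empty match at position 0)
    r'\s',         # a whitespace character  -> ' '
    r'[A-Z]',      # one uppercase letter    -> 'A'
    r'[^A-Za-z]',  # first non-letter        -> '_'
    r'^Area',      # anchored at the start   -> 'Area'
    r'^opens',     # not at the start        -> None
    r'\d$',        # digit at the end        -> '9'
]

for p in patterns:
    print(p, '->', repr(first(p, s)))
```

Note how \d* succeeds with an empty match, because zero digits at the very start of the string already satisfies it.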
Let's take one more example:
import re
s = "tim email is tim@somehost.com"
match = re.search(r'[\w.-]+@[\w.-]+', s)
# the above regular expression will match an email address
if match:
    print(match.group())
else:
    print("match not found")
Here we have used the pattern [\w.-]+@[\w.-]+ to match an email address. On success, re.search() returns a match object, and its group() method returns the matched text.
Group capturing #
Group capturing lets you extract parts of the matched text. You create groups using parentheses (). Suppose we want to extract the username and host name from the email address in the above example. To do this, we add () around the username and host name parts of the pattern, like this:
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
Note that the parentheses do not change what the pattern matches. If the match is successful, match.group(1) returns the text matched by the first pair of parentheses and match.group(2) returns the text matched by the second pair.
import re
s = "tim email is tim@somehost.com"
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    print(match.group())   ## tim@somehost.com (the whole match)
    print(match.group(1))  ## tim (the username, group 1)
    print(match.group(2))  ## somehost.com (the host, group 2)
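As a small extension of the example above, the match object also offers groups(), which returns all captured groups at once as a tuple, and group(0), which is the same as group() with no argument. This is a sketch rather than a required pattern; the variable names username and host are chosen here for illustration.

```python
import re

s = "tim email is tim@somehost.com"
match = re.search(r'([\w.-]+)@([\w.-]+)', s)

if match:
    # groups() returns every captured group as a tuple, in order.
    username, host = match.groups()
    print(username)        # tim
    print(host)            # somehost.com
    # group(0) is the whole match, identical to group().
    print(match.group(0))  # tim@somehost.com
```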
findall() Function #
As you know by now, re.search() finds only the first match for the pattern. To find all the matches in a string, use findall().
Syntax: re.findall(pattern, string, flags=0[optional])
On success, it returns all the matches as a list of strings, otherwise an empty list.
import re
s = "Tim's phone numbers are 12345-41521 and 78963-85214"
match = re.findall(r'\d{5}', s)
if match:
    print(match)
Expected Output:
['12345', '41521', '78963', '85214']
You can also use group capturing with findall(). When the pattern contains more than one group, findall() returns a list of tuples, where each tuple contains the text matched by each group. An example makes this clearer.
import re
s = "Tim's phone numbers are 12345-41521 and 78963-85214"
match = re.findall(r'(\d{5})-(\d{5})', s)
print(match)
for i in match:
    print()
    print(i)
    print("First group", i[0])
    print("Second group", i[1])
Expected Output:
[('12345', '41521'), ('78963', '85214')]

('12345', '41521')
First group 12345
Second group 41521

('78963', '85214')
First group 78963
Second group 85214
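The shape of the list returned by findall() depends on how many groups the pattern contains. The following sketch compares a pattern with no groups, one group, and two groups on the same string; the variable names are chosen here for illustration.

```python
import re

s = "Tim's phone numbers are 12345-41521 and 78963-85214"

# No groups: each item is the whole matched text.
no_groups = re.findall(r'\d{5}-\d{5}', s)
# One group: each item is the text of that single group.
one_group = re.findall(r'(\d{5})-\d{5}', s)
# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\d{5})-(\d{5})', s)

print(no_groups)   # ['12345-41521', '78963-85214']
print(one_group)   # ['12345', '78963']
print(two_groups)  # [('12345', '41521'), ('78963', '85214')]
```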
Optional flags #
Both re.search() and re.findall() accept an optional parameter called flags. Flags modify the behavior of the pattern matching.
Flags | Description |
---|---|
re.IGNORECASE | Makes matching case-insensitive |
re.DOTALL | Allows (.) to match a newline; by default (.) matches any character except a newline |
re.MULTILINE | Allows ^ and $ to match at the start and end of each line |
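The effect of each flag is easiest to see on a multi-line string. The sketch below uses a made-up sample text and also combines two flags with the | operator, which the table above does not mention.

```python
import re

text = "Python\npython rocks\nPYTHON"

# Without flags, only the exact lowercase spelling matches.
plain = re.findall(r'python', text)
# re.IGNORECASE matches every spelling regardless of case.
ignore_case = re.findall(r'python', text, re.IGNORECASE)
# ^ normally matches only at the start of the whole string...
start_only = re.findall(r'^python', text, re.IGNORECASE)
# ...but with re.MULTILINE it matches at the start of every line.
each_line = re.findall(r'^python', text, re.IGNORECASE | re.MULTILINE)
# The dot does not match a newline unless re.DOTALL is given.
no_dotall = re.search(r'Python.python', text)
with_dotall = re.search(r'Python.python', text, re.DOTALL)

print(plain)                       # ['python']
print(ignore_case)                 # ['Python', 'python', 'PYTHON']
print(start_only)                  # ['Python']
print(each_line)                   # ['Python', 'python', 'PYTHON']
print(no_dotall)                   # None
print(repr(with_dotall.group()))   # 'Python\npython'
```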
Using re.match() #
The re.match() function is very similar to re.search(). The difference is that it only looks for a match at the beginning of the string.
import re
s = "python tuts"
match = re.match(r'py', s)
if match:
    print(match.group())
You can accomplish the same thing by adding ^ to the start of the pattern with re.search().
import re
s = "python tuts"
match = re.search(r'^py', s)
if match:
    print(match.group())
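The difference between re.match() and re.search() shows up when the text you are looking for is not at the start of the string. A minimal sketch, reusing the same sample string with illustrative variable names:

```python
import re

s = "python tuts"

# re.match() only succeeds when the pattern matches at position 0.
match_start = re.match(r'py', s)
match_later = re.match(r'tuts', s)
# re.search() scans the whole string for the first match.
search_later = re.search(r'tuts', s)
# Anchoring with ^ makes re.search() behave like re.match() here.
search_anchored = re.search(r'^tuts', s)

print(match_start.group())   # py
print(match_later)           # None
print(search_later.group())  # tuts
print(search_anchored)       # None
```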
This covers the basics of the re module in Python.