Python RegEx
Introduction
Python is a general-purpose programming language that owes its popularity to its simplicity and readability. It is widely used in web development, data science, and machine learning.
Python supports regular expressions (RegEx) for searching, extracting, and manipulating text. Regular expressions are a core part of text processing because they can describe complex patterns in strings.
A regular expression can be defined as a sequence of characters that specifies a search pattern. In Python, regular expressions are implemented using the re module.
Matching Patterns using Python RegEx
Python RegEx can match patterns in various ways. Here are some common ways to match patterns:
1. Search Function
The search() function in Python RegEx is used to find the first occurrence of a pattern in a string. Here is an example:
```
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = "brown"

# re.search() returns a match object, or None if nothing matches
result = re.search(pattern, string)
if result:
    print("Pattern found!")
else:
    print("Pattern not found!")
```
The search function returns a match object if it finds the pattern and None if it does not. You can use the match object to get more information about the match, such as the matched text and its position in the string.
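As a minimal sketch of what the match object offers, the example below reuses the sentence from above and calls a few of its standard methods. The variable names are only illustrative.

```python
import re

text = "The quick brown fox jumps over the lazy dog"
match = re.search(r"brown", text)

if match:
    print(match.group())  # the matched text: brown
    print(match.start())  # index where the match begins: 10
    print(match.end())    # index just past the match: 15
    print(match.span())   # (start, end) as a tuple: (10, 15)
```

Because search() returns None when nothing matches, checking the result before calling these methods avoids an AttributeError.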
2. Findall Function
The findall() function in Python RegEx is used to find all occurrences of a pattern in a string. Here is an example:
```
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = r"\w+"  # raw string, so the backslash reaches the regex engine unchanged

result = re.findall(pattern, string)
print(result)
```
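The findall function returns a list of all non-overlapping occurrences of the pattern in the string. In this example, the pattern \w+ matches each run of one or more word characters (letters, digits, and underscores), so the result is a list of the nine words in the sentence.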
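One detail worth seeing in code is what findall() returns when nothing matches. The sketch below uses a made-up sample string and the \d+ pattern (one or more digits), neither of which comes from the example above.

```python
import re

text = "Order 66 shipped 3 boxes on day 12"  # hypothetical sample text

numbers = re.findall(r"\d+", text)               # every run of digits
nothing = re.findall(r"\d+", "no digits here")   # no match at all

print(numbers)  # ['66', '3', '12']
print(nothing)  # []
```

Unlike search(), which returns None on failure, findall() returns an empty list, so the result can be looped over without a separate check.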
3. Sub Function
The sub() function in Python RegEx is used to substitute occurrences of a pattern in a string with a new string. Here is an example:
```
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = r"\s"  # any single whitespace character

# Replace every match of the pattern with a hyphen
result = re.sub(pattern, "-", string)
print(result)
```
The sub function returns a new string with all occurrences of the pattern replaced with the replacement string. In this example, the pattern matches any whitespace character (\s) and replaces it with a hyphen (-).
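If only some occurrences should be replaced, sub() accepts an optional count argument, and the related subn() function also reports how many replacements were made. The following is a small sketch on a shortened sample string.

```python
import re

text = "The quick brown fox"

# count=1 limits the substitution to the first match only
first_only = re.sub(r"\s", "-", text, count=1)
print(first_only)  # The-quick brown fox

# subn() returns the new string together with the number of replacements
new_text, replaced = re.subn(r"\s", "-", text)
print(new_text, replaced)  # The-quick-brown-fox 3
```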
Python RegEx Metacharacters
Python RegEx supports various metacharacters that can be used to define complex patterns. Here are some of the commonly used Python RegEx metacharacters:
1. Dot (.)
The dot (.) metacharacter matches any character except for a newline.
```
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = "q..ck"  # 'q', any two characters, then 'ck'

result = re.search(pattern, string)
print(result)
```
This matches ‘q’, followed by any two characters, followed by ‘ck’, so it finds “quick” in “The quick brown fox jumps over the lazy dog”.
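The newline exception can be checked directly. The sketch below uses a short made-up pattern, a.c, and shows that the re.DOTALL flag lifts the restriction, while escaping the dot makes it match only a literal period.

```python
import re

pattern = r"a.c"

same_line = bool(re.search(pattern, "abc"))                  # dot matches 'b'
across_newline = bool(re.search(pattern, "a\nc"))            # dot does not match '\n'
with_dotall = bool(re.search(pattern, "a\nc", re.DOTALL))    # now it does

# An escaped dot matches only a literal period
literal_dot = bool(re.search(r"a\.c", "abc"))

print(same_line, across_newline, with_dotall, literal_dot)
# True False True False
```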
2. Caret (^)
The caret (^) metacharacter matches at the beginning of a string.
```
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = "^The"  # 'The' only at the start of the string

result = re.search(pattern, string)
print(result)
```
This will match any string that begins with “The”, such as “The quick brown fox jumps over the lazy dog”.
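By default the caret anchors only at the very start of the string. With the re.MULTILINE flag it also matches at the start of each line. A minimal sketch, assuming a made-up two-line string:

```python
import re

text = "first line\nThe second line"

# Without the flag, ^ only matches at index 0, where the text is "first"
single = re.search(r"^The", text)

# With re.MULTILINE, ^ also matches right after each newline
multi = re.search(r"^The", text, re.MULTILINE)

print(single)         # None
print(multi.start())  # 11
```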
3. Dollar Sign ($)
The dollar sign ($) metacharacter matches at the end of a string.
```
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = "dog$"  # 'dog' only at the end of the string

result = re.search(pattern, string)
print(result)
```
This will match any string that ends with “dog”, such as “The quick brown fox jumps over the lazy dog”.
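Combining the caret and the dollar sign is a common way to check that an entire string fits a pattern, not just part of it. The helper below, is_all_digits, is a hypothetical name used only for this sketch.

```python
import re

def is_all_digits(value):
    """Return True if value consists only of digits (hypothetical helper)."""
    # ^ anchors at the start, \d+ requires one or more digits, $ anchors at the end
    return re.search(r"^\d+$", value) is not None

print(is_all_digits("12345"))  # True
print(is_all_digits("123a5"))  # False
print(is_all_digits(""))       # False
```

Note that $ also matches just before a single trailing newline, so "12345\n" would pass this check. If that matters, re.fullmatch(r"\d+", value) is a stricter alternative.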
4. Asterisk (*)
The asterisk (*) metacharacter matches zero or more occurrences of the preceding element, which may be a literal character or a metacharacter such as the dot.
```
import re

string = "Thquick brwn fox jumps ovr the lzy dg"
pattern = "q.*ck"  # 'q', zero or more of any character, then 'ck'

result = re.search(pattern, string)
print(result)
```
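This matches ‘q’, followed by any number of characters (including none), followed by ‘ck’. In “Thquick brwn fox jumps ovr the lzy dg” the match is “quick”: the asterisk is greedy and takes as many characters as it can, but the only “ck” in the string is the one in “quick”, so the match ends there.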
5. Plus Sign (+)
The plus sign (+) metacharacter matches one or more occurrences of the preceding element, which may be a literal character or a metacharacter such as the dot.
```
import re

string = "Thquck brwn fox jumps ovr the lzy dg"
pattern = "q.+ck"  # 'q', one or more of any character, then 'ck'

result = re.search(pattern, string)
print(result)
```
This matches ‘q’, followed by one or more characters, followed by ‘ck’. In “Thquck brwn fox jumps ovr the lzy dg” the match is “quck”, where the single character “u” satisfies the “one or more” requirement.
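The difference between the asterisk and the plus sign shows up only when the preceding element is absent. The sketch below runs two made-up patterns, ab*c and ab+c, against the same three candidate strings.

```python
import re

results = {}
for candidate in ["ac", "abc", "abbbc"]:
    star = bool(re.search(r"ab*c", candidate))  # zero or more 'b'
    plus = bool(re.search(r"ab+c", candidate))  # one or more 'b'
    results[candidate] = (star, plus)
    print(candidate, star, plus)

# ac True False
# abc True True
# abbbc True True
```

Only "ac", which contains no ‘b’ at all, separates the two: the asterisk accepts it and the plus sign rejects it.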
6. Question Mark (?)
The question mark (?) metacharacter matches zero or one occurrence of the preceding character.
```
import re

string = "The color of the car is red"
pattern = "colou?r"  # the 'u' is optional

result = re.search(pattern, string)
print(result)
```
This matches either “color” or “colour”, because the question mark makes the “u” optional. Here it finds “color” in “The color of the car is red”.
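To see both spellings matched by the same pattern, the sketch below applies findall() to a made-up sentence that contains each of them.

```python
import re

text = "The color of the car is red, but the colour of the bike is blue"

# The optional 'u' lets one pattern cover both spellings
spellings = re.findall(r"colou?r", text)
print(spellings)  # ['color', 'colour']
```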
Conclusion
Python RegEx is a powerful tool for searching, extracting, and manipulating text data. Its metacharacters let you describe complex patterns concisely, so you can search and extract data from large amounts of text without writing long, hand-rolled parsing code.
Regular expressions are not limited to Python. They are supported by almost all programming languages, including JavaScript, PHP, and Java, so most of what you learn with Python RegEx carries over, although the exact syntax and supported features vary somewhat between languages.