Regular Expressions in Python: A Step-by-Step Guide
Regular expressions (regex) are powerful tools for pattern matching and text manipulation in Python. They allow you to search, match, and replace substrings based on specific patterns. This article covers the fundamentals of regular expressions, the difference between strings and regex strings, the functions in the re module, common expressions, and practical examples for data extraction, culminating in a case study.
Understanding Regular Expressions
A regular expression is a sequence of characters that defines a search pattern. It can include literal characters, metacharacters, and operators to match strings flexibly. Regex is used for validation, extraction, and substitution tasks, such as checking email formats or extracting phone numbers from text.
In Python, regex is handled by the re module, which provides functions to work with patterns.
String vs. Regular Expression String
A regular string is a sequence of characters, like "hello". A regex pattern is also a string, but it is usually written as a raw string (prefixed with r). In a raw string, Python leaves backslashes (\) as they are instead of interpreting them as escape sequences, so the regex engine receives them unchanged.
# Regular string
s = "\n" # Interpreted as newline
print(len(s)) # Output: 1
# Raw string (regex string)
rs = r"\n" # Treated as backslash and 'n'
print(len(rs)) # Output: 2
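Raw strings let you write regex sequences such as \d (digit) or \b (word boundary) without doubling every backslash. Without the r prefix, "\b" would be read by Python as a backspace character before the regex engine ever saw it.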
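The effect of the r prefix is easiest to see with \b. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "cat concat cat"

# In a normal string, "\b" is the backspace character (ASCII 8),
# so this pattern looks for backspace + "cat" + backspace.
normal_hits = re.findall("\bcat\b", text)

# In a raw string the backslash reaches the regex engine intact,
# where \b means "word boundary".
raw_hits = re.findall(r"\bcat\b", text)

print(normal_hits)  # []
print(raw_hits)     # ['cat', 'cat']
```

The "cat" inside "concat" is skipped by the raw-string pattern because it is not preceded by a word boundary.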
The “re” Module Functions
The re module provides several functions for regex operations. Import it with import re.
match()
re.match(pattern, string) checks for a match only at the beginning of the string.
import re
pattern = r"\d+"
string = "123abc"
match = re.match(pattern, string)
print(match.group() if match else "No match") # Output: 123
search()
re.search(pattern, string) searches for the first occurrence anywhere in the string.
match = re.search(r"\d+", "abc123def")
print(match.group() if match else "No match") # Output: 123
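To make the difference between match() and search() concrete, here is a small sketch that runs both on the same string. It also uses re.fullmatch(), which is not covered above and requires the whole string to fit the pattern.

```python
import re

text = "abc123"

at_start = re.match(r"\d+", text)          # None: the string starts with letters
anywhere = re.search(r"\d+", text)         # finds "123" starting at index 3
whole = re.fullmatch(r"[a-z]+\d+", text)   # the entire string must fit the pattern

print(at_start)                            # None
print(anywhere.group(), anywhere.span())   # 123 (3, 6)
print(whole is not None)                   # True
```

All three functions return None when nothing matches, so check the result before calling group().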
split()
re.split(pattern, string) splits the string by occurrences of the pattern.
result = re.split(r"\s+", "Hello World!")
print(result) # Output: ['Hello', 'World!']
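As a slightly larger sketch, re.split can turn a loosely formatted settings line into a dictionary. The input line and its "key = value ; key = value" layout are invented for this example; maxsplit=1 limits each field to a single split at the first equals sign.

```python
import re

line = "name = Alice ; role=admin; team = core"

# Split on semicolons, swallowing any whitespace around them.
fields = re.split(r"\s*;\s*", line)

# Split each field once on "=", again ignoring surrounding whitespace.
pairs = dict(re.split(r"\s*=\s*", field, maxsplit=1) for field in fields)

print(fields)  # ['name = Alice', 'role=admin', 'team = core']
print(pairs)   # {'name': 'Alice', 'role': 'admin', 'team': 'core'}
```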
findall()
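re.findall(pattern, string) returns all non-overlapping matches as a list of strings. If the pattern contains capturing groups, the list holds the group contents (as tuples when there are several groups) instead of the full matches.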
result = re.findall(r"\d+", "abc123def456")
print(result) # Output: ['123', '456']
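The return value of findall() depends on whether the pattern has capturing groups. The sketch below shows both cases on an illustrative string, along with re.finditer(), which yields match objects when positions are needed too.

```python
import re

text = "x=10, y=20"

# No groups: each item is the full match.
full = re.findall(r"\w=\d+", text)

# Two capturing groups: each item is a tuple of the groups.
grouped = re.findall(r"(\w)=(\d+)", text)

# finditer gives match objects, so the start index is available as well.
located = [(m.group(), m.start()) for m in re.finditer(r"\w=\d+", text)]

print(full)     # ['x=10', 'y=20']
print(grouped)  # [('x', '10'), ('y', '20')]
print(located)  # [('x=10', 0), ('y=20', 6)]
```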
compile()
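re.compile(pattern) compiles a pattern into a reusable pattern object with its own methods (match, search, findall, sub, and so on). Compiling once is convenient when the same pattern is used many times, for example inside a loop; the speed-up is usually modest, because the re module also caches recently used patterns.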
pattern = re.compile(r"\d+")
result = pattern.findall("abc123def456")
print(result) # Output: ['123', '456']
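A compiled pattern can also carry flags. The sketch below assumes a small list of made-up log lines and uses re.IGNORECASE so that one pattern object matches "error" in any letter case.

```python
import re

# Compile once, with a flag, and reuse the object for every line.
WORD_ERROR = re.compile(r"\berror\b", re.IGNORECASE)

lines = ["Error: disk full", "no errors here", "fatal ERROR"]
flagged = [line for line in lines if WORD_ERROR.search(line)]

print(flagged)  # ['Error: disk full', 'fatal ERROR']
```

The second line is not flagged because \b requires a word boundary right after "error", and "errors" continues with another letter.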
sub()
re.sub(pattern, repl, string) replaces occurrences of the pattern with repl.
result = re.sub(r"\d+", "X", "abc123def456")
print(result) # Output: abcXdefX
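The replacement does not have to be a fixed string. In this sketch (sample text invented for illustration), the first call reorders captured groups with backreferences such as \1, and the second passes a function that computes each replacement from the match object.

```python
import re

text = "Due 2023-01-31, paid 2024-12-05"

# \1, \2, \3 refer to the first, second, and third capturing group.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# A callable receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 12 pears")

print(swapped)  # Due 31/01/2023, paid 05/12/2024
print(doubled)  # 6 apples, 24 pears
```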
subn()
re.subn(pattern, repl, string) is like sub() but returns a tuple of the new string and the number of substitutions.
result, count = re.subn(r"\d+", "X", "abc123def456")
print(result, count) # Output: abcXdefX 2
Expressions Using Operators and Symbols
Regex patterns use operators and symbols to define matches.
Simple Character Matches
Literal characters match themselves, e.g., r"abc" matches "abc".
Special Characters
Special characters include:
.: Matches any character except newline.
^: Matches start of string.
$: Matches end of string.
*: Zero or more repetitions.
+: One or more repetitions.
?: Zero or one repetition.
{m,n}: Between m and n repetitions.
|: OR operator.
(): Grouping.
\: Escape special characters.
pattern = r"a.b" # Matches 'a' followed by any char followed by 'b'
print(re.search(pattern, "axb").group()) # Output: axb
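A few more of these symbols in action, as a rough sketch with made-up input strings:

```python
import re

# ? makes the preceding character optional.
optional = re.findall(r"colou?r", "color colour colouur")

# {2,3} allows two or three repetitions of "b" (greedy: takes three if it can).
counted = re.findall(r"ab{2,3}", "ab abb abbb abbbb")

# | chooses between alternatives.
either = re.findall(r"cat|dog", "cat, dog, bird")

# ^ and $ anchor the pattern to the whole string.
all_digits = bool(re.search(r"^\d+$", "12345"))
not_all_digits = bool(re.search(r"^\d+$", "123a5"))

# \. matches a literal dot instead of "any character".
decimals = re.findall(r"\d+\.\d+", "v3.14 and 2x71")

print(optional)                    # ['color', 'colour']
print(counted)                     # ['abb', 'abbb', 'abbb']
print(either)                      # ['cat', 'dog']
print(all_digits, not_all_digits)  # True False
print(decimals)                    # ['3.14']
```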
Character Classes
Character classes define sets of characters:
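[abc]: Matches a, b, or c.
[^abc]: Matches anything except a, b, c.
\d: Digit (0-9; for str patterns also other Unicode decimal digits unless the re.ASCII flag is set).
\D: Non-digit.
\w: Word character (a-z, A-Z, 0-9, _; for str patterns also other Unicode letters and digits unless re.ASCII is set).
\W: Non-word character.
\s: Whitespace.
\S: Non-whitespace.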
pattern = r"\d{3}"
print(re.findall(pattern, "123abc456")) # Output: ['123', '456']
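The classes can be combined and negated. The sketch below uses an invented order string, and it also shows that \d on a str pattern accepts non-ASCII digits (here U+0663, ARABIC-INDIC DIGIT THREE) unless re.ASCII is passed.

```python
import re

text = "Order #A-42 shipped on 2024-05-01!"

# A custom set: lowercase vowels only.
vowels = re.findall(r"[aeiou]", "regex")

# A negated set: anything that is neither a word character nor whitespace.
punctuation = re.findall(r"[^\w\s]", text)

# \S+ grabs runs of non-whitespace, whatever separates them.
tokens = re.findall(r"\S+", "a  b\tc")

# \d is Unicode-aware by default for str patterns; re.ASCII restricts it to 0-9.
unicode_digits = re.findall(r"\d", "7\u0663")
ascii_digits = re.findall(r"\d", "7\u0663", re.ASCII)

print(vowels)               # ['e', 'e']
print(punctuation)          # ['#', '-', '-', '-', '!']
print(tokens)               # ['a', 'b', 'c']
print(len(unicode_digits))  # 2
print(ascii_digits)         # ['7']
```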
Mobile Number Extraction
Pattern for a 10-digit mobile number: r"\b\d{10}\b".
text = "Contact: 1234567890 or 0987654321"
print(re.findall(r"\b\d{10}\b", text)) # Output: ['1234567890', '0987654321']
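Real text often writes numbers with spaces or dashes. The following sketch assumes, purely for illustration, that a single space or dash may appear between digits; it finds such candidates and then strips the separators. Lookarounds are used instead of \b so that a 10-digit slice of a longer number is rejected.

```python
import re

text = "Call 98765-43210, 987 654 3210 or 1234567890123."

# One digit followed by nine more, each optionally preceded by a space or dash,
# and not glued to further digits on either side.
candidates = re.findall(r"(?<!\d)\d(?:[ -]?\d){9}(?!\d)", text)

# Normalise by removing every non-digit character.
numbers = [re.sub(r"\D", "", c) for c in candidates]

print(candidates)  # ['98765-43210', '987 654 3210']
print(numbers)     # ['9876543210', '9876543210']
```

The 13-digit run at the end is ignored because no 10-digit window inside it is free of neighbouring digits.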
Mail Extraction
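Basic email pattern: r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b". Note that | has no special meaning inside square brackets, so the top-level domain class is written [A-Za-z] rather than [A-Z|a-z], which would also accept a literal | character.
text = "Email: user@example.com"
print(re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text)) # Output: ['user@example.com']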
Different Mail ID Patterns
Addresses may include subdomains or country-code domains such as .co.uk. The basic pattern handles these as well, although it is a practical approximation rather than a full validator of email address syntax.
text = "Emails: user.name@sub.example.co.uk, test@domain.com"
print(re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text)) # Output: ['user.name@sub.example.co.uk', 'test@domain.com']
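If the local part and the domain are needed separately, capturing groups can split each address as it is found. This is a sketch built on the same basic pattern; the EMAIL name and the idea of collecting unique domains are illustrative additions.

```python
import re

# Group 1 is the local part, group 2 the domain.
EMAIL = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")

text = "Emails: user.name@sub.example.co.uk, test@domain.com"

parts = [(m.group(1), m.group(2)) for m in EMAIL.finditer(text)]
domains = sorted({domain.lower() for _, domain in parts})

print(parts)    # [('user.name', 'sub.example.co.uk'), ('test', 'domain.com')]
print(domains)  # ['domain.com', 'sub.example.co.uk']
```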
Date Extraction
Extract dates in YYYY-MM-DD format: r"\d{4}-\d{2}-\d{2}".
text = "Dates: 2023-01-01 and 2024-12-31"
print(re.findall(r"\d{4}-\d{2}-\d{2}", text)) # Output: ['2023-01-01', '2024-12-31']
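The pattern checks only the shape of a date, so a string such as 2023-02-30 would pass. One possible way to filter out impossible dates, sketched here with a hypothetical extract_dates helper, is to name the groups and hand them to datetime.date, which raises ValueError for invalid combinations.

```python
import re
from datetime import date

DATE = re.compile(r"\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b")

def extract_dates(text):
    """Return date objects for every well-formed, real calendar date in text."""
    found = []
    for m in DATE.finditer(text):
        try:
            found.append(date(int(m["year"]), int(m["month"]), int(m["day"])))
        except ValueError:
            # Right shape, but not a real date (for example February 30).
            pass
    return found

dates = extract_dates("Dates: 2023-01-01, 2023-02-30 and 2024-12-31")
print(dates)  # [datetime.date(2023, 1, 1), datetime.date(2024, 12, 31)]
```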
Password Extraction
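Pattern for strong passwords (at least 8 characters, with an uppercase letter, a lowercase letter, a digit, and a special character): r"(?<!\S)(?=\S{8,})(?=\S*[A-Z])(?=\S*[a-z])(?=\S*\d)(?=\S*[@$!%*?&])[\w@$!%*?&]+(?!\S)". The lookaheads use \S* rather than .* so that each check stays inside one whitespace-delimited token; with .*, the label "Password" in the text below would also match, because the digit and the symbol would be found later in the line.
text = "Password: Passw0rd!"
print(re.findall(r"(?<!\S)(?=\S{8,})(?=\S*[A-Z])(?=\S*[a-z])(?=\S*\d)(?=\S*[@$!%*?&])[\w@$!%*?&]+(?!\S)", text)) # Output: ['Passw0rd!']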
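When the goal is to check one candidate password rather than to find it in running text, a series of small patterns can be easier to read than a single lookahead-heavy one. This is only a sketch: the is_strong helper and its rule list are assumptions that mirror the criteria above.

```python
import re

def is_strong(password):
    """Check each rule with its own small pattern."""
    rules = [
        r".{8,}",       # at least 8 characters
        r"[A-Z]",       # an uppercase letter
        r"[a-z]",       # a lowercase letter
        r"\d",          # a digit
        r"[@$!%*?&]",   # one of the listed special characters
    ]
    return all(re.search(rule, password) for rule in rules)

print(is_strong("Passw0rd!"))  # True
print(is_strong("password"))   # False
```

Keeping the rules separate also makes it straightforward to report which requirement failed.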
URL Extraction
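Basic URL pattern: r"https?://[\w.-]+\.[a-zA-Z]{2,}(?:/\S*)?". The optional path uses a non-capturing group, (?:...), because re.findall returns only the group's text when a pattern contains a capturing group.
text = "Visit https://example.com/path"
print(re.findall(r"https?://[\w.-]+\.[a-zA-Z]{2,}(?:/\S*)?", text)) # Output: ['https://example.com/path']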
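A common division of labour is to let a loose regex locate URL candidates and let the standard library take them apart. The sketch below assumes URLs are delimited by whitespace and may carry trailing punctuation from the sentence; urllib.parse.urlsplit then extracts the host name.

```python
import re
from urllib.parse import urlsplit

text = "Docs at https://example.com/path?q=1, mirror: http://sub.example.org."

# Grab everything up to the next whitespace, then trim sentence punctuation.
urls = [u.rstrip(".,;:!?") for u in re.findall(r"https?://\S+", text)]

# Let the URL parser, not the regex, identify the components.
hosts = [urlsplit(u).hostname for u in urls]

print(urls)   # ['https://example.com/path?q=1', 'http://sub.example.org']
print(hosts)  # ['example.com', 'sub.example.org']
```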
Vehicle Number Extraction
Pattern for Indian vehicle numbers (e.g., AB12CD3456): r"[A-Z]{2}\d{2}[A-Z]{2}\d{4}".
text = "Vehicle: MH12AB1234"
print(re.findall(r"[A-Z]{2}\d{2}[A-Z]{2}\d{4}", text)) # Output: ['MH12AB1234']
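The same pattern can be hardened a little. In this sketch, word boundaries keep it from matching inside longer tokens, re.IGNORECASE accepts lowercase input, and capturing groups split the number into its four parts. The sample text is invented.

```python
import re

PLATE = re.compile(r"\b([A-Z]{2})(\d{2})([A-Z]{2})(\d{4})\b", re.IGNORECASE)

text = "Vehicles: MH12AB1234, ka05mn4321, XMH12AB12345"

# Normalise every match to uppercase.
plates = [m.group().upper() for m in PLATE.finditer(text)]

# The groups of the first match: letters, digits, letters, digits.
first_parts = PLATE.search(text).groups()

print(plates)       # ['MH12AB1234', 'KA05MN4321']
print(first_parts)  # ('MH', '12', 'AB', '1234')
```

The last token is rejected because it has an extra leading letter and an extra trailing digit, so no boundary-delimited match exists inside it.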
Case Study: Log File Analysis
Suppose you have a log file with entries like: "ERROR 2023-01-01 12:00:00 - User ID: 123 - Invalid login from IP: 192.168.1.1". Use regex to extract errors, dates, user IDs, and IPs.
log = "ERROR 2023-01-01 12:00:00 - User ID: 123 - Invalid login from IP: 192.168.1.1"
error_pattern = r"ERROR (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - User ID: (\d+) - .* IP: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
match = re.search(error_pattern, log)
if match:
    date, user_id, ip = match.groups()
    print(f"Date: {date}, User: {user_id}, IP: {ip}") # Output: Date: 2023-01-01 12:00:00, User: 123, IP: 192.168.1.1
This case study demonstrates extracting structured data from logs using regex groups for analysis or reporting.
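The same idea scales to a whole file. The sketch below assumes every entry follows the format shown above; the extra log lines, the INFO level, and the group names are hypothetical. It uses named groups so each entry becomes a dictionary, skips lines that do not fit, and counts errors per IP address.

```python
import re
from collections import Counter

LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+) "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - "
    r"User ID: (?P<user_id>\d+) - "
    r"(?P<message>.*?) from IP: (?P<ip>\d{1,3}(?:\.\d{1,3}){3})$"
)

# Hypothetical sample data in the article's log format.
logs = """\
ERROR 2023-01-01 12:00:00 - User ID: 123 - Invalid login from IP: 192.168.1.1
INFO 2023-01-01 12:00:05 - User ID: 456 - Successful login from IP: 10.0.0.7
ERROR 2023-01-01 12:01:10 - User ID: 123 - Invalid login from IP: 192.168.1.1
this line does not follow the format
"""

records = []
for line in logs.splitlines():
    m = LOG_LINE.match(line)
    if m:  # silently skip lines that do not fit the expected format
        records.append(m.groupdict())

errors_by_ip = Counter(r["ip"] for r in records if r["level"] == "ERROR")

print(len(records))           # 3
print(records[1]["message"])  # Successful login
print(errors_by_ip)           # Counter({'192.168.1.1': 2})
```

For a real file, the loop would iterate over the open file object instead of logs.splitlines().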
Conclusion
Regular expressions in Python, powered by the re module, are versatile for text processing. From basic matching to complex data extraction, mastering regex enhances your ability to handle strings efficiently. Practice with the examples above to apply these concepts in your projects!
