Regular Expressions in Python: A Step-by-Step Guide


Regular expressions (regex) are powerful tools for pattern matching and text manipulation in Python. They allow you to search, match, and replace substrings based on specific patterns. This article covers the fundamentals of regular expressions, the difference between strings and regex strings, the functions in the re module, common expressions, and practical examples for data extraction, culminating in a case study.

Understanding Regular Expressions

A regular expression is a sequence of characters that defines a search pattern. It can include literal characters, metacharacters, and operators to match strings flexibly. Regex is used for validation, extraction, and substitution tasks, such as checking email formats or extracting phone numbers from text.

In Python, regex is handled by the re module, which provides functions to work with patterns.

String vs. Regular Expression String

A regular string is a sequence of characters, like "hello". Regex patterns are usually written as raw strings, prefixed with r, in which Python leaves backslashes (\) untouched instead of interpreting them as escape sequences.

# Regular string
s = "\n"  # Interpreted as newline
print(len(s))  # Output: 1

# Raw string (regex string)
rs = r"\n"  # Treated as backslash and 'n'
print(len(rs))  # Output: 2

Raw strings are the recommended way to write regex patterns, because sequences like \d (digit) and \b (word boundary) reach the regex engine unchanged and need no extra escaping.
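
One place where the difference shows up in practice is the word-boundary token \b: in a regular string, Python turns it into a backspace character before the regex engine ever sees it. A small sketch, with sample text made up for illustration:

```python
import re

text = "the cat sat"

# Regular string: "\b" becomes the backspace character (\x08), so nothing matches
plain_hit = re.search("\bcat\b", text)

# Raw string: the regex engine receives \b and treats it as a word boundary
raw_hit = re.search(r"\bcat\b", text)

print(plain_hit)        # None
print(raw_hit.group())  # cat
```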

The “re” Module Functions

The re module provides several functions for regex operations. Import it with import re.

match()

re.match(pattern, string) checks for a match only at the beginning of the string.

import re

pattern = r"\d+"
string = "123abc"
match = re.match(pattern, string)
print(match.group() if match else "No match")  # Output: 123

search()

re.search(pattern, string) searches for the first occurrence anywhere in the string.

match = re.search(r"\d+", "abc123def")
print(match.group() if match else "No match")  # Output: 123
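
The difference between match() and search() is easiest to see when the same pattern is run against the same input. The sketch below also includes re.fullmatch(), which this guide does not otherwise cover, as a point of comparison; the sample strings are illustrative.

```python
import re

pattern = r"\d+"

# match() anchors at position 0, so leading letters make it fail
at_start = re.match(pattern, "abc123")
print(at_start)  # None

# search() scans forward until it finds the first match
anywhere = re.search(pattern, "abc123")
print(anywhere.group())  # 123

# fullmatch() succeeds only if the entire string fits the pattern
partial = re.fullmatch(pattern, "123abc")
whole = re.fullmatch(pattern, "123")
print(partial)        # None
print(whole.group())  # 123
```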

split()

re.split(pattern, string) splits the string by occurrences of the pattern.

result = re.split(r"\s+", "Hello   World!")
print(result)  # Output: ['Hello', 'World!']
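
Two further behaviours of re.split() are worth a quick sketch: the maxsplit argument, and the effect of a capturing group in the pattern. The input strings here are made up for illustration.

```python
import re

text = "a, b;c ,d"

# Split on a comma or semicolon with optional surrounding whitespace
parts = re.split(r"\s*[,;]\s*", text)
print(parts)  # ['a', 'b', 'c', 'd']

# maxsplit limits the number of splits; the remainder stays intact
first_split = re.split(r"\s*[,;]\s*", text, maxsplit=1)
print(first_split)  # ['a', 'b;c ,d']

# A capturing group keeps the delimiters in the result
with_delims = re.split(r"([,;])", "a,b;c")
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```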

findall()

re.findall(pattern, string) returns all non-overlapping matches as a list.

result = re.findall(r"\d+", "abc123def456")
print(result)  # Output: ['123', '456']
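
When the pattern contains capturing groups, findall() returns the groups rather than the whole match, which can be surprising. The sketch below shows that behaviour alongside re.finditer(), which yields match objects when positions are needed as well; the sample text is illustrative.

```python
import re

text = "x=10, y=25"

# With capturing groups, findall() returns a tuple of groups per match
pairs = re.findall(r"(\w)=(\d+)", text)
print(pairs)  # [('x', '10'), ('y', '25')]

# finditer() yields match objects, which expose positions as well as text
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(spans)  # [('10', (2, 4)), ('25', (8, 10))]
```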

compile()

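re.compile(pattern) compiles a regex pattern into a pattern object for repeated use. This keeps the pattern in one place and avoids repeated lookups when it is applied many times; the re module also caches recently used patterns, so the speed gain is usually modest.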

pattern = re.compile(r"\d+")
result = pattern.findall("abc123def456")
print(result)  # Output: ['123', '456']
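
A compiled pattern can also carry flags, so they do not have to be repeated at every call. A minimal sketch using re.IGNORECASE, with a made-up list of lines:

```python
import re

# Compile once with a flag, then reuse the object's methods
word = re.compile(r"error", re.IGNORECASE)

lines = ["Error: disk full", "all good", "fatal ERROR"]
hits = [line for line in lines if word.search(line)]
print(hits)  # ['Error: disk full', 'fatal ERROR']
```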

sub()

re.sub(pattern, repl, string) replaces occurrences of the pattern with repl.

result = re.sub(r"\d+", "X", "abc123def456")
print(result)  # Output: abcXdefX
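
The repl argument does not have to be a fixed string. The sketch below shows three common variations: backreferences to captured groups, a function as the replacement, and the count argument. The helper name double and the sample strings are illustrative.

```python
import re

# Backreferences: reorder captured groups (YYYY-MM-DD -> DD/MM/YYYY)
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "Due 2023-01-15")
print(swapped)  # Due 15/01/2023

# Callable replacement: double every number found in the text
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 12 pears")
print(doubled)  # 6 apples and 24 pears

# count limits how many replacements are made
limited = re.sub(r"\d+", "X", "abc123def456", count=1)
print(limited)  # abcXdef456
```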

subn()

re.subn(pattern, repl, string) is like sub() but returns a tuple of the new string and the number of substitutions.

result, count = re.subn(r"\d+", "X", "abc123def456")
print(result, count)  # Output: abcXdefX 2

Expressions Using Operators and Symbols

Regex patterns use operators and symbols to define matches.

Simple Character Matches

Literal characters match themselves, e.g., r"abc" matches "abc".

Special Characters

Special characters include:

  • .: Matches any character except newline.
  • ^: Matches start of string.
  • $: Matches end of string.
  • *: Zero or more repetitions.
  • +: One or more repetitions.
  • ?: Zero or one repetition.
  • {m,n}: Between m and n repetitions.
  • |: OR operator.
  • (): Grouping.
  • \: Escape special characters.

pattern = r"a.b"  # Matches 'a' followed by any char followed by 'b'
print(re.search(pattern, "axb").group())  # Output: axb
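
The remaining symbols in the list can be tried out the same way. The following sketch exercises the anchors, quantifiers, alternation, and escaping on small made-up inputs:

```python
import re

# ^ and $ anchor the pattern to the start and end of the string
anchored = [bool(re.search(r"^cat$", s)) for s in ("cat", "concat")]
print(anchored)  # [True, False]

# Quantifiers: ? (zero or one), * (zero or more), {m,n} (bounded)
optional_u = re.findall(r"colou?r", "color colour")
star = re.findall(r"ab*", "a ab abbb")
bounded = re.findall(r"\d{2,3}", "1 12 1234")
print(optional_u, star, bounded)
# ['color', 'colour'] ['a', 'ab', 'abbb'] ['12', '123']

# | chooses between alternatives; \ makes the dot literal
either = re.findall(r"cat|dog", "cat dog bird")
literal_dot = re.findall(r"\w+\.py", "run main.py or mainxpy")
print(either, literal_dot)  # ['cat', 'dog'] ['main.py']
```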

Character Classes

Character classes define sets of characters:

  • [abc]: Matches a, b, or c.
  • [^abc]: Matches anything except a, b, c.
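  • \d: Digit (0-9; for str patterns, also other Unicode decimal digits unless re.ASCII is set).
  • \D: Non-digit.
  • \w: Word character (a-z, A-Z, 0-9, _; for str patterns, also Unicode letters and digits unless re.ASCII is set).
  • \W: Non-word character.
  • \s: Whitespace.
  • \S: Non-whitespace.

pattern = r"\d{3}"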
print(re.findall(pattern, "123abc456"))  # Output: ['123', '456']
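
A few more class forms in action: a range-based class, a negated class, and \w with and without the re.ASCII flag. This is a small sketch on made-up inputs, not an exhaustive tour.

```python
import re

# Ranges inside a class: hexadecimal digits
hex_runs = re.findall(r"[0-9a-fA-F]+", "ff 1A zz")
print(hex_runs)  # ['ff', '1A']

# Negated class: strip everything that is not a digit
digits_only = re.sub(r"[^0-9]", "", "(555) 010-9999")
print(digits_only)  # 5550109999

# \w matches Unicode letters for str patterns unless re.ASCII is set
unicode_words = re.findall(r"\w+", "naïve café")
ascii_words = re.findall(r"\w+", "naïve café", re.ASCII)
print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```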

Mobile Number Extraction

Pattern for a 10-digit mobile number: r"\b\d{10}\b".

text = "Contact: 1234567890 or 0987654321"
print(re.findall(r"\b\d{10}\b", text))  # Output: ['1234567890', '0987654321']
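
The \b word boundaries matter here: without them, the pattern would also pick up the first ten digits of any longer number. A short sketch with a made-up order number shows the difference:

```python
import re

text = "Order 123456789012, phone 9876543210"

# Without boundaries, the first 10 digits of the 12-digit order number match too
loose = re.findall(r"\d{10}", text)
print(loose)   # ['1234567890', '9876543210']

# With \b on both sides, only standalone 10-digit runs match
strict = re.findall(r"\b\d{10}\b", text)
print(strict)  # ['9876543210']
```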

Mail Extraction

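Basic email pattern: r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b". Note that | has no special meaning inside a character class, so [A-Za-z] is all that is needed for the letters of the top-level domain.

text = "Email: user@example.com"
print(re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text))  # Output: ['user@example.com']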

Different Mail ID Patterns

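Addresses with dotted local parts, subdomains, or country-code domains are matched by the same pattern, because the dot is allowed on both sides of the @. The pattern is a practical approximation, not a full validator of email address syntax.

text = "Emails: user.name@sub.example.co.uk, test@domain.com"
print(re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text))  # Output: ['user.name@sub.example.co.uk', 'test@domain.com']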

Data Extraction

Extract dates in YYYY-MM-DD format: r"\d{4}-\d{2}-\d{2}".

text = "Dates: 2023-01-01 and 2024-12-31"
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))  # Output: ['2023-01-01', '2024-12-31']
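
The pattern checks only the shape of a date, so a string like 2023-13-45 would also match. One possible way to filter out impossible dates, sketched here as an assumption rather than part of the original recipe, is to pass each candidate through datetime.strptime():

```python
import re
from datetime import datetime

text = "Dates: 2023-01-01, 2023-13-45 and 2024-12-31"

valid = []
for candidate in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text):
    try:
        # strptime raises ValueError for impossible dates such as month 13
        valid.append(datetime.strptime(candidate, "%Y-%m-%d").date())
    except ValueError:
        pass

valid_iso = [d.isoformat() for d in valid]
print(valid_iso)  # ['2023-01-01', '2024-12-31']
```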

Password Extraction

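Pattern for strong passwords (8+ characters with at least one uppercase letter, lowercase letter, digit, and special character): r"(?<!\S)(?=\S{8,})(?=\S*[A-Z])(?=\S*[a-z])(?=\S*\d)(?=\S*[@$!%*?&])\S+". Each lookahead uses \S* rather than .* so that it inspects only the current whitespace-delimited token, and (?<!\S) ensures the match starts at the beginning of a token. Without these limits, the label "Password" itself would match, because the lookaheads would find the digit and the special character later in the text.

text = "Password: Passw0rd!"
pattern = r"(?<!\S)(?=\S{8,})(?=\S*[A-Z])(?=\S*[a-z])(?=\S*\d)(?=\S*[@$!%*?&])\S+"
print(re.findall(pattern, text))  # Output: ['Passw0rd!']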

URL Extraction

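Basic URL pattern: r"https?://[\w.-]+\.[a-zA-Z]{2,}(?:/\S*)?". The optional path uses a non-capturing group (?:...), because findall() returns only the group contents when the pattern contains a capturing group.

text = "Visit https://example.com/path"
print(re.findall(r"https?://[\w.-]+\.[a-zA-Z]{2,}(?:/\S*)?", text))  # Output: ['https://example.com/path']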

Vehicle Number Extraction

Pattern for Indian vehicle numbers (e.g., AB12CD3456): r"[A-Z]{2}\d{2}[A-Z]{2}\d{4}".

text = "Vehicle: MH12AB1234"
print(re.findall(r"[A-Z]{2}\d{2}[A-Z]{2}\d{4}", text))  # Output: ['MH12AB1234']

Case Study: Log File Analysis

Suppose you have a log file with entries like: "ERROR 2023-01-01 12:00:00 - User ID: 123 - Invalid login from IP: 192.168.1.1". Use regex to extract errors, dates, user IDs, and IPs.

log = "ERROR 2023-01-01 12:00:00 - User ID: 123 - Invalid login from IP: 192.168.1.1"
error_pattern = r"ERROR (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - User ID: (\d+) - .* IP: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
match = re.search(error_pattern, log)
if match:
    date, user_id, ip = match.groups()
    print(f"Date: {date}, User: {user_id}, IP: {ip}")  # Output: Date: 2023-01-01 12:00:00, User: 123, IP: 192.168.1.1
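
Real log files contain many lines, so the same idea is usually applied with a compiled pattern, re.MULTILINE, and finditer(). The sketch below assumes a hypothetical log excerpt in the same layout as the entry above; the named groups (timestamp, user_id, ip) and the per-IP tally are illustrative choices, not a fixed recipe.

```python
import re
from collections import Counter

# Hypothetical log excerpt in the same layout as the single entry above
log_text = """\
ERROR 2023-01-01 12:00:00 - User ID: 123 - Invalid login from IP: 192.168.1.1
INFO 2023-01-01 12:00:05 - User ID: 456 - Login ok from IP: 10.0.0.7
ERROR 2023-01-01 12:01:30 - User ID: 123 - Invalid login from IP: 192.168.1.1
ERROR 2023-01-02 08:15:00 - User ID: 789 - Invalid login from IP: 172.16.0.4
"""

# re.MULTILINE lets ^ and $ match at the start and end of every line
entry = re.compile(
    r"^ERROR (?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r" - User ID: (?P<user_id>\d+)"
    r" - .* IP: (?P<ip>\d{1,3}(?:\.\d{1,3}){3})$",
    re.MULTILINE,
)

# One dict per ERROR line, keyed by group name
errors = [m.groupdict() for m in entry.finditer(log_text)]
per_ip = Counter(e["ip"] for e in errors)

print(len(errors))            # 3
print(per_ip.most_common(1))  # [('192.168.1.1', 2)]
```

Named groups make the extracted fields self-describing, which helps once the results are passed on to reporting code.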

This case study demonstrates extracting structured data from logs using regex groups for analysis or reporting.

Conclusion

Regular expressions in Python, powered by the re module, are versatile for text processing. From basic matching to complex data extraction, mastering regex enhances your ability to handle strings efficiently. Practice with the examples above to apply these concepts in your projects!
