Rustling up Robustness: Mastering Regex in Rust

Introduction

Pattern matching is a process of checking whether a sequence of characters exists in a given text. It is typically used in programs for input validation and text replacement, amongst other things.

Pattern matching is achievable through the use of regular expressions; the term is usually contracted to regex (or regexp).

Regex may also refer to the pattern used by the regex engine to validate data, retrieve matches, or replace them.

In this article, we shall explore in depth what regular expressions are and how they may be used in the Rust programming language.

Regex Syntax

At a glance, the regular expression is made of rather cryptic arrangements of letters and special characters. This section is positioned to demystify them.

The regex syntax is made of different classifications which are listed thus:

  1. Character classes

  2. Assertions

  3. Groups and backreferences

  4. Quantifiers

Character Classes

Character classes form the basis of regular expressions. They distinguish different types of characters, for example, letters, digits, whitespace, etc.

The common character classes are listed thus:

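  • \d matches any digit, for example 0-9. In the Rust regex crate, \d is Unicode-aware by default, so it also matches digits from other scripts; use [0-9] when only ASCII digits are wanted

  • \w matches a word character: a letter, a digit, or an underscore

  • \D matches any non-digit character, for example, G will be matched in G5

  • \W matches any non-word character, for example, ! will be matched in G5!

  • \s matches a single whitespace character (space, tab, newline, etc.)

  • \S matches any character other than whitespace

  • . matches any single character other than the newline \n. For example, \w. matches a word character followed by any one character

  • + matches one or more of the preceding item. Strictly speaking it is a quantifier rather than a character class; see the Quantifiers section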

Assertions

Assertions describe the boundaries of a pattern rather than the characters themselves. Typically, they mark the start or end of a word or of the whole input.

The common syntax used for assertions is listed thus:

  • ^ indicates the beginning of the input, for example, when the case-insensitive flag is not set, the regex ^A will match A in A Swiss army knife but not in a Swiss army knife

  • $ indicates the end of the input, for example, knife$ will match knife in A Swiss army knife

  • \b matches a word boundary, that is, the position between a word character and a non-word character

  • x(?=y) matches x only if it is followed by y (lookahead)

  • x(?!y) matches x only if it is not followed by y (negative lookahead)

  • (?<=y)x matches x only if it is preceded by y (lookbehind)

  • (?<!y)x matches x only if it is not preceded by y (negative lookbehind)

Note that the last four are look-around assertions. They are common in other regex engines, but the Rust regex crate does not support them.

Groups and backreferences

Groups allow us to treat multiple patterns as a single unit, while backreferences allow us to refer to a previously captured group. Consider the syntax below:

  • (x) a capture group that matches a pattern x and remembers the match. Capture groups are numbered from left to right and can be referenced later in a replacement string using $1, $2, ... $n

  • (?:x) a non-capturing group that matches x but does not remember the match, so it cannot be referenced later using $1, $2, ... $n

Quantifiers

As the name implies, quantifiers deal with counts; the number of characters or expressions to be matched.

Consider the syntax below.

  • * Matches the preceding item zero or more times. For example, pop*e will match poe, pope, poppppe, etc

  • + Matches the preceding item one or more times

  • ? Matches the preceding item zero or one time

  • {n} matches the preceding item exactly n times

  • {n,} matches the preceding item n or more times

  • {n,m} matches the preceding item at least n times and at most m times

That should get us up and running for the Regular Expression overview. We shall now examine Regex in Rust. 😄

Using Regex in Rust

Just before we plunge into it, it is worth mentioning flags, which are also called modifiers.

Flags are used to configure how characters are matched. In the Rust regex crate they are written inside the pattern, for example (?i), or set through RegexBuilder. They include:

  • i Case-insensitive flag - this is used to disregard case sensitivity. As it turns out, regex is case-sensitive by default

  • g Global match flag - in engines such as JavaScript's, this retrieves all matches rather than only the first one. The Rust regex crate has no g flag; methods such as find_iter and captures_iter return every match instead

  • m Multi-line flag - this makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole input.

With that out of the way, let's get our hands dirty! 💥 If you don't have the Rust toolchain installed on your device (laptop, desktop ...), look up the installation guide at https://www.rust-lang.org/tools/install

The next steps assume a minimal experience with the Rust Programming language or other programming languages. The source code used in this example is available at https://github.com/opeolluwa/blog/regex-in-rust

The Regex Crate

Unlike JavaScript, the Rust standard library does not include a regex engine. Instead, we'll be using the regex crate from the package registry https://crates.io 📦

At a glance, the regex crate (v1.60.0) offers two structs that matter for this article, Regex and RegexBuilder, and a handful of methods on them: new, is_match, find ...

The Regex struct follows the default configuration; most notably, matching is case-sensitive and find returns the first match. The RegexBuilder struct allows us to configure the regex engine to taste. Please see the detailed documentation

Examples

Regex can be intimidating to use, especially in a language like Rust, which has a reputation for a very steep learning curve.

Regardless, one way to work this out could be to work out the pattern first, then implement it in Rust.

We'll explore these examples using the combination of all we've discussed hitherto.

  1. Email validation

  2. URL validation

  3. Date format validation (YYYY-MM-DD)

  4. Phone number validation (Nigeria format)

Email Validation

Here's a regex to validate emails. Let's break down the components!

^[a-zA-Z0-9._-]+@[a-zA-Z0-9]+\.\w{2,}$
  1. ^: The caret symbol anchors the pattern to the beginning of the string. It signifies that the pattern should match at the start of the input.

  2. [a-zA-Z0-9._-]+: This is a character class that matches one or more occurrences (+ quantifier) of characters that are either lowercase letters (a-z), uppercase letters (A-Z), digits (0-9), underscores (_), dots (.), or hyphens (-).

  3. @: This part matches the symbol "@" literally. It looks for the "@" symbol in the input.

  4. [a-zA-Z0-9]+: This is another character class that matches one or more occurrences of characters that are either lowercase letters (a-z), uppercase letters (A-Z), or digits (0-9). It represents the domain part of the email address, after the "@" symbol.

  5. \.: This part matches the period (dot) character literally. It looks for a period in the input.

  6. \w{2,}: This is a shorthand character class that matches word characters (letters, digits, or underscores). The {2,} quantifier specifies that it should match two or more occurrences of word characters. It represents the top-level domain part of the email address (e.g., .com, .org, .edu).

  7. $: The dollar sign anchors the pattern to the end of the string. It signifies that the pattern should match at the end of the input.

So, the overall regex ^[a-zA-Z0-9._-]+@[a-zA-Z0-9]+\.\w{2,}$ is designed to match email addresses that follow a specific format:

  • The local part can contain a combination of alphanumeric characters, underscores, dots, or hyphens.

  • It is followed by the "@" symbol.

  • The domain part can contain alphanumeric characters only.

  • It is followed by a period (dot) and then a top-level domain with at least two characters.

An example of a valid email address that matches the pattern is sampl44e-user@mailer.com, which is used in the implementation further down.

Please note that this regex is a simplified example and may not cover all possible variations of email addresses or internationalized domain names (IDNs). Depending on your specific use case, you might need to adjust or enhance the regex to handle other email formats or special cases.

Here's the implementation in Rust

use regex::Regex;

fn main() {
    let valid_email = "sampl44e-user@mailer.com";
    let invalid_email = "sample-user@mailer";
    let re = Regex::new(r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9]+\.\w{2,}$")
        .expect("error parsing the regex syntax");

    println!(
        "is the email {} valid? {}",
        valid_email,
        re.is_match(valid_email)
    ); // is the email sampl44e-user@mailer.com valid? true

    println!(
        "is the email {} valid? {}",
        invalid_email,
        re.is_match(invalid_email)
    ); // is the email sample-user@mailer valid? false
}

URL Validation

Here's a regex for URL validation. Let's break down the components of the regex pattern:

http(s)?:\/\/([\d\w]+)\.([\d\w]+)
  1. http: This part matches the character "http" literally. It looks for the "http" sequence in the input.

  2. (s)?: The question mark ? is a quantifier that makes the preceding (s) group optional. The (s) is a capturing group that matches the character "s" literally. So, this part allows the pattern to match both "http" and "https".

  3. :\/\/: This part matches the characters "://" literally. The backslashes escape the forward slashes, a habit carried over from languages that delimit a regex with slashes; in Rust the forward slash has no special meaning, so :// works as well.

  4. ([\d\w]+): This is a capturing group that matches one or more word characters or digits. \d is a shorthand character class for digits, and \w is a shorthand character class for word characters (letters, digits, or underscores), so \w alone would cover both. The + quantifier means it matches one or more occurrences of the preceding character class.

  5. \.: This part matches the period (dot) character literally. It looks for a period in the input.

  6. ([\d\w]+): Similar to the previous explanation, this is another capturing group that matches one or more word characters or digits.

So, the overall regex http(s)?:\/\/([\d\w]+)\.([\d\w]+) is designed to match URLs that start with either "http://" or "https://", followed by a domain name (a combination of word characters and digits), and then a top-level domain (another combination of word characters and digits).

An example of a URL that matches the pattern is https://docs.regex.rust, which is used in the implementation further down.

Please note that this regex is a basic example and may not cover all possible URL variations. Depending on your specific use case, you might need to adjust or enhance the regex to handle other URL formats or special cases.

Here's the implementation in Rust

use regex::Regex;
fn main() {
    let valid_url = "https://docs.regex.rust";
    let invalid_url = "http:://.rg.o";

    let re = Regex::new(r"http(s)?:\/\/([\d\w]+)\.([\d\w]+)").expect("error parsing regex");

    println!("is the url {} valid? {}", valid_url, re.is_match(valid_url));
    // is the url https://docs.regex.rust valid? true

    println!(
        "is the url {} valid? {}",
        invalid_url,
        re.is_match(invalid_url)
    );
    //is the url http:://.rg.o valid? false
}

Date Format Validation

This use case assumes the date is required in the format YYYY-MM-DD. Let's break down the components of the regex pattern:

^\d{4}-\d{2}-\d{2}$
  1. ^: The caret symbol anchors the pattern to the beginning of the string. It signifies that the pattern should match at the start of the input.

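  2. \d: This is a character class that matches any digit. For ASCII input it behaves like [0-9]; note that the Rust regex crate makes \d Unicode-aware by default, so it also accepts digits from other scripts.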

  3. {4}: This is a quantifier that follows \d. It specifies that the preceding \d (which matches a single digit) should occur exactly 4 times. In other words, it requires four consecutive digits to match.

  4. -: This is a literal hyphen. It matches the hyphen character exactly as it appears.

  5. \d{2}: Similarly to the previous explanation, this matches exactly two consecutive digits.

  6. -: Another literal hyphen.

  7. \d{2}: Again, matches exactly two consecutive digits.

  8. $: The dollar sign anchors the pattern to the end of the string. It signifies that the pattern should match at the end of the input.

Putting it all together, the regex ^\d{4}-\d{2}-\d{2}$ matches a specific date format:

  • It should start with four digits (year).

  • Followed by a hyphen.

  • Then two digits (month).

  • Followed by another hyphen.

  • Finally, two digits (day).

Examples of valid dates that match the pattern:

  • 2023-07-15

  • 1999-12-31

Keep in mind that this regex enforces a specific date format (YYYY-MM-DD). If you need to validate dates in a different format or allow variations in separators (e.g., slashes instead of hyphens), you would need to adjust the regex pattern accordingly.

use regex::Regex;

fn main() {
    let valid_date = "1999-05-25";
    let invalid_date = "05-25-1999";

    let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").expect("error parsing regex");
    println!(
        "is the date \"{}\" format correct? {}",
        valid_date,
        re.is_match(valid_date)
    ); // is the date "1999-05-25" format correct? true

    println!(
        "is the date \"{}\" format correct? {}",
        invalid_date,
        re.is_match(invalid_date)
    ); // is the date "05-25-1999" format correct? false
}

Nigeria Phone number validation

Let's break down the components of the regex pattern, which may be used to validate phone numbers in Nigeria.

^\+234\d{10}$
  1. ^: The caret symbol anchors the pattern to the beginning of the string. It signifies that the pattern should match at the start of the input.

  2. \+: This is an escape sequence for the plus symbol +. It matches the literal plus symbol in the input. In regex, the plus symbol has a special meaning, so when we want to match the actual plus symbol itself, we need to escape it with a backslash.

  3. 234: This part of the pattern matches the literal characters "234" in the input. It's a fixed part of the pattern that should be present for a match.

  4. \d: This is a character class that matches any digit. For ASCII input it behaves like [0-9]; note that the Rust regex crate makes \d Unicode-aware by default, so it also accepts digits from other scripts.

  5. {10}: This is a quantifier that follows \d. It specifies that the preceding \d (which matches a single digit) should occur exactly 10 times. In other words, it requires ten consecutive digits to match.

  6. $: The dollar sign anchors the pattern to the end of the string. It signifies that the pattern should match at the end of the input.

Putting it all together, the regex ^\+234\d{10}$ matches a specific format for a phone number:

  • It must start with the country code +234 (which is the country code for Nigeria).

  • It must be followed by exactly 10 digits, representing the local phone number.

Example of a valid phone number: +2348051234567

Keep in mind that this regex is tailored specifically for Nigerian phone numbers starting with the country code +234. If you want to validate other phone number formats, you would need to adjust the regex pattern accordingly.

use regex::Regex;

fn main() {
    let valid_phone = "+2340122863541";
    let invalid_phone = "+245090907468449";
    let re = Regex::new(r"^\+234\d{10}$").expect("error parsing regex");

    println!(
        "is the phone number \"{}\" valid {}",
        valid_phone,
        re.is_match(valid_phone)
    ); // is the phone number "+2340122863541" valid true

    println!(
        "is the phone number \"{}\" valid {}",
        invalid_phone,
        re.is_match(invalid_phone)
    ); //is the phone number "+245090907468449" valid false
}

Conclusion

TL;DR

A big shout out to you if you've come this far to get a good grasp of Regex in Rust.

As it turns out, Regex in Rust isn't hard. It's just details and intricacies built up in bits and bytes over time until they become the bundle of cryptic arrangements we see. The source code used in this post is available at https://github.com/opeolluwa/blog/regex-in-rust 😄

Till the next one, 👋