What Is Regex? A Primer on Writing & Using Regular Expressions

Regular expressions can be used for a variety of string processing tasks, from validating data to searching large amounts of text quickly. By understanding regular expressions, you can create powerful tools to find exactly what you’re looking for.

Regular expressions are powerful tools that can be used to search, manipulate, and validate strings.

While they may seem complex at first, understanding the concepts and components of regexes is essential for any developer.

What is a regular expression?

A regular expression, commonly referred to as regex or regexp, is a sequence of characters that defines a search pattern. It is mainly used for text-based searching and string manipulation.

Regular expressions are often used in web development to validate user input or find specific strings of characters within larger blocks of text. They are also widely used in data science, natural language processing, and text analytics.

Regular expressions are written using a combination of special characters and literal characters. These special characters allow regex to be used to find patterns in strings, while the literal characters are used as exact matches for specific words or phrases.

For example, if you wanted to match any string that starts with “ABC” and ends in “xyz”, you could use the following regex:

/^ABC.*xyz$/

This regex would match any string that begins with “ABC” and contains any number of characters before ending in “xyz”.
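As a minimal sketch, the pattern can be tried with Python's standard re module. The sample strings are made up for illustration, and the surrounding slashes (delimiters used by JavaScript and some other languages) are dropped because Python does not use them.

```python
import re

# The pattern from above, without the JavaScript-style / delimiters.
pattern = re.compile(r"^ABC.*xyz$")

def starts_abc_ends_xyz(text):
    """Return True if text begins with ABC and ends with xyz."""
    return pattern.search(text) is not None

print(starts_abc_ends_xyz("ABC-123-xyz"))  # True
print(starts_abc_ends_xyz("ABCxyz"))       # True: .* can match nothing
print(starts_abc_ends_xyz("abc-123-xyz"))  # False: the match is case-sensitive
```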

Regular expressions can be used for a variety of tasks, from validating data to searching large amounts of text quickly. They are an essential tool for any programmer or data scientist working with strings and text-based data.

Components of Regexes

Regexes are composed of several different components that work together to define a search pattern. The most common components include anchors, character classes, quantifiers, and alternation.

Anchors

Anchors match the starting and ending points of a string or line. For example, you can use ^ to match the beginning of a string, and $ to match the end of a string.

Character Classes

Character classes allow you to define specific sets of characters that can be used in a search pattern. For example, you can use \d to match any digit (0-9), or \w to match any word character: a letter, digit, or underscore (a-zA-Z0-9_).

Quantifiers

Quantifiers allow you to specify how many of a character or character class should be matched. For example, you can use * to match zero or more occurrences of the preceding character or character class, or + to match one or more occurrences of the preceding character or character class.

Alternation

Alternation allows you to specify multiple possible search patterns that can be matched. For example, you can use | to match either of two alternatives, such as cat|dog.

These components are just a few of the many that make up regular expressions. Knowing how to use these components correctly is essential for creating powerful regexes and finding specific strings of text.

Basic Concepts

It’s important to understand the basic concepts of regex. The most important concepts include:

  • Notation: A shorthand used to refer to common patterns.
  • Syntax: The set of rules that govern how regexes are written.
  • Patterns: A sequence of characters that define a search pattern.
  • Matching: The process of finding a pattern in a string.
  • Capturing: The process of capturing part or all of a match into a group for later use.
  • Substitution: The process of replacing a matched pattern with another string.
  • Flags: A set of instructions that modify the behavior of a regex.

Syntax and Notation

Before you can begin writing regexes, it’s important to familiarize yourself with the syntax and notation used. The most common syntax elements include:

  • Backslash ( \ ): Escapes special characters and allows them to be used as literal characters.
  • Brackets ( [] ): Defines a set of characters to be matched.
  • Parentheses ( () ): Captures part of a match into a group.
  • Asterisk ( * ): Matches zero or more occurrences of the preceding character or character class.
  • Plus sign ( + ): Matches one or more occurrences of the preceding character or character class.
  • Question mark ( ? ): Matches zero or one occurrences of the preceding character or character class.
  • Pipe ( | ): Matches either the pattern on its left or the pattern on its right.

The most common notation elements include:

  • ^ : Matches the beginning of a string or line.
  • $ : Matches the end of a string or line.
  • \d : Matches any digit (0-9).
  • \w : Matches any word character: a letter, digit, or underscore (a-zA-Z0-9_).

Learning these syntax and notation elements will help you create more powerful regexes.

Patterns

Regexes are composed of patterns that define a search pattern. The most basic regex is simply a sequence of characters. For example, the string `cat` will match any occurrence of the characters “c”, followed by “a”, followed by “t” in a string.

You can also use special characters and character classes to create more powerful regexes. For example, the pattern \d\w+ will match a digit followed by one or more word characters (letters, digits, or underscores), such as “4ever”.

You can also combine multiple patterns together to create complex regexes. For example, the string ^\d{2}-\w+ will match any sequence of two digits, followed by a dash, followed by one or more alphanumeric characters.

Matching

Once you have created a regex to define your search pattern, you can use it to match against strings of text. There are many different tools and programming languages that allow you to do this, including Python, JavaScript, PHP, and Perl.

Capturing

In addition to matching strings of text, regexes can also capture part or all of a match into a group.

This is useful if you want to store part of the matched text for later use.

For example, the regex (\d{2})-(\w+) captures the two digits into one group and the word characters after the dash into another. Without the parentheses, the pattern would still match but would capture nothing.
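Here is a minimal sketch in Python of how captured groups might be read back. It assumes parentheses are placed around each part to be captured, and the sample string is made up for illustration.

```python
import re

# Parentheses create the capture groups: group 1 is the two digits,
# group 2 is the run of word characters after the dash.
pattern = re.compile(r"(\d{2})-(\w+)")

match = pattern.search("order 42-widget shipped")
digits, name = match.group(1), match.group(2)
print(digits)  # 42
print(name)    # widget
```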

Substitution

Regexes can also be used to substitute matched patterns with another string. This is useful if you want to replace certain parts of a string with different text.

For example, the regex (\d{2})-\w+ can be used to replace the two digits, the dash, and the word characters that follow with a new string. Because the digits are captured in a group, the replacement can also reuse them.
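A minimal sketch of substitution with Python's re.sub, using the pattern above. The sample text and the replacement strings are assumptions chosen for illustration.

```python
import re

# Group 1 holds the two digits so the replacement can reuse them as \1.
pattern = re.compile(r"(\d{2})-\w+")

# Replace the whole match with a new string.
replaced = pattern.sub("REDACTED", "item 42-widget in stock")

# Keep the digits but swap the part after the dash.
renamed = pattern.sub(r"\1-gadget", "item 42-widget in stock")

print(replaced)  # item REDACTED in stock
print(renamed)   # item 42-gadget in stock
```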

Flags

Finally, regexes can also use flags to modify their behavior. These flags allow you to change how the regex behaves, such as making it case-insensitive or allowing it to match on multiple lines.
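As a rough illustration, Python exposes these two behaviors as re.IGNORECASE and re.MULTILINE. The three-line sample text below is made up.

```python
import re

text = "Cat\ncat\nCATALOG"

# No flags: case-sensitive, and ^ / $ anchor only at the ends of the whole string.
plain = re.findall(r"^cat$", text)

# IGNORECASE makes letters match regardless of case;
# MULTILINE lets ^ and $ match at the start and end of each line.
flagged = re.findall(r"^cat$", text, flags=re.IGNORECASE | re.MULTILINE)

print(plain)    # []
print(flagged)  # ['Cat', 'cat']
```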

Building & using regular expressions

Regex is an incredibly powerful tool for manipulating strings of text. With a few basic patterns, syntax, and notation elements you can create powerful regexes that can be used to match, capture, and substitute portions of strings.

Your first regex: Finding matches in a string

Now that you understand the basic concepts and syntax of regex, it’s time to begin building your first regex.

  • Step 1: Determine the Pattern. The first step is to determine the pattern you want to match. Generally speaking, this will be a sequence of characters that describe the type of string you are searching for.
  • Step 2: Create Your Regex. Once you have determined your pattern, it’s time to create your regex. You can do this by using the syntax and notation elements described above to construct a search string.
  • Step 3: Test Your Regex. Once you have created your regex, it’s time to test it. This will allow you to see if your regex matches the strings you are searching for.
  • Step 4: Use Your Regex. Once you have tested and verified that your regex works, it’s time to use it. This can be done by using your regex in a tool or programming language that supports regexes.

Step 1: Determine the Pattern

When problem-solving, take time to think through the pattern and break it down into conditions. This will help you create a regex that is more specific and accurate, and save time on trial and error resulting from a lack of planning.

For example, say you want to match all strings that contain the word cat.

If you have any exposure to basic linguistics, it helps to think using those sorts of concepts.  In this case:

  • You’re looking for a specific sequence of three letters—not just any combination that includes those letters, or wherever any of those letters occur alone.
  • You want to find it whether it stands alone or constitutes a part of a longer word.
  • You know that cat is a common sequence of letters.
  • You know it can occur at the start, end, or in the middle of words.

This simplifies your capture requirements, but means you’ll need to consider some other parameters. For example:

  • Will you match strings with uppercase and lowercase letters?
  • Do punctuation or symbols affect the outcome?
  • What about strings that might contain multiple instances of the word cat (e.g. “Cats have cats”)?
  • Will you match strings that contain the word “cat” but have other words before or after it?

These parameters will help you determine your regex search pattern.

Breaking down the elements of your pattern requirements before writing code will help you write more effective regex faster.

Once your pattern is determined, you can move on to Step 2: Create Your Regex.

Step 2: Create Your Regex

Now that you know what type of string you’re looking for, it’s time to create the regex that will search for it.

Let's continue on our quest to match all strings that contain the word cat.

Your regex would be:

\w*cat\w*

This regex will match any word that contains the sequence “cat”, including “cat” on its own, “cats”, and “concatenate”. (With \w+ in place of \w*, at least one extra character would be required on each side, so “cat” itself would not match.) On the other hand, this regex would return many false positives:

\bc?a?t\b

Here the “c” and the “a” are both optional, so the pattern matches the whole words “cat”, “at”, “ct”, and even “t” on its own.

\bc?at\b

This regex is more specific: it matches only the whole words “cat” and “at”.

The \b indicates a word boundary, so the match must start and end at the edge of a word, and the question mark after the “c” makes that letter optional. To match only the whole word “cat”, drop the question mark and use \bcat\b. None of these patterns will match “Cat” unless the search is made case-insensitive.

When creating your regex search string, you’ll want to consider parameters like case sensitivity and multiple matches.

Case sensitivity: If you want to make your search case-insensitive, you can add the i flag at the end of your regex.

/\w*cat\w*/i

The / marks indicate the start and end of the regex, while the i at the end is an optional flag that makes your search case-insensitive (so it will match both upper and lowercase letters).

Multiple instances: A regex finds the first match by default. To find every instance of a pattern in a string, most languages offer a global search, such as the g flag in JavaScript or re.findall() in Python.

If you instead want to match only strings that contain at least two instances of “cat”, you can use the dot (.), which matches any character, with the asterisk (*) quantifier to allow any text in between:

cat.*cat

Combined with the case-insensitive i flag, this regex will match strings such as “Cats have cats”.
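To list every instance rather than just test for one, one option in Python is re.findall. This is a minimal sketch; it uses \w*cat\w* so that each match is the whole word containing “cat”.

```python
import re

text = "Cats have cats"

# findall returns every non-overlapping match; IGNORECASE covers "Cats" and "cats".
words = re.findall(r"\w*cat\w*", text, flags=re.IGNORECASE)
print(words)       # ['Cats', 'cats']
print(len(words))  # 2
```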

You can also use the plus sign (+) to indicate that one or more instances of a certain character must be present in order for it to be matched.

For example, this regex will match strings such as “caty” and “catty”, which contain one or more “t” characters before the “y”, but not “cay”:

cat+y

Exact pattern matching: If you want to look for an exact pattern, you can use the caret (^) and dollar sign ($) as anchors to mark the start and end of a string.

For example, the following regex will match an exact string of “cat”:

^cat$

Now that you have created your regex search string, it’s time to move on to Step 3: Test and Tweak Your Regex.

Step 3: Test and Tweak Your Regex

Once you’ve created your regex search string, the next step is to test it. This will allow you to see if your regex matches the strings you are searching for.

You can use tools like RegExr and Regex101 to test your regexes in real-time and make sure they match the right data.

Testing your regex will help you identify any errors or problems with your pattern before writing code, which can save you time and effort in the long run.

If you find that your regex is not working as expected, don’t be afraid to tweak it. Regular expressions can be written in a variety of ways and it’s often a matter of trial and error to find the one that works for your data.

Once you have tested your regex and tweaked it until it matches what you’re looking for, you can move on to Step 4: Use the Regex in Your Code.

Step 4: Use Your Regex in Your Code

Now that you have your regex written and tested, the next step is to use it in your code. Depending on the language you are using, there are different ways to implement a regex search.

For example, in JavaScript you would use the String.prototype.match() method:

const str = 'cat';
const regexp = /\w*cat\w*/i;
const result = str.match(regexp); // null if there is no match
console.log(result[0]); // Output: cat

In Python, you would use the re.search() function:

import re
string = 'cat'
regexp = r'\w*cat\w*'
result = re.search(regexp, string)  # None if there is no match
print(result.group())  # Output: cat

This way, you can use your regex search in any part of your code that requires pattern matching. Most languages and tools support regexes, including JavaScript, Python, Java, and PHP.


Using regex can be a powerful way to solve problems involving text parsing and search-and-replace operations.

Exercise 2: Matching email addresses with regex

Now that you’ve seen how to use regular expressions for string matching, let’s try using one to match email addresses.

The goal here is to write a regular expression that will match valid email addresses.

Say you want to capture all strings that include an email address. To write this in regex, it would look like this:

\w+\@\w+\.\w{3}

The above regex will match strings that contain a username, followed by an “at” symbol (@), followed by a domain name and top-level domain (such as .com or .net).

Let's walk through the notation that achieves this result.

  1. The \w+ token matches one or more word characters: letters, numbers, and underscores. This allows us to capture the username portion of an email address.
  2. After that comes the “at” symbol (@), which is escaped with a backslash (\). The “at” symbol has no special meaning in most regex flavors, so the escape is harmless but optional.
  3. Following this is another \w+, which captures the domain name portion of an email address.
  4. Finally, after an escaped dot (\.) that matches a literal period, the last sequence, \w{3}, matches exactly three word characters, the length of common top-level domains (TLDs) such as com or net.

Let's consider how we could improve this pattern by considering edge cases and improving its ability to accurately select valid email addresses. Here are some examples of what we want our regex to match:

[email protected]
[email protected]
[email protected]

And here are some examples of strings that should not be matched:

notvalid.com
another [email protected]@domain.net
not valid again!%$#$

In the above examples, we can see that usernames and domains can include special characters such as hyphens (-), underscores (_), and plus signs (+). To account for these variations, we would need to add to our regex pattern.

To make our regex more accurate and capture the examples we want, let's add some additional characters to specify what should be matched.

This time, our regex pattern should look like this:

\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+

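The additional group we added, ([.-]?\w+)*, allows an optional dot or hyphen followed by more word characters, repeated any number of times, in the username or domain name portion of an email address. We also added \w{2,3}, which specifies that the TLD should be two to three characters long.

Now, when you test this regex against whole strings (for example, by wrapping it in ^ and $), it will accept addresses that use dots, hyphens, and underscores and reject the invalid examples above. It still does not accept plus signs in the username or TLDs longer than three characters, so it is a starting point rather than a complete validator.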

Let's break down the pattern:

  • \w+ matches a sequence of word characters (letters, numbers, and underscores) that is at least one character long.
  • ([.-]?\w+)* allows an optional dot or hyphen followed by more word characters, any number of times, in the username portion of an email address.
  • @ matches the literal “at” symbol.
  • \w+ matches a sequence of word characters that is at least one character long, which begins the domain name.
  • ([.-]?\w+)* allows an optional dot or hyphen followed by more word characters, any number of times, in the domain name portion of an email address.
  • (\.\w{2,3})+ matches one or more dot-separated endings of two to three characters each, such as .com or .co.uk.

With this pattern, you can match many common email addresses and reject obviously malformed ones. Full email validation is considerably more involved, but widening the pattern step by step like this makes it more accurate and reliable in parsing strings.
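The improved pattern can be checked with a short Python sketch. The addresses below are hypothetical samples rather than the ones listed earlier, and re.fullmatch is used so that the whole string has to be an address, which is an assumption about how the pattern would be applied.

```python
import re

# The improved pattern from above, used with fullmatch so the whole string
# must be an address (similar to wrapping it in ^ and $).
EMAIL = re.compile(r"\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+")

def looks_like_email(text):
    """Return True if the entire text fits the pattern."""
    return EMAIL.fullmatch(text) is not None

# Hypothetical sample strings, chosen for illustration.
print(looks_like_email("jane.doe@example.com"))         # True
print(looks_like_email("first-last@mail.example.org"))  # True
print(looks_like_email("notvalid.com"))                 # False: no @
print(looks_like_email("a@b@domain.net"))               # False: two @ signs
print(looks_like_email("jane+news@example.com"))        # False: + is not covered
```

The last line shows a limit of the pattern: a plus sign in the username is legal in real addresses, but \w, the dot, and the hyphen do not cover it.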

Using quantifiers in regex

One of the most useful aspects of regex is the ability to use quantifiers. Quantifiers indicate how many times a character, group, or pattern should be matched in a string.

For example, if you want to match any number from 0-99, you could use the following regex:

\d{1,2}

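This will match any one- or two-digit number. Because the pattern is not anchored, it will also match the first two digits of a longer number such as 123; to match only standalone numbers, add anchors (^\d{1,2}$) or word boundaries (\b\d{1,2}\b).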

There are several types of quantifiers available in regex. These include:

Question mark (?) - 0 or 1 occurrence

\d?

Asterisk (*) - 0 or more occurrences

\d*

Plus sign (+) - 1 or more occurrences

\d+

Curly brackets ({m,n}) - m to n occurrences

\d{2,4}

Quantifiers make it much easier to match patterns in a string. They allow you to quickly specify how many times a character or group should be matched, making your regex more succinct and accurate.
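The quantifiers above can be compared side by side in a small Python sketch. The helper name matches_whole is made up for this example; it uses re.fullmatch so that each pattern has to account for the entire input.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"\d?", ""))           # True: zero digits allowed
print(matches_whole(r"\d+", ""))           # False: at least one digit required
print(matches_whole(r"\d*", "2024"))       # True
print(matches_whole(r"\d{2,4}", "7"))      # False: fewer than two digits
print(matches_whole(r"\d{2,4}", "12345"))  # False: more than four digits
print(matches_whole(r"\d{1,2}", "99"))     # True
```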

Using lookaheads and lookbehinds

Lookaheads and lookbehinds are another powerful tool available in regex. They allow you to check for a certain pattern before or after the main expression without actually including it in the match.

For example, if you want to match the phrase “ is tasty” only when it comes directly after the word “apple”, you could use the following regex:

(?<=apple)\sis\stasty

In this regex pattern, the lookbehind (?<=apple) checks that “apple” appears immediately before the match, but “apple” itself is not part of the matched text. This makes it much easier to match a pattern based on its context.

Lookaheads work similarly to lookbehinds, but they check for a certain pattern after the main expression instead of before it. For example, if you want to match the word “apple” only when it is followed by the phrase “is tasty”, you could use the following regex:

apple\s(?=is\stasty)

In this regex pattern, the lookahead (?=is\stasty) checks that “is tasty” follows “apple” without including it in the match. This is a powerful way to match patterns based on their context.
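A minimal Python sketch shows exactly which text each of the two patterns returns. The sample sentence is made up; repr() is used so the spaces in the matched text are visible.

```python
import re

text = "This apple is tasty"

# Lookbehind: match " is tasty" only where "apple" comes right before it.
behind = re.search(r"(?<=apple)\sis\stasty", text)

# Lookahead: match "apple " only where "is tasty" comes right after it.
ahead = re.search(r"apple\s(?=is\stasty)", text)

print(repr(behind.group()))  # ' is tasty'
print(repr(ahead.group()))   # 'apple '

# The lookahead finds nothing when the required context is missing.
missing = re.search(r"apple\s(?=is\stasty)", "apple pie")
print(missing)  # None
```

In both cases the text inside the lookaround is checked but left out of the result.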

Quoting literally

Regex offers the ability to quote a string literally. Quoting a string means that you treat it as if it were an exact match, instead of using any special characters or regex syntax.

For example, if you wanted to match the phrase “apple is tasty” exactly, you could use the following regex:

\Qapple is tasty\E

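The \Q and \E markers tell the regex engine to treat everything between them literally, so that it doesn't interpret any special characters or regex syntax within it. This makes it much easier to match strings that contain special characters. Note that \Q...\E is supported in flavors such as Perl, PCRE, and Java, but not in JavaScript or Python, where you escape special characters individually or use a helper such as Python's re.escape().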
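Python's re module does not support \Q and \E, but re.escape() plays a similar role. The sketch below uses a made-up string that contains several special characters.

```python
import re

# A literal string containing characters that are special in regex syntax.
literal = "price (USD): $4.99"

# re.escape backslash-escapes the special characters so the string matches itself.
pattern = re.escape(literal)

found = re.search(pattern, "Total price (USD): $4.99 today")
print(found.group())  # price (USD): $4.99

# Without escaping, "." matches any character, so "4x99" would also pass.
unescaped = bool(re.search(r"4.99", "4x99"))
escaped = bool(re.search(re.escape("4.99"), "4x99"))
print(unescaped)  # True
print(escaped)    # False
```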

Using anchors

Anchors allow you to match a pattern at the beginning or end of a string.

For example, if you wanted to match a word that starts with “a” at the very beginning of a string, you could use the following regex:

^a\w+

In this regex pattern, the caret (^) is an anchor that matches the beginning of a string. Similarly, if you wanted to match a word that ends with “s” at the very end of a string, you could use the following regex:

\w+s$

In this regex pattern, the dollar sign ($) is an anchor that matches the end of a string. To find such words anywhere in a string rather than only at its edges, use the word boundary \b instead, as in \ba\w+ or \w+s\b.

Using groups and backreferences

Groups allow you to treat part of a pattern as a single unit and capture the text it matches, and backreferences allow you to refer back to previously matched groups.

For example, if you wanted to match a string consisting of a single word that starts with “a” and ends with “s”, and capture that word, you could use the following regex:

^(a\w+s)$

In this regex pattern, the parentheses create a group that matches “a” followed by one or more word characters and then “s”. The matched text is stored as group 1 so that it can be retrieved or reused later.

Backreferences allow you to refer back to previously matched groups. For example, if you wanted to match a string in which such a word is immediately repeated, as in “appsapps”, you could use the following regex:

^(a\w+s)\1$

In this regex pattern, the backreference (\1) must match exactly the same text that the first group matched, not just the same pattern. This is a powerful way to match patterns that have multiple parts that need to be identical.
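A short Python sketch of what the backreference pattern accepts. The test strings are made up, and the repeated-word pattern at the end is an additional illustration rather than something from the text above.

```python
import re

# \1 must repeat exactly what group 1 matched.
doubled = re.compile(r"^(a\w+s)\1$")

print(bool(doubled.search("appsapps")))  # True: "apps" followed by "apps"
print(bool(doubled.search("appsants")))  # False: the second half differs
print(bool(doubled.search("apps")))      # False: nothing is repeated

# A common use of backreferences: finding a word typed twice in a row.
repeat = re.search(r"\b(\w+)\s+\1\b", "this is is a test")
print(repeat.group(1))  # is
```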

Matching digits with a character class

Character classes allow you to match any character from a predefined set of characters.

For example, if you wanted to match any digit (i.e., 0 - 9), you could use the following regex:

\d

In this regex pattern, the \d character class matches any digit from 0 - 9. This makes it much easier to match digits in a string.

Other character classes

There are many other character classes available for use in regex, and you can define your own with square brackets. For example, if you wanted to match any lowercase letter (a - z), you could use the following regex:

[a-z]

In this regex pattern, the bracketed range matches any single lowercase letter from a - z. Similarly, if you wanted to match any uppercase letter (A - Z), you could use the following regex:

[A-Z]

In this regex pattern, the range matches any single uppercase letter from A - Z. Note that \l and \u are not character classes in most regex flavors; in Perl, for instance, they change the case of the next character, and in JavaScript \u begins a Unicode escape. Character classes are a powerful and versatile way to match patterns in strings.

Using modifiers

Modifiers are another powerful tool available in regex. Modifiers allow you to change the behavior of a regular expression.

For example, if you wanted to match all words that start with “a” and end with “s”, regardless of case, you could use the following regex:

(?i)^(a\w+s)$

In this regex pattern, the (?i) modifier ignores the case of the characters (i.e., it will match “A” as well as “a”). This makes it much easier to match patterns in strings that may have different cases.

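Other modifiers are available for use in regex, such as the (?s) modifier, which lets the dot (.) match line breaks, and the (?m) modifier, which lets ^ and $ match at the start and end of each line. Support for inline modifiers varies by flavor; in JavaScript, flags are normally written after the closing slash instead.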

Using alternation to create complex patterns

Alternation allows you to match one of several patterns. It is denoted by the use of the pipe (|) character.

For example, if you wanted to match either “cat” or “dog”, you could use the following regex:

cat|dog

In this regex pattern, the pipe character is used to match either “cat” or “dog”. This makes it much easier to match complex patterns in strings.

Alternation can also be used with groups and character classes. For example, if you wanted to match either the word “cat” or any word that starts with “d” and ends with “g”, you could use the following regex:

\b(cat|d\w+g)\b

In this regex pattern, the parentheses limit the alternation to the two options inside them, and the word boundaries (\b) apply to both. Without the group, as in cat|^(d\w+g)$, the pipe would split the entire pattern in two, so the anchors would apply only to the second alternative.
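The pipe has the lowest precedence of any regex operator, which affects how anchors and alternatives combine. The Python sketch below compares an ungrouped pattern with a grouped one; the variable names and sample strings are made up, and a non-capturing group (?:...) is used so that findall returns whole matches.

```python
import re

# Alternation has the lowest precedence, so this means:
# "cat" anywhere, OR a whole string that starts with d and ends with g.
loose = re.compile(r"cat|^(d\w+g)$")

# Grouping the alternatives and adding word boundaries limits both
# branches to whole words.
words = re.compile(r"\b(?:cat|d\w+g)\b")

print(bool(loose.search("concatenate")))     # True: "cat" inside a longer word
print(bool(loose.search("my dog")))          # False: "dog" is not the whole string
print(words.findall("my dog chased a cat"))  # ['dog', 'cat']
print(words.findall("concatenate"))          # []
```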

Negative and positive matching

Negative matching allows you to require that a pattern is not present at a given position, while positive matching allows you to require that it is present. Both are done with lookaheads, which check the text without consuming it.

For example, if you wanted to match any word that does not start with “a”, you could use the following regex:

^(?!a)\w+$

In this regex pattern, the negative lookahead (?!a) is used to match any word that does not start with “a”. This makes it much easier to match patterns in strings that may have certain characters.

Similarly, if you wanted to match any word that does start with “a”, you could use the following regex:

^(?=a)\w+$

In this regex pattern, the positive lookahead (?=a) is used to match any word that starts with “a”. This makes it much easier to match patterns in strings that may have certain characters.
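The two patterns can be compared on the same inputs with a short Python sketch; the sample words are made up.

```python
import re

not_a = re.compile(r"^(?!a)\w+$")   # must not start with "a"
with_a = re.compile(r"^(?=a)\w+$")  # must start with "a"

results = {
    word: (bool(not_a.search(word)), bool(with_a.search(word)))
    for word in ["banana", "apple"]
}
for word, (negative, positive) in results.items():
    print(word, negative, positive)
# banana True False
# apple False True
```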

Matching across multiple lines

When using regex, it is sometimes necessary to match patterns across multiple lines. By default, the dot (.) character matches any character except a line break; the s flag (often called “dotall” or “single-line” mode) makes it match line breaks as well.

For example, if you wanted to match any text that starts with “The” and ends with “end”, regardless of how many lines it spans, you could use the following regex:

(?s)^The.*end$

In this regex pattern, the (?s) modifier lets the dot (.) match any character, including line breaks. This makes it much easier to match patterns in strings that span multiple lines.
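In Python's re module, the dot only matches a line break when the DOTALL flag is set. The sketch below shows the difference on a made-up three-line string.

```python
import re

text = "The story\nruns over two lines\nto the end"

# By default the dot stops at a line break, so the pattern cannot span lines.
default = re.search(r"^The.*end$", text)

# With DOTALL (the "s" flag), the dot also matches "\n".
dotall = re.search(r"^The.*end$", text, flags=re.DOTALL)

print(default is None)         # True
print(dotall.group() == text)  # True
```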

Using recursive patterns

Recursive patterns allow you to match nested, repeating structures in strings, such as balanced parentheses. They are available in some flavors, including PCRE and Perl, through constructs such as (?R). Simpler tag matching does not need recursion.

For example, if you wanted to match any HTML tag with the attribute “class”, regardless of its contents, you could use the following regex:

<[^>]*class="[^"]*"[^>]*>

In this regex pattern, the negated character classes [^>] and [^"] keep the match inside a single tag and a single attribute value. No recursion is involved; a recursive pattern is only needed when a structure can contain copies of itself.

Basic regex examples

Now that you have a basic understanding of the tools available in regex, let’s look at some basic examples.

Matching a North American phone number

If you wanted to match a North American Phone number, you could use the following regex:

^\d{3}-\d{3}-\d{4}$

In this regex pattern, the \d character class matches any digit from 0 - 9. The {3} and {4} quantifiers specify that the pattern should match exactly 3 or 4 digits, respectively. This matches phone numbers written in the form 555-123-4567; the ^ and $ anchors mean the string must contain nothing else.

Matching a date in YYYY-MM-DD format

If you wanted to match a date in YYYY-MM-DD format, you could use the following regex:

^\d{4}-\d{2}-\d{2}$

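In this regex pattern, the \d character class matches any digit from 0 - 9. The {4}, {2}, and {2} quantifiers specify that the pattern should match exactly 4, 2, and 2 digits respectively. The pattern checks only the shape of the date, not whether the month and day are in range.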
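Since the pattern checks only the shape, one possible approach is to pair it with a calendar check. The sketch below uses Python's datetime.strptime for that; the helper name is_real_date is made up for this example.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_real_date(text):
    """Check the shape with the regex, then let datetime check the calendar."""
    if not DATE_SHAPE.match(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(bool(DATE_SHAPE.match("2023-02-30")))  # True: the shape is right
print(is_real_date("2023-02-30"))            # False: February has no 30th
print(is_real_date("2023-02-28"))            # True
```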

Matching an IP address

If you wanted to match an IP address, you could use the following regex:

^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$

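In this regex pattern, the \d character class matches any digit from 0 - 9. The {1,3} quantifiers specify that each part should contain 1 to 3 digits. This matches the shape of an IPv4 address, but it does not check that each part is in the range 0 - 255, so it also accepts strings such as 999.999.999.999.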
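A pattern made only of \d{1,3} parts cannot check numeric ranges, so one option is to follow the regex with a range check in code. This is a minimal Python sketch; the helper name is_ipv4 is an assumption for illustration.

```python
import re

IP_SHAPE = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")

def is_ipv4(text):
    """Shape check with the regex, then a numeric range check on each part."""
    if not IP_SHAPE.match(text):
        return False
    return all(0 <= int(part) <= 255 for part in text.split("."))

print(bool(IP_SHAPE.match("999.1.1.1")))  # True: shape only
print(is_ipv4("999.1.1.1"))               # False: 999 is out of range
print(is_ipv4("192.168.0.1"))             # True
```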

Finding valid credit card numbers

If you wanted to find valid credit card numbers, you could use the following regex:

^\d{16}$|^\d{4}\s\d{4}\s\d{4}\s\d{4}$

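In this regex pattern, the \d character class matches any digit from 0 - 9. The {16} and {4} quantifiers specify that the pattern should match exactly 16 or 4 digits respectively. This matches the common 16-digit layout, with or without spaces between groups of four; it checks the format only and cannot tell whether a card number is actually valid.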

Match all HTML tags

If you wanted to match all HTML tags, you could use the following regex:

<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|["']?(.*?)["']?))?)+\s*|\/?)>

In this regex pattern, the \w character class matches any letter, digit, or underscore. The +, *, and ? quantifiers specify that the pattern should match one or more of the preceding element, zero or more of the preceding element, and zero or one of the preceding element respectively. This allows it to match most HTML tags in a string.


Regex Resources

Once you’re familiar with the basics of regexes, there are countless resources online that can help you refine your skills. Here are a few to get you started:


Regular Expressions Cookbook: Detailed Solutions in Eight Programming Languages

Take the guesswork out of using regular expressions. With more than 140 practical recipes, this cookbook provides everything you need to solve a wide range of real-world problems. Novices will learn basic skills and tools, and programmers and experienced users will find a wealth of detail. Each recipe provides samples you can use right away.


Now that you’ve got the basics of regex down, there are several tools available to help you work with them. Here are a few of the most popular:

  • RegexBuddy – An interactive tool for designing and testing regular expressions.
  • RegexPal – An online tool for testing and debugging regexes.
  • RegexMagic  – An automated regex builder for creating complex patterns.
  • Sublime Text – A text editor that includes support for working with regular expressions.
  • Visual Studio Code – An advanced code editor from Microsoft that includes built-in regex support.

Conclusion

Regular expressions are a powerful tool for manipulating text and can be used for a variety of tasks, from validating data to searching large amounts of text quickly.

By understanding the components and concepts that make up regular expressions, you can create powerful regexes to find exactly what you’re looking for. There are also many tools available to help you write, test, and debug regexes.

With the right tools and knowledge, you can easily master the art of using regular expressions.