RegEx in Ruby: Scratching the surface

Jeremy Armah
Nov 10, 2020

Have you ever needed to find or replace characters in a string? Let’s say that we receive a huge block of string data. It has many data points in it but is still considered one piece of data because it is a single string. How can we clean up this data so we can use it the way we want to? A regular expression, or regex, is a tool that allows us to search for patterns in strings. We can find these patterns or replace them. Regex is not a Ruby-exclusive tool.

Regex’s main uses are data validation, searching, mass file renaming, and finding records within databases. It is a very powerful tool, so it can make our coding lives much easier, but a careless pattern can also have the opposite effect. Let’s look at how we can use this tool without harming our data.

Regex Basics

Regular expressions have a very specific syntax, and it can look pretty complicated to beginners. Let’s ease our way into this topic so we don’t get overwhelmed.

In Ruby, we create a regular expression by placing the pattern we want to search for between slashes, like /pattern/. The two main Ruby methods for regex are match and scan. What is the difference between these two methods?

  • Scan returns an array of all the items that match the /pattern/
  • Match returns the first item in the string that matches as a MatchData object. If no match is found the method returns nil
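To make the difference concrete, here is a small sketch. The sample text and variable names are made up for illustration; only match and scan come from Ruby itself.

```ruby
text = "red fish, blue fish"

first = text.match(/fish/)   # MatchData for the first occurrence only
all   = text.scan(/fish/)    # Array holding every occurrence
none  = text.match(/shark/)  # nil, because nothing matched

puts first[0]        # the matched text: fish
puts first.begin(0)  # index where the first match starts: 4
p all                # ["fish", "fish"]
p none               # nil
```

Because match can return nil, it is worth checking the result before calling methods on it.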

Patterns

Literal Characters

The simplest form of regex is to match a single letter or word.
For example:

string = "Hello world, it's me!"
string.match(/world/) # => #<MatchData "world">

We can also scan to return an array of matches found in a string.

string = "Hello world, it's me!"
string.scan(/world/) # => ["world"]

This is not too helpful, though. We could just use the .include? method to check whether the word is within the string.

Let’s explore the use of classes, ranges, and more to find more complicated matches in strings.

Character Classes

A character class matches any one character from a set of allowed characters. If we want to find vowels in a string, we can use a character class like [aeiou].

string = "Hello world, it's me!"
string.scan(/[aeiou]/) # => ["e", "o", "o", "i", "e"]

This returns each lowercase vowel it finds in the string.
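A character class is case sensitive, and it can be negated by putting ^ right after the opening bracket. Here is a quick sketch of both ideas, using a made-up sample string:

```ruby
fruit = "Apple and Orange"

# Lowercase vowels only: the capital A and O are skipped.
lowercase_vowels = fruit.scan(/[aeiou]/)

# Listing both cases inside the brackets picks up the capitals too.
all_vowels = fruit.scan(/[aeiouAEIOU]/)

# A leading ^ negates the class: anything that is NOT a vowel or whitespace.
consonants = fruit.scan(/[^aeiouAEIOU\s]/)

p lowercase_vowels  # ["e", "a", "a", "e"]
p all_vowels        # ["A", "e", "a", "O", "a", "e"]
p consonants        # ["p", "p", "l", "n", "d", "r", "n", "g"]
```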

Ranges

Ranges allow us to match any character within a range. The most common ranges are [0-9] and [a-z]. Instead of typing out these ranges in brackets, we can use shorthand versions.
- \w matches any word character: [a-z], [A-Z], [0-9], and the underscore.
- \d matches any digit, the same as [0-9].

string = "Hello world, it's me!"
string.scan(/\w/) # => ["H", "e", "l", "l", "o", "w", "o", "r", "l", "d", "i", "t", "s", "m", "e"]

This is looking at the string and returning every valid match. Again, scan puts out an array.
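The \d shorthand and custom ranges work the same way. The sketch below uses a made-up sample string to show that \d behaves like [0-9], that \w also accepts the underscore, and that we can write our own range such as [a-c].

```ruby
sample = "a_b-c 42"

word_chars = sample.scan(/\w/)     # letters, digits, and the underscore
digits     = sample.scan(/\d/)     # digits only
same_thing = sample.scan(/[0-9]/)  # the long form of \d
a_to_c     = sample.scan(/[a-c]/)  # a custom range of letters

p word_chars  # ["a", "_", "b", "c", "4", "2"]
p digits      # ["4", "2"]
p same_thing  # ["4", "2"]
p a_to_c      # ["a", "b", "c"]
```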

Quantifiers

Quantifiers let us say how many times a pattern should repeat. A very common quantifier is +.
+ matches one or more of the preceding pattern, so the match keeps going for as long as the next character still fits.

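string = "Hello world, it's me!"
string.scan(/\w+/) # => ["Hello", "world", "it", "s", "me"]

Now, instead of returning each matching character separately, scan returns each run of consecutive word characters and stops where the pattern is broken. That is why “it’s” comes back as “it” and “s”: the apostrophe is not a word character.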
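The other quantifiers follow the same idea. Here is a small sketch with made-up sample strings, showing ? (zero or one), * (zero or more), and the curly brace counts:

```ruby
text = "color colour 2020-11-10 7"

spellings  = text.scan(/colou?r/)   # the u is optional
any_number = text.scan(/\d+/)       # one or more digits in a row
four_digit = text.scan(/\d{4}/)     # exactly four digits
two_to_four = text.scan(/\d{2,4}/)  # between two and four digits

# * allows zero repeats, so a lone "a" still matches.
a_then_bs = "a ab abb".scan(/ab*/)

p spellings    # ["color", "colour"]
p any_number   # ["2020", "11", "10", "7"]
p four_digit   # ["2020"]
p two_to_four  # ["2020", "11", "10"]
p a_then_bs    # ["a", "ab", "abb"]
```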

Anchors

Anchors match positions rather than characters: the spot before, after, or in between characters. A common anchor, \b, looks for a word boundary (the beginning or end of a word).

string = "This string is awesome."
string.scan(/\b\w/) # => ["T", "s", "i", "a"]

Let’s combine what we know about quantifiers with this search.

string = "This string is awesome."
string.scan(/\b\w+/) # => ["This", "string", "is", "awesome"]

Combining even more, we can return only the words that start with a vowel.

string = "This string is awesome."
string.scan(/\b[aeiou]\w+/) # => ["is", "awesome"]
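A few more anchor variations, as a rough sketch. The second sample string is made up; \A and \z are Ruby’s anchors for the very start and very end of a string. Note that /\b[aeiou]\w+/ needs at least two characters, so a one-letter word like “a” is skipped unless we swap + for *.

```ruby
sentence = "This string is awesome."

# \b at the end of the pattern pins the match to the end of a word.
ing_words = sentence.scan(/\w+ing\b/)

# \A and \z anchor to the start and end of the whole string.
starts_with_this = sentence.match?(/\AThis/)
ends_with_period = sentence.match?(/\.\z/)

short = "a owl is awake"
two_or_more = short.scan(/\b[aeiou]\w+/)  # misses the one-letter word
one_or_more = short.scan(/\b[aeiou]\w*/)  # * lets the word be one letter

p ing_words         # ["string"]
p starts_with_this  # true
p ends_with_period  # true
p two_or_more       # ["owl", "is", "awake"]
p one_or_more       # ["a", "owl", "is", "awake"]
```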

Another Example

Let’s look at how we can use regex in a more real-world example.

We have a string that contains multiple emails. We need to be able to separate each unique email from one another. The email string looks like this:

emails = "jeremy342@test.com, chris563@test.com greg234@mail.com, test123@giraffe.com"

We can’t use a plain string split here because some emails are separated by a comma and a space, and others by a space alone. We need to split on both at the same time. Regex is perfect for this.

If we try to split normally like:

emails.split(", ") we would get
["jeremy342@test.com", "chris563@test.com greg234@mail.com", "test123@giraffe.com"]

That’s no good. The Chris and Greg emails are still together. If we split on spaces instead, the commas would stay attached to the emails.

Using regex within our split expression we can separate by both spaces and commas.

emails.split(/\s|, /) would give us
["jeremy342@test.com", "chris563@test.com", "greg234@mail.com", "test123@giraffe.com"]

Perfect! Let’s understand what is happening here.

  • \s matches any whitespace character, such as a space. This handles one split condition.
  • | (the pipe) means “either or”. It lets us add a second condition for the split.
  • , followed by a space is matched literally. Anywhere a comma and a space appear together, we get a match.

Because we are running this through a split method, wherever there is a match the string will be split.
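The pattern above works for this exact string, but it is a bit fragile. As a sketch of one alternative (not the only way to do it), a character class with a + quantifier treats any run of commas and whitespace as a single separator. The messy string below is a made-up example.

```ruby
emails = "jeremy342@test.com, chris563@test.com greg234@mail.com, test123@giraffe.com"

by_alternation = emails.split(/\s|, /)
by_class       = emails.split(/[\s,]+/)  # any run of whitespace or commas

# Made-up input: a comma with no space, and two spaces in a row.
messy = "a@test.com,b@test.com  c@test.com"
messy_alternation = messy.split(/\s|, /)
messy_class       = messy.split(/[\s,]+/)

p by_alternation == by_class  # true: both give the four emails
p messy_alternation           # ["a@test.com,b@test.com", "", "c@test.com"]
p messy_class                 # ["a@test.com", "b@test.com", "c@test.com"]
```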

This is another simple example of how regex can be used to make data easier to work with.
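Replacing works much the same way as finding. A minimal sketch, using Ruby’s gsub (replace every match) and sub (replace only the first match) on made-up sample strings:

```ruby
note = "Call 555-1234 or 555-9876"

masked     = note.gsub(/\d/, "#")    # every digit becomes #
first_only = note.sub(/\d+/, "XXX")  # only the first run of digits changes

# Collapse any run of whitespace down to a single space.
tidy = "too    many   spaces".gsub(/\s+/, " ")

puts masked      # Call ###-#### or ###-####
puts first_only  # Call XXX-1234 or 555-9876
puts tidy        # too many spaces
```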

I hope this small post helps shed some light on a tool that seems very complicated at first glance. Below I have listed some of the most common regex patterns that we can use:

. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)

\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a Line (in Ruby, use \A for the beginning of the string)
$ - End of a Line (in Ruby, use \z for the end of the string)
[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group

* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
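To see a few of these working together, here is a rough sketch that uses groups to pull an email apart. The pattern is deliberately simple and is only an illustration; it is not a real email validator.

```ruby
email = "jeremy342@test.com"

# Three groups: the name, the domain, and the ending.
# \A and \z make sure the whole string has to fit the pattern.
pattern = /\A(\w+)@(\w+)\.(\w+)\z/

parts = email.match(pattern)
puts parts[1]  # jeremy342
puts parts[2]  # test
puts parts[3]  # com

# A string that does not fit the pattern gives nil.
p "not an email".match(pattern)  # nil
```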
