Using Regular Expressions in Big Data

A regular expression, also known as a regex, is a powerful yet often confusing tool. Although regular expressions are the technology behind text replacement and natural language processing, they are hard to read and even harder to write. Running regular expressions on Big Data is more difficult still, because it takes a while before you get any results and find out whether the regular expression was correct. This post explains what regular expressions are and how to use them when processing Big Data.

Regular Expressions 101

A regular expression is a sequence of characters that describes a pattern to match in strings. For example, to find all numbers in a block of text, use the regular expression \d+

  • \d matches a single digit and could also be written as [0-9]

  • The plus character means we’re looking for one or more consecutive occurrences of the preceding element

Running the above regex on the lyrics of “In The Year 2525” would return: 2525, 3535, 4545, 5555, 6565, 7510, 8510, 9595.
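
As a quick illustration, here is a minimal Python sketch of the same pattern using the standard re module. The sample sentence is made up for this example and is not taken from the song.

```python
import re

# Hypothetical sample text; any string containing digits would do.
text = "Order 66 shipped 12 boxes on day 2525"

# \d+ matches each run of one or more consecutive digits.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['66', '12', '2525']
```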

More regular expression operators:

  • . -  match any character except newline

  • * - match 0 or more consecutive occurrences of the preceding element

  • ^ - match the start of a line

  • $ - match the end of a line

  • [qwerty] - match any of the characters in the square brackets

  • [^asdf] - match any character except the ones in the square brackets

  • \w - match a word character, could also be written as [A-Za-z0-9_]

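  • \W - match a non-word character, i.e. any character not matched by \w (note that \b is different: it matches a word boundary, the position between a word character and a non-word character)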

  • \s - match a whitespace character
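
To see a few of these operators in action, the following Python sketch runs them against a made-up log line. The sample string and variable names are assumptions chosen for illustration only.

```python
import re

# Hypothetical log line containing a tab character.
line = "user_42 logged in\tat 09:15"

words = re.findall(r"\w+", line)              # runs of word characters
first_word = re.search(r"^\w+", line).group() # word at the start of the line
last_digits = re.search(r"\d+$", line).group()  # digits at the end of the line
greedy = re.search(r"l.*d", line).group()     # 'l', any characters, then 'd'
vowels = re.findall(r"[aeiou]", "regex")      # any character from the set
digits_only = re.sub(r"[^0-9]", "", line)     # drop everything that is not a digit
fields = re.split(r"\s+", line)               # split on runs of whitespace

print(words)        # ['user_42', 'logged', 'in', 'at', '09', '15']
print(first_word)   # user_42
print(last_digits)  # 15
print(greedy)       # logged
print(vowels)       # ['e', 'e']
print(digits_only)  # 420915
print(fields)       # ['user_42', 'logged', 'in', 'at', '09:15']
```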

Example regular expressions:

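  • [0-9a-f]+ matches lowercase hex strings such as ff02d4ee (with * instead of +, the pattern would also match an empty string)

  • [a-z0-9_-]+ matches usernames such as lady-gaga_2014 (without the +, it would match only a single character)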

  • (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]) matches dates in year-month-day order, separated by a hyphen, space, slash, or dot, e.g. 2012-01-21, 1980/03/23, 1948 04-24
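
The example patterns can be tried out with a short Python sketch like the one below. It assumes the hex and username patterns are written with + so that at least one character is required, and it uses fullmatch so the whole input must fit the pattern. The helper name matches is hypothetical.

```python
import re

HEX = re.compile(r"[0-9a-f]+")
USERNAME = re.compile(r"[a-z0-9_-]+")
DATE = re.compile(
    r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])"
)

def matches(pattern, text):
    """Return True if the whole string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

print(matches(HEX, "ff02d4ee"))             # True
print(matches(HEX, ""))                     # False: + requires one character
print(matches(USERNAME, "lady-gaga_2014"))  # True
print(matches(USERNAME, "Lady Gaga"))       # False: uppercase and space
print(matches(DATE, "1980/03/23"))          # True
print(matches(DATE, "1948 04-24"))          # True
print(matches(DATE, "2012-13-21"))          # False: there is no month 13
```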


Testing Regular Expressions

Before you jump the elephant and use regular expressions in Big Data, do some testing. It’s really easy to get regex syntax wrong or formulate an incorrect expression. So use tools such as regexpal, take a small chunk of data, write the desired regex, and see if it works as expected.
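
If you prefer testing in code rather than in an online tool, a small harness along these lines may help. It is only a sketch: the function name check_pattern and the sample strings are made up, and Python's regex dialect can differ in details from the one used by your Big Data platform.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the samples for which the pattern behaves unexpectedly."""
    compiled = re.compile(pattern)
    failures = []
    for sample in should_match:
        if compiled.fullmatch(sample) is None:
            failures.append(sample)
    for sample in should_not_match:
        if compiled.fullmatch(sample) is not None:
            failures.append(sample)
    return failures

good = ["2525", "7"]   # strings the regex is expected to accept
bad = ["", "25a"]      # strings the regex is expected to reject

# A first draft using * wrongly accepts the empty string.
print(check_pattern(r"\d*", good, bad))  # ['']

# Switching to + fixes it: no unexpected results.
print(check_pattern(r"\d+", good, bad))  # []
```

Running a handful of positive and negative samples like this takes seconds, whereas finding the same mistake on a cluster can cost a full job run.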


Regex with Big Data

In Xplenty, regular expressions can be used as part of either a filter or a select component. They are executed via the function REGEX_EXTRACT(string_expression, regExp, index). This function returns the match as a string, or null if there is no match. The function receives the following parameters:

  • string_expression - any string expression - the field name which should be used as input for the regular expression, a literal value, or function call.

  • regExp - the regular expression

    • Should be surrounded by single quotes; single quotes within the regex should be escaped (e.g. \')

    • To return a group from within the matching pattern, surround the relevant part by parentheses

    • All backslashes should be escaped - to match a sequence of digits enter '(\\d+)'; to match a literal backslash character, double the escaping again and enter '\\\\'

  • index - which captured group should be returned - for instance, 3 returns the third parenthesized group of the regular expression, while 0 returns the entire match rather than only a single group

Example use cases:

  • REGEX_EXTRACT('213.131.343.135:5020', '(.*)\\:(.*)', 1) returns '213.131.343.135'

  • REGEX_EXTRACT('213.131.343.135:5020', '(.*)\\:(.*)', 2) returns '5020'

  • REGEX_EXTRACT('/user/superman/cape', '/user/(.*)/', 1) returns 'superman'
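
To experiment with these examples locally, the behavior described above can be approximated in Python. This is a rough stand-in written for illustration, not Xplenty's implementation: the function regex_extract is hypothetical, it assumes the pattern may match anywhere in the input, and it uses Python raw strings, so backslashes are written once rather than doubled.

```python
import re

def regex_extract(string_expression, reg_exp, index):
    """Return the requested group of the first match, or None if no match.

    index 0 is the entire match; index must refer to an existing group.
    """
    if string_expression is None:
        return None
    match = re.search(reg_exp, string_expression)
    if match is None:
        return None
    return match.group(index)

print(regex_extract("213.131.343.135:5020", r"(.*)\:(.*)", 1))  # 213.131.343.135
print(regex_extract("213.131.343.135:5020", r"(.*)\:(.*)", 2))  # 5020
print(regex_extract("213.131.343.135:5020", r"(.*)\:(.*)", 0))  # 213.131.343.135:5020
print(regex_extract("/user/superman/cape", r"/user/(.*)/", 1))  # superman
print(regex_extract("no colon here", r"(.*)\:(.*)", 1))         # None
```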


Xplenty Regex Tutorial

  1. Open the relevant package or create a new one

  2. Add or open a filter component:

    • Enter the relevant field or function on the left
    • Choose ‘text matches’ in the operator drop-down in the center
    • Enter the pattern on the right - if you are looking for a pattern that could be found anywhere in the string, surround the pattern with .* from both sides
    • Click okay
  3. Add or open a select component:

    • In the relevant field textbox, hold down the shift key and double click the mouse to open the expression editor
    • Enter REGEX_EXTRACT (hit CTRL + spacebar for autocomplete) with the relevant parameters
    • Click done. If there are any parsing errors, please go over the syntax and make sure it’s correct. Note that the regular expression’s syntax is not checked at this point, only the function syntax.
    • Click okay.
  4. Verify the package by clicking the checkmark button on the top right of the package editor. If there are any errors, re-open the relevant component and re-check the syntax.

If the regex is malformed, you may receive a Java IOException in RegexExtract when running the job on a cluster:

Caused by: java.io.IOException: RegexExtract : Mal-Formed Regular expression : userId=([^&*)
...

If this happens, re-check that the regular expression syntax is valid. Test it on the side with a small chunk of data as mentioned above.
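
One cheap way to catch this class of error before submitting a job is to compile the pattern locally first. The sketch below does this in Python; the helper is_valid_regex is hypothetical, the corrected pattern is only a guess at what the author of the failing regex intended, and since Python and Java regex syntax differ in some details, a pattern that passes here is not guaranteed to pass on the cluster.

```python
import re

def is_valid_regex(pattern):
    """Return True if the pattern compiles, otherwise report the error."""
    try:
        re.compile(pattern)
        return True
    except re.error as err:
        print(f"Malformed regular expression {pattern!r}: {err}")
        return False

# The pattern from the error message: the character class is never closed.
print(is_valid_regex(r"userId=([^&*)"))   # False

# Assumed intent: capture everything up to the next '&'.
print(is_valid_regex(r"userId=([^&]*)"))  # True
```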

Summary

Even though regular expressions are a bit complicated, they are one of the most useful tools for text searches. Once you overcome the hurdle of learning them, you can use regular expressions with Xplenty to match strings in Big Data, such as phone numbers, emails, URLs, and anything else that your Big Data heart desires.

