Regex Functions


A regex is a string of text that allows you to create patterns that help match, locate, and manage text. Sheetkraft contains various useful regex functions that can be helpful for matching, replacing, searching strings. We use various dialects to create patterns for any operation (Search, Match, Replace, Split) for regex:

. matches any single character, not just an actual dot character. Dot is thus a special character in a pattern.

. matches a literal dot. The backslash is an escape character. It escapes the meaning that the following character would have if the backslash were not present. Since dot matches any character, escaping this meaning makes it match the literal dot only.

\ matches a single literal backslash. The first backslash escapes the meaning of the second backslash so that the second backslash is no longer an escape character.

abc matches the literal text abc. All letters and digits match themselves. They do not have a special meaning.

\s matches a single space character. Other than the normal space (character code 32 there are various other characters like tab, non-breaking space etc that are considered spaces. \s matches any such space character.

\S matches any single character that is not a space. In general, if a backslash followed by a lower-case letter matches something, a backslash followed by an upper-case letter matches anything other than that something.

\d matches any single digit.

\w matches any single word character. A word character matches a letter or digit or underscore.

\d\d matches two consecutive digits.

\d\d.\d\d matches two consecutive digits immediately followed by a literal dot immediately followed by two consecutive digits.

[a-z] matches any single character in the range lower case a to lower case z

[a-z0-9] matches any single character in the range a to z or 0 to 9.

[0-9_] matches any single character in the range 0 to 9 or an underscore.

[-+] matches any single character that is a literal minus or a literal plus. Minus does not have the same meaning as in the examples above because there is no character between the opening square bracket and itself.

[0-9] matches any single character that is a 0 or a literal minus or a 9. Minus does not form a range because it is escaped.

[[]] matches any single character that is an opening square bracket or a closing square bracket. Note that the second opening square bracket does not need to be escaped but the first closing square bracket needs to be escaped. This same pattern could also be written as [[]] where the second opening square bracket is escaped unnecessarily. This works because there is no meaning defined for an escaped opening square bracket in this context, so the escape character is effectively ignored.

\d* matches zero or more consecutive digits

\d+ matches one or more consecutive digits

[a-z]+ matches one or more consecutive characters each of which is in the range a to z.

\d{1,3} matches at least 1 and at most 3 consecutive digits

\d{3,} matches 3 or more consecutive digits

v\d{3,} matches a literal v followed by 3 or more consecutive digits

\d{5} matches exactly 5 consecutive digits

? is a shortcut for {0,1}, just as * is a shortcut for {0,} and + is a shortcut for {1,}

abc|def matches either abc or def. | has very low precedence. So, this is not ab followed by either of c or d followed by ef. It is either of abc or def.

We can restrict where the match occurs by using anchors ^ and $

^abc matches any text starting with abc. It does not match xabc

abc$ matches any text ending with abc. It does not match abcy

^a.c$ matches any text starting with a followed by any single character and ending with c. It matches abc but does not match xabc or abcy

^^abc is equivalent to ^abc. To get a literal ^, it will need to be escaped as in ^^abc which matches ^abc.

Let us consider a simple case:

(Note: This is a simple application of regex functions. This can be implemented using various other search patterns. During Implementation you can come across various problem statements, for example: file path parsing, date extraction, Account number extraction from a bank statement, which can be achieved effectively using regex functions.)

Question: You need to extract the buyer, seller, number of units and date from the given data. Then from the original data you need to remove the Date to form a new data. Check whether units were not lost from the new data. After that finding the day, month, year from date. All the above tasks should be done using regex functions.

img1


  1. Regex Search: This function produces a first matched substring, given the regex search pattern, pattern formats and original string as inputs. Use search pattern with RegexSearch to extract buyer, Seller, No of units, Date.

img2


img3


img4


img5


Fill down the formulae using filldown.sk function.

img6


  1. Regex Replace: This function finds the sub-string matching the regex search pattern and replaces it with the expression based on the replace pattern. Use RegexReplace to remove the date from the original text.

img7


Fill down the above formula for all the entries.

img8


  1. Regex Match: This function finds the substring matching the regex search pattern from the main string. The output of this function is either TRUE or FALSE. Use RegexMatch to check whether there are numbers in the new sentence. Filldown for all the entries.

img9


img10


  1. Regex Split: This function splits the main string into substrings based on the delimiter and width as inputs. Use RegexSplit to extract the day, month, year from the Date. Filldown for all the entries.

img11


img12


  1. Regex Search All: A very similar function to Regexsearch.sk, the only difference is that regexsearch.sk produces the first matched substring, whereas, regexsearchall.sk produces all the sub-strings, that satisfies the regex search pattern from the original string. Let us see the example in an HTML file (See Figure).

img13


We want to store all the details in the database. We will use lazy regex here instead of greedy. We will fetch table data using Regex 2a and Regex 2b between two strings (See Figure).

img14


Special Case: Regex Search with 3rd Argument

We can see the function arguments from UI of Microsoft Excel (See Figure)

img15


If the pattern has multiple capture groups, then the 3rd argument is used (if required).

Capture Groups: In order to extract parts of a match, the part has to be put in parenthesis, which is called as capturing group.

For Example:

(\d{3}).(\d{4}) matches a sequence of 3 digits followed by a literal dot followed by a sequence of 4 digits.

(?:abc){2}(\d{3}) matches two consecutive occurrences of abc followed by three digits. , for example abcabc121. Parenthesis create a capturing group, but sometimes we want to group without capturing the grouped part. Here abc is grouped but not captured. Grouping it allows specification of the quantifier {2} that applies to the whole group. \d{3} is grouped and captured. Since there is no capturing group before it, it is numbered 1. Non-capturing groups are not numbered as there would be no point to numbering them.

((?:abc){2}|(?:def){2})xyz matches either abcabcxyz or defdefxyz. This contains a single capturing group.

((abc){2}|(def){2})xyz also matches either abcabcxyz or defdefxyz. But this contains three capturing groups. The number of the capturing group is based on the order in which the opening parenthesis of the capturing group occurs in the pattern. The first capturing group is (abc){2}|(def){2}, the second group is abc and the third is def.

\\/+ matches a sequence of at least one character that is not a forward or backward slash. ^ negates the character class. This can be used to match a part of a file system path. For example \\/+$ can be used to extract the file name from a path.

In regex functions, “1$”, “2$” is used, which depicts the 1st capturing group and 2nd capturing group, respectively. This can be used in Regexsearch, SearchAll, Replace functions.

For example, we want to maintain a database of employees’ vehicles in some database (assuming 1 vehicle per head) in multiple capture groups such as State/Union Territory, RTO.

img16


Regex inside other functions/ (Indirect use of Regex through Sheetkraft UI)

  • In Filter: For Example, we want to filter employees based on their pan cards if entered in a correct manner. For that we can filter the pan using regex (See figures)

img17


Final Output will look like this:

img18


Regex Replace in Clean Text:

The same example in Regex Replace can be seen here. Regex replace can be used using sheetkraft UI in “More” options. (See Figure)

img19