How to use regular expressions (regex)?

Tutorial

How and why do we use regex? Mastering regex from basic to more advanced concepts.

Information
  • axel.thevenot@edu.devinci.fr
  • 06 24 98 20 33
Dates
  • Creation: 04/01/2020
  • Update: 05/02/2020

Regular expressions


Regular expressions (or regex) are tools for describing patterns of strings. They are used to detect, examine, modify and manipulate strings. Put simply, if you want to find all the names in a text, you can use a regex. Suppose a name contains only letters and begins with an uppercase letter: a regex lets us transcribe that sentence into a form a computer can use. I have attached a regex cheatsheet that I invite you to download and keep beside you during the tutorial, because I will not cover everything here, to avoid repetition.


For those who already know some regex, it may be one of your pet hates. I suggest you set aside your preconceptions and start again on a sound footing with our friend. And for those who don't know what a regex is, remember not to judge a book by its cover.


Text to copy paste in a file.txt

REGEX :
on\s(?P<Content>.+)\son\s(?P<Support>.+)

TESTS :
Hello world !
Welcome to this tutorial on regular expressions on python
I have a meeting with Michael Anderson at 7pm



In this tutorial we will review the basics of regular expressions, using Python so that we can experiment interactively. To make the tutorial comfortable to follow, I suggest you copy and paste the short script below.

WARNING: do not modify the structure of the text file when you test your regex. Also, if you are on Windows, you will have to change the last line of the script as its comment indicates.


This script lets you test a regex live against as many test sentences as you want. The text to copy and paste above shows what the text file should look like. Below the line "REGEX :" we write the regex to be tested. Below the line "TESTS :" we write the list of strings to test it against.


Python script to copy paste in a main.py

import re
import os
import time


class SGR:
    """
    Select Graphic Rendition to display styles on the console

    If you want to know more about or change colors,
    you can find any information you want with the url below
    https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_parameters
    """
    reset = '\033[0m'
    bold = '\033[1m'

    class fg:
        black = '\033[30m'
        red = '\033[31m'
        green = '\033[32m'

    class bg:
        green = '\033[42m'

    @staticmethod
    def render_element(name, element, start='', sep=' : ', end='\n'):
        # bold label, separator, then the element in the default style
        return f'{start}{SGR.bold}{name}{sep}{SGR.reset}{element}{end}'


def get_data(filename):
    # read the file, stripping surrounding whitespace from each line
    with open(filename, 'r') as file:
        content = [x.strip() for x in file.readlines()]
    # the regex is on the second line
    regex = content[1]
    # the tests start on the fifth line
    tests = content[4:]
    return regex, tests


def test_regex(regex, test, main_color, match_color):
    r = re.compile(regex)  # compile the regex
    # find the first match (None if there is none)
    m = r.search(test)
    # get groups and group dict of the first match
    groups = m.groups() if m is not None else ()
    group_dict = m.groupdict() if m is not None else {}
    # get the start and end index of every match
    matches = [[match.start(), match.end()] for match in r.finditer(test)]
    # set the main color
    colored_match = f'{main_color}'
    last_end = 0
    # colorize the matches with match_color otherwise use the main_color
    for start, end in matches:
        # part without match
        colored_match += test[last_end:start]
        # part with match
        colored_match += f'{match_color}{test[start:end]}{main_color}'
        last_end = end
    # remaining part without match
    colored_match += test[last_end:]
    n_match = len(matches)
    return n_match, groups, group_dict, colored_match


if __name__ == '__main__':
    filename = 'file.txt'
    # re-read the file and redraw the results in a loop
    while True:
        try:
            regex, tests = get_data(filename)
            output_str = SGR.render_element('Input regex', regex)
            for test in tests:
                result = test_regex(regex, test, SGR.reset, SGR.bg.green)
                n_match, groups, group_dict, colored_match = result
                output_str += SGR.render_element('\n Test', test)
                output_str += SGR.render_element('Colored match',
                                                 colored_match, start='\t')
                output_str += SGR.render_element('Number of matches',
                                                 n_match, start='\t')
                if len(groups):
                    # str() because an optional group that did not match is None
                    groups_str = f'({", ".join(str(g) for g in groups)})'
                    output_str += SGR.render_element('Groups',
                                                     groups_str, start='\t')
                if len(group_dict):
                    group_items = [f'{k}: {v}' for k, v in group_dict.items()]
                    groups_dict_str = '{' + ", ".join(group_items) + '}'
                    output_str += SGR.render_element('Groups dictionary',
                                                     groups_dict_str,
                                                     start='\t')
            print(output_str)
        except Exception as e:
            # for example an invalid regex while it is being typed
            print(e)
        finally:
            time.sleep(0.5)
            os.system('clear')  # 'clear' on Linux and 'cls' on Windows


If you do not wish to understand the code provided, you can go directly to the next section. The script contains an SGR class that simply lets me style the console output in a lightweight way. If you want to know more about console display, follow the link in the class docstring. The get_data() function returns the content of the file where we write our regex and tests. The test_regex() function colors the matches and returns the number of matches, the groups and the dictgroups (named groups). You will understand these notions over the course of this tutorial. Finally, the program loops, re-reading the text file so that we can watch the regex work live (obviously, we have to save the text file for the program to take an update into account).


Basics


Let's start simply. Say we are looking for every letter e. The associated regex is simply e. If we look for every occurrence of the string el, the associated regex is simply el. Note that regex are case sensitive. In other words, e is different from E.

I invite you to try this notion out by yourself as much as possible, for example with the regex hello or the regex Welcome. They give respectively 0 and 1 match in total over our 3 tests.
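If you prefer to check these counts outside the live script, here is a minimal sketch using the three test sentences from file.txt. The helper name count_matches is my own choice for illustration, not part of the script above.

```python
import re

tests = [
    "Hello world !",
    "Welcome to this tutorial on regular expressions on python",
    "I have a meeting with Michael Anderson at 7pm",
]


def count_matches(pattern, strings):
    """Total number of non-overlapping matches of pattern over all strings."""
    return sum(len(re.findall(pattern, s)) for s in strings)


hello_count = count_matches("hello", tests)      # lowercase h: "Hello" does not match
welcome_count = count_matches("Welcome", tests)  # matches once, in the second test

print(hello_count, welcome_count)  # 0 1
```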




Ranges


So far, nothing complicated; we could even do without regex. It is from this point on that regex become really interesting. Imagine that we want to count the number of vowels in our texts. We can't write the regex aeiouAEIOU for this task, because it would only match that exact ten-character string. So we have to introduce the notion of interval (also called a character set). Intervals are written in square brackets [...]. To match all the vowels we need the regex [aeiouAEIOU]. Read aloud, it says that we are looking for any single character that is a or e or i or ... or O or U.



If we now want all capital letters from D to W, we use the regex [D-W].



To familiarize yourself with this interval concept, you can try out new regex using the intervals listed in the attached sheet.
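As a quick illustration of the two intervals discussed in this section, the sketch below applies them to one of our test sentences with re.findall. The variable names are only illustrative.

```python
import re

text = "Welcome to this tutorial on regular expressions on python"

# every single character that is one of the listed vowels
vowels = re.findall(r"[aeiouAEIOU]", text)

# every capital letter between D and W (inclusive)
d_to_w = re.findall(r"[D-W]", text)

print(len(vowels))  # number of vowels in the sentence
print(d_to_w)       # ['W']
```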



Character classes


If you have tested different intervals by yourself, you have probably written the regex [a-z]. To match every letter of the alphabet in upper and lower case you have to write [a-zA-Z]. I'll grant you that's a lot of typing for not much. That's why regex provide predefined character classes. The closest shorthand is \w, which matches any word character: in ASCII terms [a-zA-Z0-9_], that is letters plus digits and the underscore (in Python 3 it also matches Unicode letters and digits by default). The other classes are defined in the attached file. Each class has its negation, written with the capital letter: to match everything except word characters we use \W. Note that \s matches any whitespace character (space, tab, line break) and that . matches any character except the line break \n. So to find every occurrence of a non-word character followed by any three other characters, we use the regex \W... .
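To see what these classes actually match, here is a small sketch. The short string "a_1 !" is an example I made up to show that \w covers digits and the underscore as well as letters.

```python
import re

sample = "a_1 !"
word_chars = re.findall(r"\w", sample)  # letters, digits and underscore
non_word = re.findall(r"\W", sample)    # everything else

text = "I have a meeting with Michael Anderson at 7pm"
# one non-word character followed by any three characters
chunks = re.findall(r"\W...", text)

print(word_chars)  # ['a', '_', '1']
print(non_word)    # [' ', '!']
print(chunks)      # [' hav', ' a m', ' wit', ' Mic', ' And', ' at ']
```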




Quantifiers


In the previous example we used the regex \W... to match strings of 4 characters whose first character is not a word character. Now imagine that we want the same thing with 10 characters. We could write \W followed by nine dots, \W........., but you can see the problem with that. Quantifiers help us, as the name suggests, to quantify: the same regex is written \W.{9}. Similarly, \w{4} matches exactly 4 consecutive word characters. Be aware that on its own it also matches inside longer words; to match only whole words of 4 characters, surround it with the word boundary \b, which gives \b\w{4}\b. Note that a quantifier applies to the element that immediately precedes it.

That is starting to be quite a lot of notions, but with the cheatsheet next to you nothing will be difficult. I invite you to look at this sheet for the example that follows. Imagine that we want to match pairs of words where the first one has 2 letters or less. We will go step by step. We consider that there is a space before a word, so our regex starts with \s. Then we want a word of two letters or less, so we have \w{,2}. (In Python, {,2} means from 0 to 2 repetitions; write \w{1,2} if you want to require at least one letter.) Then we have another space to separate the first word from the second. For the second word we don't know its size, but we know it has at least one character, so we have \w+. Then we end with a space \s. Nothing very complex after all. Put end to end, we get the regex \s\w{,2}\s\w+\s.
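The sketch below tries the quantifiers of this section on a test sentence. The string "a  gap here" (with two consecutive spaces) is an example I added to show the difference between {,2} and {1,2} in Python. The word boundary \b belongs to the assertions listed on the cheatsheet.

```python
import re

text = "Welcome to this tutorial on regular expressions on python"

# \w{4} alone also matches inside longer words
inside_words = re.findall(r"\w{4}", "this tutorial")  # ['this', 'tuto', 'rial']
# with word boundaries only whole 4-character words remain
whole_words = re.findall(r"\b\w{4}\b", text)          # ['this']

# a quantifier replaces the repeated dots: both patterns are equivalent
same = re.findall(r"\W.{9}", text) == re.findall(r"\W.........", text)

# pairs of words whose first word has at most two characters
pairs = re.findall(r"\s\w{,2}\s\w+\s", text)          # [' to this ', ' on regular ']

# {,2} accepts zero characters, {1,2} requires at least one
loose = re.findall(r"\s\w{,2}\s\w+\s", "a  gap here")   # ['  gap ']
strict = re.findall(r"\s\w{1,2}\s\w+\s", "a  gap here") # []

print(whole_words, pairs, loose, strict)
```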




Groups


If you looked at the code I provided, you will have noticed a notion of group. To keep it simple, imagine that we want to extract the greetings Hello, Welcome and Hey. We could write the 3 regex [Hh]ello, [Ww]elcome and [Hh]ey and apply them one by one. Note that these regex take into account that the words may or may not start with a capital letter. But we can gather these 3 regex into one by forming what is called a group, noted (...), with alternatives separated by |. This gives the regex ([Hh]ello|[Ww]elcome|[Hh]ey).



The advantage of groups is that their content can be retrieved in variables. The test script provided earlier displays them on its "Groups" line. You can also use dictgroups (named groups), that is to say groups that are returned in the form of a dictionary. Just add ?P<name> at the beginning of the group: (?P<greetings>[Hh]ello|[Ww]elcome|[Hh]ey). You end up with a dictionary whose key greetings gives access to the match.



You can also decide not to return the group at all, with the format (?:[Hh]ello|[Ww]elcome|[Hh]ey). But once again I invite you to try it by yourself using the sheet.
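To compare the three kinds of group side by side, here is a minimal sketch. The sentence in text is an example I made up so that all three greetings appear.

```python
import re

text = "Hello world, welcome and hey!"

capturing = re.compile(r"([Hh]ello|[Ww]elcome|[Hh]ey)")
named = re.compile(r"(?P<greetings>[Hh]ello|[Ww]elcome|[Hh]ey)")
non_capturing = re.compile(r"(?:[Hh]ello|[Ww]elcome|[Hh]ey)")

# with one capturing group, findall returns the group content of each match
all_greetings = capturing.findall(text)

# groups() and groupdict() describe the first match only
first_groups = capturing.search(text).groups()
first_dict = named.search(text).groupdict()

# a non-capturing group still matches but returns no group
no_groups = non_capturing.search(text).groups()

print(all_greetings)  # ['Hello', 'welcome', 'hey']
print(first_groups)   # ('Hello',)
print(first_dict)     # {'greetings': 'Hello'}
print(no_groups)      # ()
```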



Assertions


To finish the enumeration of what can be done with regex, I would like to introduce assertions. The aim of this part is to build, step by step, the regex given at the end of this section. Don't panic: it's long and ugly, but it's not very complex when you take the time to think about it.



With this regex we want to extract the last name, and the first name if there is one, from a string, provided the name is directly preceded by the word with. Let's start simply: we consider that a name is a word composed of letters whose first letter is in upper case. We then have [A-Z][a-z]+. To return this name as a dictgroup, we simply write (?P<lastname>[A-Z][a-z]+).

To deal with difficult cases, we consider that a first name can be written in different ways. Imagine a certain Michael-Philip Johnson. The first name can also be written M.-P. or MP or M. or M-P etc. So to accept capital letters, dots or hyphens, potentially followed by lower case letters, we have the regex [A-Z\.\-]+[a-z]*. Again we write it as a dictgroup. Since the first name is not necessarily present, we make this group optional with a ?. We have the regex (?P<firstname>[A-Z\.\-]+[a-z]*)?. We can concatenate the two groups, separating them with an optional space \s?.

All that remains is to express the condition "is directly preceded by with". This is, literally, a positive lookbehind, noted (?<=...). Positive because of the "is" instead of "is not", and lookbehind because of the "preceded" instead of "followed". We end up with the regex (?<=with\s)(?P<firstname>[A-Z\.\-]+[a-z]*)?\s?(?P<lastname>[A-Z][a-z]+).

Once again, there are many assertions, and for those it really takes practice to become fluent. But what you have to remember is that to form a regex, you have to decompose the problem and take your time.
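Here is a sketch of the steps above assembled in Python, using the first-name pattern [A-Z\.\-]+[a-z]* derived in the text. The function name extract_name and the sentences other than our third test are my own illustrative choices. Note that when the optional first-name group does not take part in the match, its value is None.

```python
import re

NAME = re.compile(
    r"(?<=with\s)"                        # positive lookbehind: preceded by "with "
    r"(?P<firstname>[A-Z\.\-]+[a-z]*)?"   # optional first name or initials
    r"\s?"                                # optional separating space
    r"(?P<lastname>[A-Z][a-z]+)"          # capitalized last name
)


def extract_name(sentence):
    """Return the named groups of the first match, or None without a match."""
    m = NAME.search(sentence)
    return m.groupdict() if m is not None else None


full = extract_name("I have a meeting with Michael Anderson at 7pm")
initials = extract_name("Lunch with M.-P. Johnson")
last_only = extract_name("Lunch with Anderson")
no_with = extract_name("Michael Anderson called")

print(full)       # {'firstname': 'Michael', 'lastname': 'Anderson'}
print(initials)   # {'firstname': 'M.-P.', 'lastname': 'Johnson'}
print(last_only)  # {'firstname': None, 'lastname': 'Anderson'}
print(no_with)    # None
```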


At this point you are self-sufficient in regex. To practice, I suggest you reimplement the function get_data(filename) that I gave you above. For the moment it only works if the regex is on the second line and the list of tests starts on the fifth line. We could make it much cleaner and more robust by using a regex to extract the data.
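If you want something to compare your attempt with, here is one possible sketch among many. The pattern FILE_PATTERN and the helper parse_data are hypothetical names of mine, and I made two assumptions that differ from the original function: blank lines between the two blocks are tolerated, and empty test lines are skipped.

```python
import re

# "REGEX :" header, the regex on the next non-empty line,
# then the "TESTS :" header followed by everything else
FILE_PATTERN = re.compile(
    r"REGEX\s*:\s*\n(?P<regex>[^\n]+)\n\s*TESTS\s*:[ \t]*\n(?P<tests>[\s\S]*)"
)


def parse_data(content):
    """Extract the regex and the list of tests from the file content."""
    m = FILE_PATTERN.search(content)
    if m is None:
        raise ValueError("expected a 'REGEX :' block followed by a 'TESTS :' block")
    regex = m.group("regex").strip()
    # one test per non-empty line (assumption: empty lines are not tests)
    tests = [line.strip() for line in m.group("tests").splitlines() if line.strip()]
    return regex, tests


def get_data(filename):
    """Same interface as the original function, but position-independent."""
    with open(filename, "r") as file:
        return parse_data(file.read())


example = (
    "REGEX :\n"
    "on\\s(?P<Content>.+)\\son\\s(?P<Support>.+)\n"
    "\n"
    "TESTS :\n"
    "Hello world !\n"
    "Welcome to this tutorial on regular expressions on python\n"
    "I have a meeting with Michael Anderson at 7pm\n"
)
example_regex, example_tests = parse_data(example)
print(example_regex)
print(example_tests)
```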



Flags


Finally, there are many flags. I won't detail how to use them, since they are used differently in different programming languages. They are of course listed in the cheatsheet. For example, with the Python re package, you can use the I flag to ignore case, or the X flag to allow comments and whitespace inside the regex.
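For the Python re package specifically, the two flags mentioned above can be sketched as follows. The verbose pattern is a shortened variant of the lookbehind idea from the Assertions section, written by me for illustration only.

```python
import re

# re.I (re.IGNORECASE): "hello" now matches "Hello"
case_insensitive = re.findall(r"hello", "Hello world !", flags=re.I)

# re.X (re.VERBOSE): whitespace and comments inside the pattern are ignored
verbose = re.compile(
    r"""
    (?<=with\s)        # must be directly preceded by the word with and a space
    (?P<name>          # named group
        [A-Z][a-z]+    # one capital letter, then lowercase letters
    )
    """,
    flags=re.X,
)

m = verbose.search("I have a meeting with Michael Anderson at 7pm")

print(case_insensitive)  # ['Hello']
print(m.group("name"))   # Michael
```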