How do I extract information from my document using REGEX?
Practical examples
This article shows how to extract data from a document using multi-part regular expressions in PaperOffice and automatically store it as a keyword for the document.
We have created a sample document below that has a specific date. This document is an invoice. The date pattern on our document is formatted like this:
Extract information automatically from invoices
The month as three letters with the first letter always capitalized, followed by a space, then the two-digit day followed by a comma, another space, and then the four-digit year.
For example: Sep 20, 2019 or Mar 05, 2022
To extract this date we can use the following regular expression (REGEX):
([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})
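To try the expression outside PaperOffice, it can be run against a piece of text with Python's standard `re` module. This is only a minimal sketch: the invoice text is invented for illustration, and PaperOffice's own regex engine may behave differently in details.

```python
import re

# The expression from the article: month abbreviation, day, comma, year.
DATE_PATTERN = re.compile(r"([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})")

# Hypothetical invoice text, made up for this example.
sample_text = "Invoice No. 1001\nDate: Sep 20, 2019\nTotal: 150.00"

# search() scans the whole text and returns the first match, or None.
match = DATE_PATTERN.search(sample_text)
found_date = match.group(0) if match else None

print(found_date)  # Sep 20, 2019
```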
Let's break down the expression into individual groups. Each group is enclosed in parentheses ().
In the first group we look for the 3 month letters: ([A-Z][a-z]{2})
([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})
Extract month
- [A-Z] This expression matches one capital letter from A to Z, for example the letter "S" in "Sep". Note that upper and lower case letters are treated separately.
- [a-z]{2} This expression matches exactly two lowercase letters from a to z. That would be "ep" in "Sep".
Then we look for a space with the following expression: \s (it matches any whitespace character, including a space).
([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})
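The month part can be checked on its own. The following sketch uses Python's `re.fullmatch`, which only succeeds if the whole input fits the expression; the test strings are chosen for illustration. Note that the expression only checks the shape of the text, so any capitalized three-letter word passes, not just real month names.

```python
import re

# Month part of the expression: one capital letter, then two lowercase letters.
MONTH = re.compile(r"[A-Z][a-z]{2}")

month_ok = MONTH.fullmatch("Sep") is not None     # capital + two lowercase
month_lower = MONTH.fullmatch("sep") is not None  # first letter not capitalized
month_upper = MONTH.fullmatch("SEP") is not None  # letters 2 and 3 not lowercase
month_long = MONTH.fullmatch("Sept") is not None  # four letters instead of three
not_a_month = MONTH.fullmatch("Xyz") is not None  # right shape, but no real month

print(month_ok, month_lower, month_upper, month_long, not_a_month)
# True False False False True
```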
Extract date
In the second group we look for the day as a number: (0[1-9]|[12][0-9]|3[01])
([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})
The day is described by three alternatives.
Since we don't know which day appears in the document, it can be anything from the first day (01) to the last day (31) of the month, so several options have to be specified.
These are separated with the character "|".
Example: (1|2|3) = 1 or 2 or 3.
Square brackets contain a list of allowed characters. Not every character has to be listed individually; a range such as 0-9 is enough. However, the entire bracketed expression always stands for exactly one character.
If an expression is to describe several characters, the bracketed expressions are simply placed one after the other, so multiple square brackets match multiple characters. The input is then compared to your expression from left to right.
- 0[1-9] This expression means that the number starts with a "0" followed by a digit from 1 to 9. So we get any number from 01 to 09.
This expression looks for a number pattern that starts with a zero. If the dates in your documents are normally written like "5. March 2022", i.e. without a "0" in front of the "5", omit the "0" from the expression.
- [12][0-9] This expression means that the number starts with a "1" or a "2", followed by any digit from 0 to 9. The result can be any number from 10 to 29.
- 3[01] This expression means that the number starts with a "3" followed by a "0" or a "1". The result can be 30 or 31.
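The three alternatives for the day can be verified together. The sketch below, again using Python's `re` module, checks that every day from 01 to 31 is accepted and that a few values outside that range are rejected. The `DAY_FLEXIBLE` variant with an optional zero (`0?`) is an assumption added here for documents that mix "5" and "05"; it is not part of the original expression.

```python
import re

# Day group from the article, without the surrounding parentheses.
DAY = re.compile(r"0[1-9]|[12][0-9]|3[01]")

# Every two-digit day from 01 to 31 should be accepted.
all_days = [f"{d:02d}" for d in range(1, 32)]
all_accepted = all(DAY.fullmatch(d) for d in all_days)

# Values outside 01-31, or without the leading zero, should be rejected.
wrongly_accepted = [s for s in ["00", "32", "5", "99"] if DAY.fullmatch(s)]

# Hypothetical variant: "0?" makes the leading zero optional.
DAY_FLEXIBLE = re.compile(r"0?[1-9]|[12][0-9]|3[01]")
flexible_ok = all(DAY_FLEXIBLE.fullmatch(s) for s in ["5", "05", "15", "31"])

print(all_accepted, wrongly_accepted, flexible_ok)  # True [] True
```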
After the options for the day have been defined, the next step is the year.
First we look for the comma and the space that separate the day from the year: ,\s
([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})
Extract year
In the last group we look for the year: (20\d{2})
([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})
We are looking for any year, but we know that it will be 2000 or later; the expression covers the years 2000 to 2099.
- 20 This string means we're looking for any year starting with exactly 20.
- \d{2} This expression means that we are looking for any two-digit number, i.e. from "00" to "99".
The expression \d matches a single digit between 0 and 9, while \d{2} matches exactly two digits.
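The year part can be tested in isolation in the same way. This short sketch with Python's `re.fullmatch` shows which four-character inputs the expression accepts; the sample years are arbitrary.

```python
import re

# Year part of the expression: the literal "20" followed by exactly two digits.
YEAR = re.compile(r"20\d{2}")

year_2019 = YEAR.fullmatch("2019") is not None   # inside 2000-2099
year_2099 = YEAR.fullmatch("2099") is not None   # upper end of the range
year_1999 = YEAR.fullmatch("1999") is not None   # does not start with "20"
year_2100 = YEAR.fullmatch("2100") is not None   # starts with "21"
year_short = YEAR.fullmatch("20") is not None    # the two digits are missing

print(year_2019, year_2099, year_1999, year_2100, year_short)
# True True False False False
```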
Variables are read from the document and made available
If the regular expression is now used in PaperOffice, the end result is the date "Sep 20, 2019".
In this way, any date in this format can be read out of a document without us knowing the actual value in advance. The groups can also be reused and rearranged to read other date formats.
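Because the expression uses three groups, the month, day, and year are available separately after a match. The following sketch shows one possible way to turn them into a real date value in Python. The function name `parse_invoice_date` and the `MONTHS` lookup table are hypothetical helpers for this illustration and have nothing to do with how PaperOffice stores the keyword.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})")

# English month abbreviations; an assumption of this sketch, not part of the regex.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_invoice_date(text):
    """Return the first date found in text as a datetime.date, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month_name, day, year = match.groups()
    month = MONTHS.get(month_name)
    if month is None:
        return None  # three letters that are not a known month abbreviation
    try:
        return date(int(year), month, int(day))
    except ValueError:
        return None  # e.g. "Feb 31, 2020" fits the regex but is not a real date


first = parse_invoice_date("Date: Sep 20, 2019")
second = parse_invoice_date("Date: Mar 05, 2022")
invalid = parse_invoice_date("Date: Feb 31, 2020")
```

The last call illustrates a limit of the expression: it checks the shape of the date only, so a value like "Feb 31, 2020" matches even though no such day exists.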
Here is another example:
Invoices with different formats can be easily read
The date starts with the day, followed by a space and the month as three letters with the first letter always capitalized, followed by a period, another space, and then the year.
To extract this date, the regular expression (REGEX) just described can be reused with one addition, because in the second example a period follows the month.
The period can be specified with the following expression: \. (the backslash is needed because an unescaped . matches any character).
So the complete expression looks like this:
(0[1-9]|[12][0-9]|3[01])\s([A-Z][a-z]{2})\.\s(20\d{2})
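The rearranged expression can be tried out in Python as well. In this sketch the sample text "20 Sep. 2019" is an assumed example of the day-first format, and the last two lines show why the period is written as `\.`: an unescaped `.` would also accept other characters.

```python
import re

# Day first, then month abbreviation with a literal period, then the year.
DAY_FIRST = re.compile(r"(0[1-9]|[12][0-9]|3[01])\s([A-Z][a-z]{2})\.\s(20\d{2})")

# Hypothetical sample text in the day-first format.
match = DAY_FIRST.search("Invoice date: 20 Sep. 2019")
day, month, year = match.groups() if match else (None, None, None)
print(day, month, year)  # 20 Sep 2019

# An unescaped "." matches any character, an escaped "\." only a period.
unescaped_matches_t = re.fullmatch(r"Sep.", "Sept") is not None   # True
escaped_matches_t = re.fullmatch(r"Sep\.", "Sept") is not None    # False
```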
You can always validate your regex at https://regex101.com by entering it together with your sample text. Regex101 will not only check whether your regex is correct, it will also explain most of the regular expression to you.
In the same way, you can combine the different character sets to extract other kinds of information.