Using REGEX Regular Expressions for Automated Data Collection and Extraction (Part 2)

Through automated processing and reading of data with artificial intelligence (AI), a document management system such as PaperOffice DMS can help you reduce your business costs by up to 92% and increase operational efficiency.

In the current article, we show you how you can use regular expressions to enjoy the benefits of automated document processing. This applies in particular to documents from companies in any industry.

We show you exact examples of regular expressions and explain step-by-step what they mean and how you can use them.

In this way you can increase your operational efficiency, reduce human error through higher accuracy, lower your current costs, maintain data integrity and improve data security.

This article continues the first part on intelligent document processing.

Extracting specific data elements from documents can be an extremely expensive and time-consuming task. Frequently, scans of documents are sent to large outsourcing data entry companies, where the data is entered by hand.

However, there are several disadvantages to this approach, as follows:

This can jeopardize document security
A delay is introduced in workflow processes
Compared to automated extraction, manual indexing is a slow process
Manual indexing doesn't scale well on large projects
Manual indexing may introduce errors into the data
If a document is changed, the entire process starts all over again

And many more.

Despite the proliferation of scanning, a large proportion of business transactions are still based on paper-based documents. It is estimated that 85% of invoices are still issued on paper.

In addition, there are mountains of existing paper that have to be stored in huge warehouses!

What is a regular expression?

Regular expressions, also known as "REGEX", are a powerful tool for searching and manipulating text. They make it possible to recognize and edit complex patterns in text.

A regular expression consists of a combination of normal letters and special metacharacters that have special functions.

Regular expressions can also be used to replace or manipulate text, for example to swap every occurrence of a pattern for different text.

They are a very powerful tool for word processing and task automation.
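As a minimal sketch of replacing text with a pattern, the following Python snippet uses the standard `re` module to collapse runs of whitespace into a single space. The sample string and the variable names are made up for illustration and do not come from PaperOffice.

```python
import re

# Made-up sample text with irregular spacing, as it might come out of a scan.
raw = "Invoice   No.:\t 4711   Date:  Sep 20,  2019"

# \s+ matches one or more whitespace characters; each run becomes one space.
cleaned = re.sub(r"\s+", " ", raw)

print(cleaned)
```

The same call can substitute any matched pattern, not only whitespace.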

How can regular expressions help automate a business?

The growing number of digital documents of different types, with different naming rules and no adequate search system, complicates both searching and extracting information from documents. With unclassified documents in particular, the search becomes imprecise and slow.

Regular expressions (regex) provide a fast and powerful way to find, extract, and replace specific data in documents. A regular expression is essentially a special text string that describes a search pattern.

The document content is searched for text that fits this pattern, and the matching character string is read out. The pattern itself is defined using special symbols.

The Regex method is best suited for documents in which the positions of the values to be read can vary and simple document templates cannot work.

You can find a list of simple expressions in our ComDesk.

More extensive expressions can be taken from the PaperOffice Regex example collection.

How can I build regular expressions?

Regular expressions can be assembled in different ways, depending on what type of pattern is being searched.

Use metacharacters such as ., *, +, ?, ^ and $, and character classes such as [a-z], to represent specific types of characters or patterns.

Use optional parts: A question mark (?) makes the preceding element optional (zero or one occurrence), and an asterisk (*) allows it to occur zero or more times.

Use groups: Use parentheses to group parts of the pattern and treat them as a unit.
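The building blocks above can be tried out in a few lines. This is only a sketch using Python's standard `re` module; the patterns and sample strings are illustrative assumptions, chosen so that they behave the same way in ECMAScript and PCRE2.

```python
import re

# ? makes the preceding element optional: "colou?r" accepts both spellings.
optional = re.compile(r"colou?r")

# Parentheses group parts of the pattern so each part can be read out as a unit.
grouped = re.compile(r"(\d+)-(\d+)")

# ^ and $ anchor the pattern to the start and the end of the text.
anchored = re.compile(r"^[a-z]+$")

print(bool(optional.search("color")), bool(optional.search("colour")))
print(grouped.search("pages 12-34").groups())
print(bool(anchored.search("abc")), bool(anchored.search("abc1")))
```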

It's important to note that regular expression rules can vary by programming language, so read the documentation of the tools you use. A regex written for PaperOffice must be compatible with ECMAScript and PCRE2.

Tip

There is also a video on YouTube, "Automated Document STORAGE Part 3 / REGEX & Variables / Invoice Processing Document Management", which explains this process simply and clearly.

How do I extract information from my document using REGEX?

Practical examples

In the current article, we demonstrate how you can extract any data from the document thanks to multi-element regular expressions in PaperOffice and automatically store it as a keyword for the document.

We have created a sample document below that has a specific date. This document is an invoice. The date pattern on our document is formatted like this:

Extract information automatically from invoices

The date starts with the month, written as three letters with the first letter always capitalized, followed by a space, then the two-digit day followed by a comma, another space, and then the four-digit year.

For example: Sep 20, 2019 or Mar 05, 2022

To extract this date we can use the following regular expression (REGEX):

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Let's break down the expression into individual groups. Each group is enclosed in parentheses ().

In the first group we look for the 3 month letters: ([A-Z][a-z]{2})

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Extract month

[A-Z] This string means we are looking for one capital letter from A to Z, for example the letter "S" from "Sep". Note that the match is case-sensitive: upper and lower case letters are treated separately.
[a-z]{2} This string means that we are looking for two lowercase letters from a to z. That would be "ep" from "Sep".
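To see the case sensitivity in practice, here is a small sketch in Python (standard `re` module) that tests the month group on its own. The list of candidates is an illustrative assumption.

```python
import re

# The first group from the date expression, tested on its own.
month = re.compile(r"[A-Z][a-z]{2}")

for candidate in ["Sep", "Mar", "sep", "SEP"]:
    # fullmatch requires the whole candidate to fit the pattern.
    print(candidate, bool(month.fullmatch(candidate)))
```

Only "Sep" and "Mar" fit. The group checks the shape of the text, not whether it is a real month name, so any capitalized three-letter string would also fit; the surrounding parts of the full expression narrow this down.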

Then we look for a space with the following string: \s (it matches any single whitespace character, including a space)

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Extract date

In the second group we look for the day, written in digits: (0[1-9]|[12][0-9]|3[01])

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

The day is described by three alternative statements.
Since we don't know which date will appear in the document, it can be anything from the first day (01) to the last day (31) of the month, so the different options have to be named accordingly.
These are separated with the character "|".
Example: (1|2|3) = 1 or 2 or 3.

Square brackets contain a list of allowed characters. Not every character has to be listed individually: a range such as 0-9 covers all digits. One bracketed expression always stands for exactly one character.

To describe several characters, the bracketed expressions are simply placed one after the other. The input is then compared to your expression from left to right.

0[1-9] This string means that the number starts with a "0" followed by a digit from 1 to 9. So we get any number from 01 to 09.
Note that this only matches days written with a leading zero. If your documents normally show a date such as "5. March 2022", i.e. without the "0" in front of the "5", omit the "0" from the pattern.
[12][0-9] This character string means that the number starts with a "1" or a "2", followed by any digit from 0 to 9. The result can be any number from 10 to 29.

3[01] This string means that the number starts with a "3" followed by a "0" or a "1". The result can be 30 or 31.
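The three alternatives together should cover exactly 01 to 31. As a quick check, this Python sketch (standard `re` module, variable names are our own) tests every two-digit string from 00 to 99 against the day group.

```python
import re

# The three alternatives for the day, tested on their own.
day = re.compile(r"0[1-9]|[12][0-9]|3[01]")

# Try every two-digit string from "00" to "99" and keep the ones that fit.
valid = [f"{n:02d}" for n in range(100) if day.fullmatch(f"{n:02d}")]

print(valid[0], valid[-1], len(valid))
```

The pattern does not know how many days a given month has, so a value such as "Feb 31" would still fit the shape.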

After the options for the day have been defined, the expression for the year should be determined.

Now we look for the comma and the space: ,\s

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Extract year

In the last group we look for the year: (20\d{2})

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

We look for any year, knowing that it will be 2000 or later. Because the pattern fixes the first two digits, it covers the years 2000 to 2099.

20 This string means we're looking for any year starting with exactly 20.

\d{2} This string means that we are looking for a possible two-digit number, i.e. from "00" to "99".

In other words, \d matches a single digit between 0 and 9, while \d{2} matches exactly two digits in a row.
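A short sketch in Python (standard `re` module) shows which values the year group accepts. The candidate list is an illustrative assumption.

```python
import re

# The last group from the date expression, tested on its own.
year = re.compile(r"20\d{2}")

candidates = ["2000", "2019", "2099", "1999", "2100", "201"]
accepted = [c for c in candidates if year.fullmatch(c)]

print(accepted)
```

Only the values that start with "20" and continue with exactly two more digits are accepted.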

Variables are read from the document and made available

If the regular expression is now used in PaperOffice, the end result is the date "Sep 20, 2019".

In this way, any date can be read out of a document without us knowing the original value. These groups can also be used anywhere else and moved freely to read other date formats.
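Outside PaperOffice, the same expression can be tried with a few lines of Python. This is a minimal sketch using the standard `re` module; the sample text and the names `DATE_PATTERN` and `found` are assumptions for illustration, not PaperOffice variables.

```python
import re

# The date expression from this article: month, day and year as three groups.
DATE_PATTERN = re.compile(r"([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})")

# Made-up invoice text containing the two example dates.
text = "Invoice date: Sep 20, 2019  Due date: Mar 05, 2022"

# Each match yields the groups in the order they appear in the pattern.
found = [match.groups() for match in DATE_PATTERN.finditer(text)]

for month, day, year in found:
    print(month, day, year)
```

Because each part is a separate group, month, day and year can be stored or rearranged individually.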

Here is another example:

Invoices with different formats can be easily read

The date starts with the day, followed by the month, made up of letters, but the first letter is always capitalized, followed by a period, another space, and then the year.

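To extract this date, the groups from the regular expression (REGEX) just described can be reused in a different order, with one addition: in the second example a period follows the month. Because the period is itself a metacharacter that matches any character, it has to be escaped with a backslash.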

This can be specified with the following character string: \.

So the complete expression looks like this:

(0[1-9]|[12][0-9]|3[01])\s([A-Z][a-z]{2})\.\s(20\d{2})
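The effect of the escaped period can be checked with a small Python sketch (standard `re` module). The sample strings and the `UNESCAPED` comparison pattern are assumptions added for illustration.

```python
import re

# Day-first expression with the period after the month escaped as \.
DAY_FIRST = re.compile(r"(0[1-9]|[12][0-9]|3[01])\s([A-Z][a-z]{2})\.\s(20\d{2})")

# Same expression with an unescaped period, which matches any character.
UNESCAPED = re.compile(r"(0[1-9]|[12][0-9]|3[01])\s([A-Z][a-z]{2}).\s(20\d{2})")

print(DAY_FIRST.search("Date: 05 Mar. 2022").groups())

# A comma instead of a period is rejected only by the escaped version.
print(bool(DAY_FIRST.search("05 Mar, 2022")), bool(UNESCAPED.search("05 Mar, 2022")))
```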

You can always validate a regex you have created by going to https://regex101.com and entering it together with your sample text. Regex101 will not only check whether your regex is correct, it will also explain most parts of the regular expression to you.

In the same way, you can combine the different character sets to read other kinds of values.

Read order number thanks to REGEX

As another example, we would like to read the order number from the document.

Order numbers are extracted from the document

The order number on our document is formatted as follows:

It always starts with the capital letters XYB, followed by a hyphen, followed by 8 digits, another hyphen and finally any 3 capital letters.

Examples of order numbers would be:

XYB-12316723-LSH

XYB-98456723-JRD

To extract this order number we can use the following regular expression:

XYB-\d{8}-[A-Z]{3}

Let's break down the expression one by one.

First we look for exactly the first 3 capital letters with the dash symbol: XYB-

XYB-\d{8}-[A-Z]{3}

After that we look for 8 digits followed by another hyphen: \d{8}-

XYB-\d{8}-[A-Z]{3}

As previously described, \d matches a single digit between 0 and 9, while \d{8} matches exactly eight digits in a row.

And finally we are looking for any 3 capital letters: [A-Z]{3}

XYB-\d{8}-[A-Z]{3}

Of the following order numbers:

XYB-12316723-LSH

XYB-98456723-JRD

XYB-975432671829

ZYB-12342176-ZHD

PaperOffice would recognize only the first two, XYB-12316723-LSH and XYB-98456723-JRD.

We have prepared a link to Regex101 for this example, in which the regular expression just described is listed with 4 examples. You can see that only two of the order numbers given meet our requirements.
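The same check can be reproduced locally. The following Python sketch (standard `re` module) runs the expression against the four example order numbers; the names `ORDER_NUMBER`, `candidates` and `recognized` are our own.

```python
import re

# Order number expression from this article.
ORDER_NUMBER = re.compile(r"XYB-\d{8}-[A-Z]{3}")

candidates = [
    "XYB-12316723-LSH",
    "XYB-98456723-JRD",
    "XYB-975432671829",  # second hyphen and the three letters are missing
    "ZYB-12342176-ZHD",  # wrong prefix
]

# Keep only the candidates that fit the expression completely.
recognized = [c for c in candidates if ORDER_NUMBER.fullmatch(c)]

print(recognized)
```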

Read article numbers thanks to REGEX

The article number on our document is formatted as follows:

It always starts with two capital letters, followed by a hyphen, followed by 6 digits.

Various item numbers can be read from invoices

Examples of item numbers would be:

MS-863398

DS-452829

To extract these article numbers, we can use the following regular expression:

[A-Z]{2}-\d{6}
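One caveat with such a short pattern: it can also match inside a longer value, for example inside an order number. The Python sketch below (standard `re` module) compares the expression as written with a variant that adds word boundaries (\b). The `BOUNDED` variant and the sample text are our assumptions, not part of the PaperOffice example.

```python
import re

# Article number expression from this article.
PLAIN = re.compile(r"[A-Z]{2}-\d{6}")

# Assumed variant: \b requires a word boundary on both sides of the match.
BOUNDED = re.compile(r"\b[A-Z]{2}-\d{6}\b")

# Made-up text containing two article numbers and one order number.
text = "2 x MS-863398, 1 x DS-452829, order XYB-12316723-LSH"

plain_hits = PLAIN.findall(text)
bounded_hits = BOUNDED.findall(text)

print(plain_hits)
print(bounded_hits)
```

The plain expression also picks up a fragment of the order number, while the bounded variant returns only the two article numbers.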

PaperOffice can digitize your documents and integrate with your systems to automate data extraction from invoices and other documentation without having to write and then maintain tons of code.

Contact us to talk about your use cases and learn more about how PaperOffice can help you become even more competitive in the digital age.

Getting started is easier than you think.

Are you still worried about not making it? Read our customers' case studies about integrating PaperOffice into their daily business and see for yourself how simple it is, or simply apply for a test installation.
