Perform Intelligent Document Processing Accurately


Through automated processing and reading of data with artificial intelligence (AI), a document management system such as PaperOffice DMS can help you reduce your business costs by up to 92% and increase operational efficiency.

Professional tip

Automated data collection with regular expressions: How to efficiently process large amounts of data using regular expressions.


The key to automated data collection and data extraction.

In this article, we show how you can use regular expressions to benefit from automated document processing. This applies to business documents from any industry.

We show you exact examples of regular expressions and explain step-by-step what they mean and how you can use them.

In this way you can increase your operational efficiency, reduce human error through higher accuracy, lower your current costs, maintain data integrity and improve data security.

This article builds on the first part of the series about intelligent document processing.

Extracting specific data elements from documents can be an extremely expensive and time-consuming task. Frequently, scans of documents are sent to large outsourcing data entry companies, where the data is entered by hand.

However, there are several disadvantages to this approach, as follows:

  • This can jeopardize document security
  • A delay is introduced in workflow processes
  • Compared to automated extraction, manual indexing is a slow process
  • Manual indexing doesn't scale well on large projects
  • Manual indexing may introduce errors into the data
  • If a document is changed, the entire process starts all over again

And many more.

Despite the proliferation of scanning, a large proportion of business transactions are still based on paper-based documents. It is estimated that 85% of invoices are still issued on paper.

In addition, there are mountains of existing paper that have to be stored in huge warehouses!

What is a regular expression?

Regular expressions, also known as "REGEX", are a powerful tool for searching and manipulating text. They make it possible to recognize and edit complex patterns in text.

A regular expression consists of a combination of normal letters and special metacharacters that have special functions.

Regular expressions can also be used to replace or manipulate text.

They are a very powerful tool for word processing and task automation.
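As a small illustration of replacing text with a regular expression, here is a minimal Python sketch. The sample text is made up for this example, and Python's `re` module is used purely for demonstration; it is not the engine PaperOffice uses.

```python
import re

# Hypothetical sample text with irregular spacing, as often produced by OCR.
text = "Invoice  date:   Sep 20,  2019"

# \s+ matches a run of one or more whitespace characters;
# every such run is replaced with a single space.
normalized = re.sub(r"\s+", " ", text)

print(normalized)
```

The same pattern-plus-replacement idea works for any clean-up task where the unwanted text can be described as a pattern.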

How can regular expressions help automate a business?

The growing number of digital documents of different types, with inconsistent naming rules and no adequate search system, complicates both searching and extracting information from document content. With unclassified documents in particular, the search becomes imprecise and takes a long time.

Regular expressions (regex) provide a fast and powerful way to find, extract, and replace specific data in documents. A regular expression is essentially a special text string that describes a search pattern.

The document content is searched for text that matches this pattern, and the matching values are read out. Regular expressions are a way of defining patterns in information using special symbols.

The Regex method is best suited for documents in which the positions of the values to be read can vary and simple document templates cannot work.

You can find a list of simple expressions in our ComDesk.

PaperOffice Regex example collection
Extensive expressions can be used from the PaperOffice Regex example collection

How can I build regular expressions?

Regular expressions can be assembled in different ways, depending on what type of pattern is being searched.

Use metacharacters such as ., *, +, ?, ^, $, [], and [a-z] to represent specific types of characters or patterns.

Use optional parts: Use the question mark (?) or asterisk (*) to make parts of the pattern optional.

Use groups: Use parentheses to group parts of the pattern and treat them as a unit.
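The three building blocks above can be combined in one pattern. The following Python sketch is only an illustration: the pattern, the function name `extract_number` and the sample lines are hypothetical and not taken from PaperOffice.

```python
import re
from typing import Optional

# Hypothetical pattern, written only for this example:
#   ^ and $    anchor the match to the start and end of the line
#   s? and :?  make the plural "s" and the colon optional
#   (\d+)      groups one or more digits so they can be read out as a unit
INVOICE_LINE = re.compile(r"^Invoices?:?\s(\d+)$")

def extract_number(line: str) -> Optional[str]:
    """Return the digits captured by the group, or None if the line does not match."""
    match = INVOICE_LINE.search(line)
    return match.group(1) if match else None

for line in ["Invoice 4711", "Invoices: 42", "Order 4711"]:
    print(line, "->", extract_number(line))
```

The first two lines match because the optional parts may be present or absent; the third fails because the literal text "Invoice" is required.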

It's important to note that regular expression rules can vary by programming language. So it is important to read the documentation of the tools used. The RegEx written for PaperOffice must be compatible with ECMAScript and PCRE2.

Tip

There is also a video on YouTube, "Automated Document STORAGE Part 3 / REGEX & Variables / Invoice Processing Document Management", which explains this process simply and clearly.

How do I extract information from my document using REGEX?

Practical examples

In this article, we demonstrate how you can extract any data from a document using multi-element regular expressions in PaperOffice and automatically store it as a keyword for the document.

We have created a sample document below that has a specific date. This document is an invoice. The date pattern on our document is formatted like this:

Read out PaperOffice invoice with regex
Extract information automatically from invoices

The month, abbreviated to three letters with the first letter always capitalized, followed by a space, then the two-digit day followed by a comma, another space, and then the year.

For example: Sep 20, 2019 or Mar 05, 2022


To extract this date we can use the following regular expression (REGEX):

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Let's break down the expression into individual groups. Each group is enclosed in parentheses ().

In the first group we look for the 3 month letters: ([A-Z][a-z]{2})

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Extract month

  • [A-Z] This string means we are looking for a capital letter from A-Z, for example the letter "S" from "Sep". Note that the pattern is case-sensitive: upper and lower case letters are treated separately.
  • [a-z]{2} This string means that we are looking for two lowercase letters from a-z. That would be ep from "Sep".

Then we look for a space with the following string: \s

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Extract date

In the second group we look for the day, written as a number: (0[1-9]|[12][0-9]|3[01])

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

The day is described by three alternatives.
Since we don't know which date will appear in the document (it can be anything from the first day (01) to the last day (31) of the month), the different options have to be listed.
These are separated with the character "|".
Example: (1|2|3) = 1 or 2 or 3.

Square brackets contain a list of allowed characters. Not every character has to be listed individually: a range such as [0-9] covers all digits from 0 to 9. However long the list, the entire bracketed expression stands for only one character.

To describe several characters, simply write several bracketed expressions one after the other. The input is then compared with your expression from left to right.

  • 0[1-9] This string means that the number can start with a "0" followed by a number from 1 to 9. So we get any number from 01 - 09.

    The string looks for a number pattern that starts with a zero. If your documents write dates such as "5. March 2022", i.e. without a "0" in front of the "5", omit the "0" from the character string.

  • [12][0-9] This character string means that the number can start with a "1" or a "2", followed by any digit from 0 to 9. The result can be any number from 10 to 29.
  • 3[01] This string means that a number could start with a "3" followed by a "0" or a "1". The result could be 30 or 31.
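To check that the three alternatives together cover exactly the days 01 to 31, the day pattern can be tested on its own. This is a minimal Python sketch; the helper name `is_valid_day` and the list of rejected values are chosen only for this illustration.

```python
import re

# The day group from the article, without the surrounding parentheses.
DAY = re.compile(r"0[1-9]|[12][0-9]|3[01]")

def is_valid_day(value: str) -> bool:
    """True if the whole string is matched by one of the three alternatives."""
    return DAY.fullmatch(value) is not None

# Every two-digit day from 01 to 31 should be accepted.
accepted = [f"{day:02d}" for day in range(1, 32) if is_valid_day(f"{day:02d}")]

# Values outside that range, or without a leading zero, should be rejected.
rejected = [value for value in ["00", "5", "32", "99"] if not is_valid_day(value)]

print(len(accepted))
print(rejected)
```

`fullmatch` is used here so that the whole input must fit the pattern; inside the complete date expression the surrounding space and comma play that role.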

After the options for the day have been defined, the expression for the year should be determined.

Now we look for the comma and the space: ,\s

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

Extract year

In the last group we look for the year: (20\d{2})

([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})

We look for any year, assuming that it lies between 2000 and 2099.

  • 20 This string means we're looking for any year starting with exactly 20.
  • \d{2} This string means that we are looking for any two digits, i.e. from "00" to "99".

The token \d matches a single digit between 0 and 9, while \d{2} matches exactly two digits in a row.

Variables are read from the document and made available

If the regular expression is now used in PaperOffice, the end result is the date "Sep 20, 2019".

In this way, any date can be read out of a document without us knowing the original value. These groups can also be used anywhere else and moved freely to read other date formats.
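The complete expression can be tried outside PaperOffice as well. The following Python sketch assumes a made-up line of invoice text and a hypothetical helper `extract_dates`; it only illustrates how the three groups deliver month, day and year. Note that in Python `\d` and `\s` also match non-ASCII digits and whitespace by default, which makes no difference for this sample.

```python
import re

# The date expression from the article: month, day and year as three groups.
DATE = re.compile(r"([A-Z][a-z]{2})\s(0[1-9]|[12][0-9]|3[01]),\s(20\d{2})")

def extract_dates(text: str) -> list:
    """Return a (month, day, year) tuple for every date found in the text."""
    return [match.groups() for match in DATE.finditer(text)]

# Hypothetical invoice text, written only for this example.
sample = "Invoice date: Sep 20, 2019. Payment due: Mar 05, 2022."

for month, day, year in extract_dates(sample):
    print(month, day, year)
```

The pattern checks only the shape of the date: any capital letter followed by two lowercase letters is accepted as a month, so a value like "Xyz 20, 2019" would also match.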

Here is another example:

Read PaperOffice Invoice 2 with Regex
Invoices with different formats can be easily read

The date starts with the day, followed by the month, made up of letters, but the first letter is always capitalized, followed by a period, another space, and then the year.

To extract this date, the groups of the regular expression (REGEX) just described can be reused in a different order, with one addition: in the second example a period (dot) follows the month.

This can be specified with the following character string: \.

So the complete expression looks like this:

(0[1-9]|[12][0-9]|3[01])\s([A-Z][a-z]{2})\.\s(20\d{2})

You can always validate a regex you have created on https://regex101.com by entering it together with your sample text. Regex101 will not only check whether your regex is correct, it will also explain most of the regular expression to you.
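As a rough sketch of how the reordered groups behave, the day-first expression can be run in Python and its groups rearranged into the month-first layout of the first example. The function name `to_month_first` and the sample texts are assumptions made for this illustration.

```python
import re
from typing import Optional

# Day-first variant from the article: day, month followed by a period, year.
DATE_DAY_FIRST = re.compile(r"(0[1-9]|[12][0-9]|3[01])\s([A-Z][a-z]{2})\.\s(20\d{2})")

def to_month_first(text: str) -> Optional[str]:
    """Find a day-first date and rewrite it in the 'Sep 20, 2019' layout."""
    match = DATE_DAY_FIRST.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    return f"{month} {day}, {year}"

print(to_month_first("Invoice date: 05 Mar. 2022"))
print(to_month_first("Invoice date: Sep 20, 2019"))
```

The second call returns None, because the month-first date does not fit the day-first pattern; each document layout needs its own expression.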

And so you can use the different character sets for anything.

Read order number thanks to REGEX

As another example, we would like to read the order number from the document.

Variables are read from the document and made available
Order numbers are extracted from the document

The order number on our document is formatted as follows:

It always starts with the capital letters XYB, followed by a hyphen, 8 digits, another hyphen and finally 3 arbitrary capital letters.

Examples of order numbers would be:

XYB-12316723-LSH

XYB-98456723-JRD

To extract this order number we can use the following regular expression:

XYB-\d{8}-[A-Z]{3}

Let's break down the expression one by one.

First we look for exactly the first 3 capital letters followed by the hyphen: XYB-

XYB-\d{8}-[A-Z]{3}

After that we look for 8 digits followed by another hyphen: \d{8}-

XYB-\d{8}-[A-Z]{3}

The \d token, as previously described, matches a single digit between 0 and 9, while \d{8} matches exactly eight digits in a row.

And finally we are looking for any 3 capital letters: [A-Z]{3}

XYB-\d{8}-[A-Z]{3}

Of the following order numbers:

XYB-12316723-LSH

XYB-98456723-JRD

XYB-975432671829

ZYB-12342176-ZHD

PaperOffice would recognize only the first two, XYB-12316723-LSH and XYB-98456723-JRD.

We have prepared a link to Regex101 for this example, in which the regular expression just described is listed with 4 examples. You can see that only two of the order numbers given meet our requirements.
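The same check can be reproduced locally. This Python sketch tests the four order numbers from the example against the expression; the variable names are chosen only for this illustration.

```python
import re

# Order number expression from the article.
ORDER_NUMBER = re.compile(r"XYB-\d{8}-[A-Z]{3}")

# The four candidates listed in the article.
candidates = [
    "XYB-12316723-LSH",
    "XYB-98456723-JRD",
    "XYB-975432671829",  # no second hyphen, no trailing letters
    "ZYB-12342176-ZHD",  # wrong prefix
]

# Keep only the candidates that match the expression completely.
recognized = [number for number in candidates if ORDER_NUMBER.fullmatch(number)]

print(recognized)
```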

Read article numbers thanks to REGEX

The article number on our document is formatted as follows:

It always starts with two capital letters, followed by a hyphen and 6 digits.

Read PaperOffice invoice with Regex
Various item numbers can be read from invoices

Examples of item numbers would be:

MS-863398

DS-452829

To extract these article numbers, we can use the following regular expression:

[A-Z]{2}-\d{6}
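A short pattern like this can also match inside longer strings. The Python sketch below uses a made-up line of text containing an order number and two article numbers. The variant with `\b` word boundaries is a possible refinement that is not part of the original expression.

```python
import re

# Article number expression from the article.
ARTICLE = re.compile(r"[A-Z]{2}-\d{6}")

# Assumed refinement, not in the original: \b requires a word boundary
# before the first letter and after the last digit.
ARTICLE_BOUNDED = re.compile(r"\b[A-Z]{2}-\d{6}\b")

# Hypothetical text containing an order number and two article numbers.
text = "Order XYB-12316723-LSH: MS-863398, DS-452829"

plain = ARTICLE.findall(text)
bounded = ARTICLE_BOUNDED.findall(text)

print(plain)
print(bounded)
```

Without boundaries, the fragment "YB-123167" inside the order number is also reported as an article number. With boundaries, only the two standalone article numbers remain.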

PaperOffice can digitize your documents and integrate with your systems to automate data extraction from invoices and other documentation without having to write and then maintain tons of code.

Contact us to talk about your use cases and learn more about how PaperOffice can help you become even more competitive in the digital age.

Getting started is easier than you think.

Still worried that you won't manage it? Read our customers' case studies about integrating PaperOffice into their day-to-day business and see for yourself how simple it is, or simply request a test installation.

FAQs

To conclude, we answer a few frequently asked questions on the topic "Using REGEX Regular Expressions for Automated Data Collection and Extraction (Part 2)":

Who is a paperless office suitable for?

The quick and easy answer to the question is: for every company. All business sectors and sizes benefit from a paperless office, from SMEs and start-ups to large companies. However, the conversion is particularly valuable for small and medium-sized companies: The reduction in processing effort and costs frees up the budget required for further growth boosters.

Can I use a cloud-based DMS provider for my paperless office?

No. Another factor that has been on everyone’s lips since the GDPR came into force in 2018 at the latest is data protection. DMS solutions and DMS software are used to process, manage and store documents that often contain sensitive, personal data. In the event of violations of the GDPR, the legislator provides for high fines.

Conclusion

  • Benefits justify the effort and costs

    Working digitally and bringing old documents into the new age will be the best key investment to save an incredible amount of time, money and nerves in the future.

  • You need someone who knows

    You don't need your own IT specialist to take advantage of all the advantages of digitization.
    What you need is the right partner at your side who, thanks to their experience, can implement exactly what you need. Avoid scaremongering, and choose test installations over fancy PowerPoint presentations of something you have never really tested.

  • The hardware is usually already available

    Experience has shown that almost every business and company has a large copier whose potential goes unused. These devices love mass scans, are tolerant of paperclips and can be the basis for a digital start without investing in a scanner.

  • Cheaper than expected with the right DMS

    Avoid cost traps with DMS / ECM systems where you are mercilessly at the mercy of the manufacturers. Do not make any compromises when it comes to your own administration options, such as teaching documents and making settings yourself. If you need help, the manufacturer will be happy to help you, but remain independent.

  • Digital automation is the future

    Procedures will be completely identical in the future, but fully automated.
    Invoice coming in? The workflow is triggered and everything goes its defined way.
    Search through all 1000 folders? No problem, because you have your own Google!

PaperOffice solves every problem: Guaranteed.

Case study

Digital specialist solutions for the automation of business processes

"Manually processing the documents in such a large community would have cost us a lot of time.
With the automated solution from PaperOffice DMS, the manual effort could be greatly reduced, at the same time investments were made in future-oriented technology. We are pioneers in digital property management."

Mr. Alejandro Campos
IT specialist and project manager at the property management El Guijo