What is Document Automation?

Written by Genus Technologies | Aug 12, 2021

Document automation means using a machine instead of a human to identify a document’s type and capture core business information from it. Documents today are not necessarily paper: email, web portals, and other communication channels mean that many documents never take physical form. Document automation converts the unstructured data in these documents into structured data.

Document Automation consists of three steps:

  1. Classification, where the machine identifies each document’s type and separates the documents into groups for processing.
  2. Extraction, where the machine converts a defined set of elements in the document into machine-usable form. Defining the elements and how they appear on the page is typically a manual or guided process, but there are also ways for the machine to do this automatically.
  3. Validation, where the machine checks its work, assigns a level of confidence, and, when needed, gets a human involved to review anything in question. If the machine is reasonably confident, you can usually trust that the data is correct. Additional checks, such as validating formats or comparing results with known lists, help boost confidence.

Document Automation Step 1: Classification

Automated classification sorts documents by type. Typical examples include invoices, claims, and loan documents, each of which is identified and separated into its own group.

The usual set of classification techniques includes:

  • Artificial Intelligence (AI) classifiers use machine learning to train a model that recognizes each document type.
  • Page layout is usually the fastest and least processor-intensive of the classification techniques. The pixel layout of the page is compared to a stored representation averaged from several samples. Classification occurs when the two match at some high level of confidence. This works well on traditional fill-in forms with lines and boxes and on text-based documents that look identical, such as form letters.
  • Text-based classifiers use natural language processing (NLP) techniques to understand what a document is. The NLP model can be as simple as finding a common phrase somewhere in the document or finding some meaningful combination of words.
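
As a rough illustration of the simplest text-based approach, the sketch below counts how many known phrases appear in a document's text. The phrase lists, the `classify_by_text` name, and the scoring rule are all assumptions made for this example, not part of any particular product.

```python
import re

# Hypothetical phrase lists; a real project would derive these from sample documents.
PHRASES = {
    "invoice": ["invoice number", "amount due", "remit to"],
    "claim": ["claim number", "date of loss", "policy number"],
    "loan": ["loan amount", "borrower", "interest rate"],
}


def classify_by_text(text):
    """Return (document_type, confidence) from the share of known phrases found."""
    # Lowercase and collapse whitespace so line breaks do not hide a phrase.
    normalized = re.sub(r"\s+", " ", text.lower())
    best_type, best_score = None, 0.0
    for doc_type, phrases in PHRASES.items():
        hits = sum(1 for phrase in phrases if phrase in normalized)
        score = hits / len(phrases)
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score
```

A document that matches no phrases comes back as `(None, 0.0)`, which a real system would treat as unclassified.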

Classification may use any one of these techniques, or several in combination, to improve the probability of a correct result. There are rules of thumb, but there is no substitute for developing a strong understanding of the content set and having the product experience needed to build an effective solution.
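
One possible way to combine techniques is a weighted vote over each classifier's result. The sketch below is only an assumption about how that might look; the `combine_classifiers` function, the technique names, and the weighting scheme are hypothetical.

```python
def combine_classifiers(results, weights=None):
    """Combine per-technique results into one (document_type, score).

    results: {technique_name: (document_type or None, confidence 0.0-1.0)}
    weights: optional {technique_name: weight}; missing techniques default to 1.0
    """
    weights = weights or {}
    totals = {}
    total_weight = 0.0
    for technique, (doc_type, confidence) in results.items():
        weight = weights.get(technique, 1.0)
        total_weight += weight
        if doc_type is not None:
            totals[doc_type] = totals.get(doc_type, 0.0) + weight * confidence
    if not totals or total_weight == 0:
        return None, 0.0
    best = max(totals, key=totals.get)
    # Dividing by the total weight means disagreement lowers the combined score.
    return best, totals[best] / total_weight


results = {
    "layout": ("invoice", 0.9),
    "text": ("invoice", 0.6),
    "ml": ("claim", 0.7),
}
best_type, combined = combine_classifiers(results)
```

Here two techniques agree on "invoice" and one disagrees, so "invoice" wins, but with a lower combined score than if all three had agreed.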

Document Automation Step 2: Content Extraction

Machine extraction replaces manual data entry by pulling fields of information from the document and converting them to structured data. The system “locates” the data elements needed for processing based on the document type identified in classification. It does so in a variety of ways:

  1. The simplest method is location on the page. If the customer number on Form X always appears in the upper left corner, extraction uses the location specified during setup and extracts whatever happens to appear there. A fudge factor for the exact position of the value usually accounts for minor shifts caused by scanning irregularities or other document differences.
  2. Another technique is text-based: it locates key words such as “customer number,” or some variant of those words (cust #, cust num, etc.), and captures the information around that anchor value. This resolves many of the issues with fixed locations, because the desired field value can appear virtually anywhere on the page, or anywhere in the document for that matter, and still be extracted.
  3. Another locating technique uses the expected form of the extraction target. A good example is an address. At least in the U.S., the postal service has strong rules for what an address block must hold, and in what order. Extraction uses those rules to recognize that a grouping of words is, in fact, an address and should be extracted, even if it has a slight variation.
  4. Another useful technique involves locating data elements that match some sort of pattern. Good examples are credit card numbers, phone numbers, and U.S. Federal tax IDs. A challenge, though, is understanding which value is the desired target when several are found on the current page or elsewhere in the document.
  5. Tables, such as those on an invoice or statement, are a special extraction case. The column labels define the fields, and each row of data represents one instance of all those fields. Most systems do a reasonable job detecting and extracting tables if the layout is not overly complex.
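
The first, second, and fourth techniques above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the word positions would normally come from an OCR engine, and the function names, label variants, zone coordinates, and phone format are all hypothetical.

```python
import re


def extract_from_zone(words, zone, tolerance=5):
    """Location-based: keep words whose position falls inside the zone.

    words: list of (text, x, y); zone: (left, top, right, bottom).
    tolerance is the fudge factor for small shifts from scanning.
    """
    left, top, right, bottom = zone
    hits = [
        text for text, x, y in words
        if left - tolerance <= x <= right + tolerance
        and top - tolerance <= y <= bottom + tolerance
    ]
    return " ".join(hits)


# Assumed label variants for one field: "customer number", "cust #", "cust num".
CUSTOMER_LABELS = r"(?:customer\s+number\b|cust\s*#|cust\s+num\b)"


def extract_after_anchor(text, label_pattern=CUSTOMER_LABELS):
    """Anchor-based: find a label variant and return the token right after it."""
    match = re.search(label_pattern + r"\s*[:.\-]?\s*(\S+)", text, flags=re.IGNORECASE)
    return match.group(1) if match else None


# Assumed phone format for illustration only: 555-123-4567 or 555.123.4567.
PHONE_PATTERN = re.compile(r"\b[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}\b")


def extract_by_pattern(text, pattern=PHONE_PATTERN):
    """Pattern-based: return every match; choosing the right one is a separate step."""
    return [m.group(0) for m in pattern.finditer(text)]


page_text = "Cust #: 12-34567\nCall 555-123-4567 or 555-987-6543"
customer = extract_after_anchor(page_text)
phones = extract_by_pattern(page_text)
```

In this sample, `phones` holds two candidates, which is exactly the challenge described for pattern matching: something else (position, a nearby label, or a business rule) has to decide which value is the target.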

The machine assigns a level of confidence to each value it extracts. The data then goes through the validation process, which makes sure the extracted data makes sense and is accurate.

Document Automation Step 3: Validation

When information is extracted, the machine determines whether the values are valid in the context of the application system receiving the data. The simplest kind of validation is checking the extracted value’s format. If a customer number in our enterprise is numeric, seven digits long, with a dash between the second and third characters, and the extracted data matches those simple rules, the extraction most likely worked.
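
A format check like this one is often just a regular expression. The sketch below assumes the rule means seven digits with a dash after the second digit (for example "12-34567"); that reading, the function name, and the optional known-list lookup are assumptions for illustration.

```python
import re

# Assumed reading of the rule: two digits, a dash, then five digits, e.g. "12-34567".
CUSTOMER_NUMBER = re.compile(r"[0-9]{2}-[0-9]{5}")


def is_valid_customer_number(value, known_numbers=None):
    """Check the format and, if a known list is supplied, membership in it."""
    candidate = value.strip()
    if CUSTOMER_NUMBER.fullmatch(candidate) is None:
        return False
    if known_numbers is not None and candidate not in known_numbers:
        return False
    return True
```

Passing a set of known customer numbers adds the "compare with known lists" check on top of the format check; leaving it out validates the format alone.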

Validation does not involve only the machine. It is about checking the work done by the machine or by a human, and it is a core part of all document automation applications. A confidence value lower than a set threshold triggers review by a human, who either confirms that the machine was correct or makes a change if it was not. Likewise, any failure of a rule-based validation also requires human intervention.
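
The review rule described above could be sketched as follows. The `ExtractedField` structure, the 0.85 threshold, and the queue names are hypothetical; real systems let you tune the threshold per field or per document type.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float   # 0.0 to 1.0, as reported by the extraction step
    passed_rules: bool  # result of rule-based checks such as format validation


def needs_review(field, threshold=0.85):
    """A field goes to a human if confidence is low or any rule failed."""
    return field.confidence < threshold or not field.passed_rules


def route(fields, threshold=0.85):
    """Return the destination queue and the names of fields needing review."""
    flagged = [f.name for f in fields if needs_review(f, threshold)]
    if flagged:
        return "human_review", flagged
    return "auto_accept", []


fields = [
    ExtractedField("customer_number", "12-34567", 0.97, True),
    ExtractedField("invoice_total", "1,204.50", 0.62, True),
    ExtractedField("invoice_date", "13/45/2021", 0.91, False),
]
queue, flagged = route(fields)
```

In this sample the total is flagged for low confidence and the date for failing a rule, so the whole document is routed to a person with just those two fields highlighted.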

Document Automation Saves Time and Money

In our 25 years of experience, Genus has seen company after company benefit from implementing document automation. Whether you’re in healthcare, finance, banking, insurance, or any other industry, we can help you improve productivity, save time on tasks, and decrease costs.