Data Extraction: Overcoming Challenges in Document Processing

Written by Randy DuFault | Apr 26, 2022

In the world of engineering, especially in software development, there is a constant demand to simplify complex processes. The goal is to create systems that deliver impressive results with minimal effort, guided by intuitive actions. However, achieving simplicity is not always an easy task.

In this blog, we delve into the challenges faced by software publishers, particularly in the context of building user-friendly systems, extracting structured data, and tackling complex table data. Join us as we explore the intricate landscape of simplifying complexity in software engineering.

Simplicity on a Large Scale

Simplicity is mandatory for software publishers building systems for large, diverse consumer audiences. Android and iOS are perfect examples: Google and Apple spend vast amounts of time, energy, and money to ensure that billions of users can use their smartphones effectively. One can only imagine the customer service nightmare and resulting brand damage if they released a particularly clumsy or hard-to-navigate feature. They must get it right every time.

Software publishers in other technology spheres want to be like the big boys. However, simplifying complex operations requires an incredible amount of complex development. Extensive development is costly, and sometimes, simplification is not possible.

Extraction Made Easy

Consider systems designed to extract structured data from unstructured content, such as paper forms and PDF files, for automated document processing. Often referred to as OCR platforms, these applications aim to reduce the human effort needed to convert what exists on a document into a form that downstream applications can process.

The general idea is simple. Use some sort of image-matching algorithm to convert images of text into digital characters (or extract the text from a PDF file in some prescribed fashion), figure out what you need or want to extract, locate the data elements, add data labels, and send the result off to the next application.
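The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text is assumed to have already been recognized, and the field names, patterns, and sample document are hypothetical rather than taken from any particular extraction platform.

```python
import json
import re

# Hypothetical field definitions: a label and a pattern that locates its value.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*(\S+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Locate each data element in the recognized text and attach its label."""
    fields = {}
    for label, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # A missing element is recorded as None instead of being guessed.
        fields[label] = match.group(1) if match else None
    return fields


def to_payload(fields):
    """Serialize the labeled data for a downstream application."""
    return json.dumps(fields, sort_keys=True)


# Made-up sample standing in for the output of the recognition step.
SAMPLE_TEXT = """ACME Supply Co.
Invoice No: INV-1042
Date: 2022-04-26
Total: $1,250.00
"""

if __name__ == "__main__":
    print(to_payload(extract_fields(SAMPLE_TEXT)))
```

A sketch like this only holds up while the document matches the patterns exactly, which is the perfect-world case.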

It all works well in a perfect world. However, our world is less than ideal: originals are often of poor quality, data elements shift on the page, character recognition is not perfect, applications generate PDF files in odd formats, senders constantly reorganize forms, and so on.
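As one small illustration of coping with imperfect character recognition, a label read from the page can be matched approximately against the labels an application expects. The sketch below uses Python's standard difflib module; the label list and the 0.8 similarity cutoff are assumptions chosen for the example, not settings from any real product.

```python
import difflib

# Hypothetical labels that an extraction definition expects to find.
KNOWN_LABELS = ["invoice number", "invoice date", "total due"]


def normalize_label(raw_label, cutoff=0.8):
    """Map a possibly misrecognized label to a known one, or return None."""
    candidates = difflib.get_close_matches(
        raw_label.lower(), KNOWN_LABELS, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None


if __name__ == "__main__":
    # "lnvoice Numher" imitates typical recognition slips (l for I, h for b).
    for raw in ["lnvoice Numher", "Tota1 Due", "Shipping Weight"]:
        print(raw, "->", normalize_label(raw))
```

Approximate matching of this kind addresses only one of the problems listed above. Shifting layouts and reorganized forms call for other techniques.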

Tabling the Topic

Then there is the challenge of extracting table data. Most of the documents flowing between businesses consist of table data: lists of the items that make up a shipment, invoice, or other document. Extraction systems must make sense of the table, while allowing for the variability of originals, and separate each set of lines into usable data elements. It is a difficult problem requiring complex solutions.
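To make the problem concrete, here is a minimal sketch of separating table lines into data elements. It assumes a made-up four-column layout (SKU, description, quantity, unit price) with columns separated by at least two spaces. Real tables vary far more than this, which is why lines that do not fit are set aside for review rather than guessed at.

```python
import re
from decimal import Decimal

# Assumed row layout: SKU, description, quantity, unit price.
ROW_PATTERN = re.compile(
    r"^(?P<sku>\S+)\s{2,}(?P<description>.+?)\s{2,}"
    r"(?P<qty>\d+)\s+(?P<unit_price>\d+\.\d{2})$"
)


def parse_table(lines):
    """Split table lines into row dictionaries; collect lines that do not fit."""
    rows, rejected = [], []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            rejected.append(line)  # flag for human review
            continue
        qty = int(match.group("qty"))
        unit_price = Decimal(match.group("unit_price"))
        rows.append(
            {
                "sku": match.group("sku"),
                "description": match.group("description"),
                "qty": qty,
                "unit_price": unit_price,
                "line_total": qty * unit_price,
            }
        )
    return rows, rejected


# Made-up sample lines; the last one has a letter O where a zero belongs.
SAMPLE_LINES = [
    "A-100   Steel bracket      12   4.50",
    "B-220   Hex bolt, M8       200  0.15",
    "C-307   Washer             1O   0.05",
]

if __name__ == "__main__":
    parsed, needs_review = parse_table(SAMPLE_LINES)
    print(parsed)
    print(needs_review)
```

In this sample, a single misread character in the quantity column is enough to keep the third line from parsing.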

In some use cases, tables are a well-understood problem. For example, the sheer volume of invoices that companies across the world send each other has resulted in several highly productive, preconfigured systems designed to extract invoice data accurately. Every major extraction software provider offers one or more preconfigured invoice extraction systems.

Small Volume, Large Issue

However, document flows in other use cases are not as ubiquitous. Lower overall volumes cannot support the sort of development efforts enabled by the worldwide quantity of invoices, leaving user companies to conduct often-complex extraction development on their own.

Extraction platform vendors try to make that development simple. They create point-and-click interfaces and advertise that minimally trained administrative staff can create extraction definitions that work perfectly all of the time. But remember that OCR systems are not smartphones; resource constraints always confine what a vendor can do. The resulting simple systems end up with significant limits on what data they can extract and on the techniques available to the underlying extraction engine. Extraction performance for anything but the best originals is dismal.

A Partner to Make the Complex Simple Again

Complex extraction applications require complex development. Developers must deeply understand the underlying software and must have the experience to understand practices that best implement the extraction requirements.

Typically, extraction performance is exceptionally good for a well-designed and properly developed application. Complex forms, complicated tables, and poor originals extract at rates that dramatically improve process efficiency and reduce labor costs.

In many cases, that performance will decline over time. The causes of decline are numerous, so applications require constant administrative attention and occasional development attention. Extraction applications are living, breathing things.

Watch any demonstration of a simple-to-configure extraction system with a critical eye. Challenge the vendor to show how the system performs on your complex and less-than-perfect originals. Understand the options and system features available to address mediocre performance.

More than anything else, establish a relationship with an experienced extraction integrator. Their consultants know and understand the internals of the software platform and understand best practices. A good partnership goes a long way toward achieving the performance businesses need and want from an extraction application.