Mission-critical industries including logistics, financial services, government, and healthcare rely on high-quality document data extraction to optimize business workflows, save money, and improve customer experience. In this article, we explain the conventional approaches to document processing, examine their drawbacks, and introduce a new approach that enables higher accuracy, even across the most challenging document types.
What is OCR?
Optical character recognition (OCR) is a class of document processing technologies that, as the name suggests, reads characters from text on a page. Originating in the 1920s, OCR has since evolved to support use cases involving data extraction from digitized documents. There are two kinds of OCR technologies:
The first is generic OCR, which pulls all of the text from a digitized document (image, PDF, etc.) and outputs a raw list of the words on a page. With a number of improvements in character recognition technology, generic OCR has become suitable for distinguishing words and reading complex font types.
The second is intelligent OCR, which has become a popular alternative because it automates data extraction for key fields of interest rather than returning a blob of all the text on a page. Intelligent OCR is built on generic OCR, but lets a user define a specific template for a document, manually specify all the fields they need extracted, and automatically get only the structured data they need for all similar-looking documents.
Problems with OCR
There are clear challenges with both of these conventional OCR approaches that reduce the quality of data extraction.
Generic OCR simply outputs all the text on a page; it does not isolate the key information you might need for downstream workflows. These solutions are typically paired with complex regex or hard-coded heuristics that can require a team of 4-5 engineers just to parse out the required fields and maintain the infrastructure. Even then, the variability in text layout and document format ultimately results in poor data extraction accuracy (roughly 50-70% for simple documents and 10-60% for complex unstructured documents).
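To make the regex-on-raw-text pattern concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than any vendor's implementation: the two invoice strings stand in for raw OCR output, and the field names and patterns are hypothetical. It shows how patterns tuned to one layout return nothing on another.

```python
import re

# Hypothetical raw OCR output for two invoices with different layouts.
INVOICE_A = "ACME Corp\nInvoice Number: INV-1042\nTotal: $1,250.00\n"
INVOICE_B = "Globex LLC\nInv # 7731\nAmount Due\n980.50 USD\n"

# Hand-written patterns tuned to the wording and layout of INVOICE_A.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "total": re.compile(r"Total:\s*\$([\d,]+\.\d{2})"),
}


def extract_fields(ocr_text):
    """Map each field name to its matched string, or None when no pattern matches."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


result_a = extract_fields(INVOICE_A)  # layout the patterns were written for
result_b = extract_fields(INVOICE_B)  # unseen layout: every field comes back None

print(result_a)
print(result_b)
```

Supporting the second layout means writing and maintaining another set of patterns, which is where the engineering cost described above comes from.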
Intelligent OCR can extract the fields you need from a document, but requires manually creating templates and rules to define the location of those fields for every document layout. This approach quickly becomes painstaking and difficult to scale as organizations expand their use cases. These solutions also fail when they cannot find a matching template for a new layout or when you need to add new fields to extract, both of which are common in real-world scenarios with high volumes or complex, unstructured document types. While intelligent OCR approaches might reach 70-80% data extraction accuracy with heuristics, they still require significant engineering and manual operations support to ensure errors don't fall through the cracks.
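The template-and-rules approach can be sketched in the same way. This is a simplified illustration under stated assumptions, not how any particular product works: the `Word` type, the `TEMPLATE` structure, the anchor string, and the region coordinates (fractions of page width and height) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, as a fraction of page width
    y: float  # top edge, as a fraction of page height


# Hypothetical template: an anchor phrase identifies the layout, and each
# field is read from a fixed rectangular region (x0, y0, x1, y1).
TEMPLATE = {
    "anchor": "ACME Corp",
    "fields": {
        "invoice_number": (0.60, 0.10, 0.95, 0.15),
        "total": (0.60, 0.80, 0.95, 0.85),
    },
}


def words_in_region(words, region):
    """Join the text of every word whose top-left corner falls inside the region."""
    x0, y0, x1, y1 = region
    return " ".join(w.text for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1)


def extract_with_template(words, template):
    """Return the field values, or None when the page does not match the template."""
    page_text = " ".join(w.text for w in words)
    if template["anchor"] not in page_text:
        return None  # no matching template for this layout
    return {
        name: words_in_region(words, region) or None
        for name, region in template["fields"].items()
    }


known_layout = [
    Word("ACME", 0.05, 0.05), Word("Corp", 0.12, 0.05),
    Word("INV-1042", 0.70, 0.12), Word("$1,250.00", 0.72, 0.82),
]
# Same vendor, but the total has moved up the page.
shifted_layout = [
    Word("ACME", 0.05, 0.05), Word("Corp", 0.12, 0.05),
    Word("INV-1042", 0.70, 0.12), Word("$1,250.00", 0.72, 0.60),
]
# A vendor with no template at all.
new_vendor = [Word("Globex", 0.05, 0.05), Word("7731", 0.30, 0.20)]

known_result = extract_with_template(known_layout, TEMPLATE)
shifted_result = extract_with_template(shifted_layout, TEMPLATE)
new_vendor_result = extract_with_template(new_vendor, TEMPLATE)
```

The known layout yields both fields, the shifted layout silently loses the total, and the new vendor yields nothing until someone authors a template for it.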
In short:
- Generic OCR simply extracts text but doesn’t identify the specific fields you need extracted. This approach requires significant engineering effort to maintain the infrastructure and yields low-accuracy data.
- Intelligent OCR can extract your desired fields but requires manually defining templates and hard-coded rules. This approach can reach higher accuracy than generic OCR but still results in many errors and doesn’t scale to high-volume, high-variability use cases.
Why use Adaptive ML Document Processing instead?
Running OCR is only the first of many steps in a robust document processing solution that extracts data at the highest quality. While OCR solutions may get you part of the way there, they are not complete on their own, because they still require significant engineering and operations support to pull out key fields and fix errors.
While we hear about machine learning frequently in moonshot applications like autonomous driving and Artificial General Intelligence, the technology is already highly useful today in enterprise use cases. Leveraging machine learning is key to getting a high-quality and truly automated document processing solution, even across highly variable and challenging document types. Rather than relying on off-the-shelf methods or ones that need manually defined templates, an end-to-end machine-learning-based approach can enable a robust solution with higher accuracy for highly variable structured and unstructured documents.
However, setting up these systems in-house requires significant investment across the machine learning stack in order to deploy a production system, including data labeling, model training, and continuous monitoring. Choosing the right partner to do this at scale can help you achieve higher accuracy and lower latency while freeing up time and money for your company’s core competencies.
Scale Document AI helps customers quickly deploy a complete document processing solution with guaranteed quality SLAs for any of their document types. We use our own in-house OCR engine to identify the text on a page, but we don’t stop there. What’s unique about our approach is that we leverage base models built on state-of-the-art Computer Vision and Natural Language Processing research. These models are trained on millions of data points, and we further refine them for each customer use case. This fine-tuning uses Scale’s proprietary data labeling and ML training infrastructure, which we’ve developed over the past few years to deliver a solution that consistently performs at high accuracy – even across unique edge cases, customer requirements, or variable document types.
Scale provides Adaptive ML document processing for many enterprises and organizations across financial services, logistics, manufacturing, healthcare, real estate and government. For example, Scale delivers fast extraction from messy invoices for Brex, accurate and timely data from complicated shipping paperwork for Flexport, and essential data out of titles, mortgage applications, and deeds for Doma in real estate transaction processing.
If you’re looking for a solution that’s higher quality, lower latency, and free of the hassle of creating and maintaining templates, learn more and contact us.