Despite significant technological advances in 2023 (version three of ChatGPT and other open-source AI models, augmented and virtual reality, and so forth), one antiquated practice seems to persist - the use of paper documents.
Over the last four decades, paper usage has grown by 400%. Paper documents produce poor-quality scans, which make data extraction difficult. Batch extracting data from PDFs is an essential task for businesses in most industries. There are three primary ways to extract PDF data: manual, OCR and AI-based data extraction.
In this article, we’ll examine each approach’s merits and potential pitfalls using invoices as an example. Invoices are versatile documents with a strong use case - every year, over half a trillion invoices are circulated (of which only 10% are electronic).
Manual data extraction is the most convenient approach in terms of setup. All that’s needed is a human operative and a device to capture data. Consequently, manual data extraction is an attractive option to many small businesses unwilling to spend time or money on a technological solution.
However, manual data extraction’s seeming convenience is undermined once you consider the validation required. Manual data extraction is only as accurate as the human operative, meaning errors can creep into the data. Manually extracted data is therefore generally accepted to reach at most around 99% accuracy (read more about it here).
A 1% error rate may not seem concerning, but it's a matter of context. For some industries - commercial lending or healthcare being two obvious contenders - a single error can have ruinous financial and interpersonal consequences. It has also been suggested in a paper published by the University of Hawaii that the error rate of human data entry into a spreadsheet can reach up to 40%.
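To see why context matters, it helps to work out how a small per-field error rate compounds across a whole document. The sketch below assumes the 1% figure applies to each keyed field independently, which is a simplification rather than something the sources above state; the function name and the field counts are illustrative.

```python
def chance_of_any_error(per_field_error_rate: float, field_count: int) -> float:
    """Probability that at least one of `field_count` fields is wrong.

    Assumption (not from the article): every field has the same error
    rate and errors in different fields are independent.
    """
    if not 0.0 <= per_field_error_rate <= 1.0:
        raise ValueError("per_field_error_rate must be between 0 and 1")
    if field_count < 0:
        raise ValueError("field_count must not be negative")
    # P(no errors) is (1 - p) ** n, so P(at least one error) is its complement.
    return 1.0 - (1.0 - per_field_error_rate) ** field_count


if __name__ == "__main__":
    for fields in (1, 10, 20, 50):
        risk = chance_of_any_error(0.01, fields)
        print(f"{fields:>2} fields per invoice: {risk:.1%} chance of at least one error")
```

Under these assumptions, an invoice with 20 keyed fields has roughly an 18% chance of containing at least one mistake, even though each individual field is 99% accurate.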
Attempts to mitigate these errors generally take more effort than an automated solution would. Some companies deploy extra checks or double-key entry, in which two operatives key the same document independently and any disagreements are reviewed.
For this reason, manual data extraction is an inconvenient and relatively expensive data extraction option.
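The double-key entry check mentioned above can be sketched in a few lines. This is a minimal illustration, not any particular company's process: the field names, the `normalise` helper and the rule that whitespace and letter case are ignored are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def normalise(value: Optional[str]) -> str:
    """Ignore case and surrounding whitespace when comparing (an assumed rule)."""
    return (value or "").strip().casefold()


def double_key_mismatches(
    first_pass: Dict[str, str], second_pass: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every field where the two operatives' entries disagree."""
    mismatches = {}
    # Check the union of fields so a field skipped by one operative is flagged too.
    for field in sorted(set(first_pass) | set(second_pass)):
        a, b = first_pass.get(field), second_pass.get(field)
        if normalise(a) != normalise(b):
            mismatches[field] = (a, b)
    return mismatches


if __name__ == "__main__":
    operative_a = {"invoice_number": "INV-1042", "total": "1,250.00", "supplier": "Acme Ltd"}
    operative_b = {"invoice_number": "INV-1024", "total": "1,250.00", "supplier": "acme ltd "}
    for field, (a, b) in double_key_mismatches(operative_a, operative_b).items():
        print(f"Review {field}: {a!r} vs {b!r}")
```

Only the transposed invoice number is flagged here. Note that the check doubles the keying effort for every document, and it cannot catch a mistake that both operatives make in the same way.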
OCR, or Optical Character Recognition (also called text recognition), was considered a revolutionary technology around the 2000s. OCR works by building templates of characters or document types and matching what it sees against them.
The OCR vs. IDP (intelligent document processing) debate has been brewing for some years now. Most OCR technology shows a significant inflexibility when it comes to learning. Give a standard OCR system a particularly complex document; even if it’s received prior training, it will struggle.
“If you're reading a PDF of a cheque, you can use OCR to do it because the information across all cheques looks the same - they're practically identical, so you know exactly where to look for each piece of information. But for anything that doesn't look like that, which is 99% of the documents in the world, OCR is not fit for purpose.” Martin Goodson, CEO/Chief Scientist, Evolution AI
Enterprises that receive high volumes of near-identical documents will likely find OCR an acceptable solution. However, given how few companies this describes, it’s clear that OCR is becoming an outdated investment.
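The reliance on identical layouts is easy to see in a toy version of template-based extraction. The sketch below is a simplified illustration, not how any specific OCR product works: it assumes the recognised words already come with page coordinates, and the `Word` and `Zone` types, the field names and the coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal position of the word on the page
    y: float  # vertical position of the word on the page


@dataclass(frozen=True)
class Zone:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def extract_with_template(words: List[Word], template: Dict[str, Zone]) -> Dict[str, str]:
    """Read each field from a fixed page region; anything outside it is missed."""
    return {
        field: " ".join(w.text for w in words if zone.contains(w))
        for field, zone in template.items()
    }


if __name__ == "__main__":
    cheque_template = {"date": Zone(400, 20, 560, 50), "amount": Zone(400, 100, 560, 130)}
    standard_cheque = [Word("01/03/2023", 420, 30), Word("1,250.00", 430, 110)]
    # Same information, but the amount is printed lower down the page.
    shifted_layout = [Word("01/03/2023", 420, 30), Word("1,250.00", 430, 210)]
    print(extract_with_template(standard_cheque, cheque_template))
    print(extract_with_template(shifted_layout, cheque_template))
```

With the standard layout both fields are found; with the shifted layout the amount comes back empty, even though it is plainly on the page. Each new layout needs its own template.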
On the other hand, OCR is superior to manual data extraction in terms of employee productivity, as it allows employees to focus directly on the information.
“From a senior manager perspective, I'd rather people were focusing on whether the deal feels right rather than keying a chassis number or a registration number from an invoice.” Adam Crockford, Senior Automation Manager, Novuna (taken from this webinar)
Equally, however, if employees spend a significant amount of time correcting the output of the OCR automation solution, this is also a poor use of their time.
The last few years have seen AI-based data extraction tools come to fruition. Data extraction is an ideal use case for AI. Though large models like ChatGPT are still in development, data extraction is the low-hanging fruit of AI. There’s no waiting around for a better solution to be developed: it’s already arrived.
Intelligent data extraction uses NLP (natural language processing) to comprehend the meaning of information, allowing it to easily extract the required information.
Consequently, AI-based data extraction solutions are both inexpensive and low-effort. Past the initial implementation, AI should be low-maintenance. Additionally, most AI providers operate on a yearly contract, the cost of which is a tiny sum compared to the price of full-time employees (not to mention costly errors).
In some cases, using AI to extract data from PDFs will require training the model. Zero-shot learning - or deep learning without any training documents for reference - is possible (try Evolution AI’s zero-shot model for free here). More complex or unique documents may require a short initial training period (up to 2 days).
AI also carries a certain stigma derived from fear-mongering. Some employees resist it, concerned that they might lose their jobs. Though it is rapidly becoming a cliche, AI-powered automation frees up employees for higher-value tasks. In the long term, AI often means that firms can grow faster while recruiting slower.
AI data extraction software is a future-friendly solution to an old problem. Most AI solutions target not only paper scans but also electronic PDFs and web text. Their versatility considerably outstrips that of OCR while keeping costs and inconvenience to a minimum.
___
Manual capture is viable but extremely slow as a form of data extraction from invoices.
For example, DF Capital found that extracting from one invoice took 20 minutes, including checks. Since industries like commercial finance rely on expedient decision-making, slow data extraction can be a severe hindrance.
Though OCR invoice extraction is faster than manual data extraction, it is also a limited solution: invoices can be complex documents and contain a multitude of important data fields. Automating data extraction from invoices using OCR can be extremely cumbersome, as it will only succeed if the invoices all have identical layouts.
For invoices, AI is a superb option, owing to its ability to discriminate between meanings and accurately record data.
AI invoice data extraction combines OCR and deep learning methodologies, training neural networks to mimic the human brain in hierarchising data.
Invoice information extraction using OCR and deep learning is, therefore, far more powerful than OCR alone. AI-based invoice data extraction processes information, understands its meaning and learns from its mistakes. For example, if the AI model confused the invoice number and the order number during the training process, then once corrected it should not repeat this mistake on future documents.
AI data extraction software will generally take only a few seconds to produce the extracted data and check it against external and internal systems (for more information, see our ‘Invoice Solutions’ page).
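To put the two speeds side by side, here is a rough back-of-the-envelope calculation. The 20 minutes per invoice is the DF Capital figure quoted earlier; the 5 seconds per invoice and the batch of 1,000 invoices are assumptions chosen only to illustrate "a few seconds", not measured values.

```python
MANUAL_SECONDS_PER_INVOICE = 20 * 60  # 20 minutes, the DF Capital figure quoted above
ASSUMED_AI_SECONDS_PER_INVOICE = 5    # assumption standing in for "a few seconds"


def hours_needed(invoice_count: int, seconds_per_invoice: float) -> float:
    """Total processing time in hours for a batch handled one invoice at a time."""
    return invoice_count * seconds_per_invoice / 3600


if __name__ == "__main__":
    batch = 1_000  # hypothetical batch size
    manual_hours = hours_needed(batch, MANUAL_SECONDS_PER_INVOICE)
    ai_hours = hours_needed(batch, ASSUMED_AI_SECONDS_PER_INVOICE)
    print(f"Manual: {manual_hours:.1f} hours")
    print(f"AI (assumed speed): {ai_hours:.1f} hours")
```

On these assumptions, 1,000 invoices take around 333 hours of manual keying against under an hour and a half of automated processing. The real gap will depend on the documents and on how much review the output needs.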
To get the most from AI data extraction, however, select a provider that prioritises the development of their proprietary technology. Obsolescence is a recurring concern with AI. It’s also good practice to opt for a provider that can tailor their product to your enterprise’s use case, adding and amending features where necessary.
In summary, while manual data extraction and OCR have their merits, invoice data extraction using AI is the most future-friendly solution to a problem that recurs across industries.
Interested in discovering more about approaches to data extraction? Check out:
How to Extract Financial Data from PDFs
The Real Cost of Manual Data Extraction