Few would argue that we live in anything but a data-driven world. However, the majority of data, approximately 80%, is unstructured.
Unstructured data lacks a predefined format – it is typically unsorted, unorganised and stored in its native form. Consequently, unstructured data can take many forms: text documents and emails, PDFs and scanned images, audio and video files, social media posts, and more.
Below, see an example of structured data vs. unstructured data.
Back in 2023, we provided an accessible introduction to converting unstructured data into structured text. Since then, data extraction methods have evolved significantly. Traditional, cumbersome methods of extracting insights from unstructured data – such as Optical Character Recognition (OCR) technology and manual data extraction – have been rendered obsolete by AI alternatives.
There’s nothing inherently wrong with unstructured data – other than that it is not always readily usable. An older but much-cited Forbes survey reported that 95% of businesses considered managing unstructured data problematic.
Rather than pay analysts to trawl through volumes of unstructured data, businesses are now turning to faster, more cost-effective solutions. Enter AI, particularly Large Language Models (LLMs).
LLMs are AI systems designed to perform numerous language-based tasks. One of those tasks is parsing unstructured data – let’s explore how.
Various LLMs that can extract unstructured data are available for free online – ChatGPT and Google’s Gemini among them.
Unsure which LLM would best meet your needs? Check out our article comparing the strengths of recent, popular LLMs.
Let’s examine how you can start using an LLM to extract unstructured data in three steps.
Creating an account usually takes less than a minute, and you can do so with an existing Google or Apple account, an email address, etc. Once you’ve signed up and opened the LLM, you’ll see an interface that looks like this (here, we’re using ChatGPT).
Click the plus icon next to the prompt box to open a pop-up where you can select one or multiple files from your personal device.
Note: Take care before uploading files containing sensitive data. Sharing such data with an AI service can cause privacy issues – for example, the information could end up in the LLM’s training data.
‘Prompt engineering’ might sound like a complex skill, but it just means telling the LLM what you would like it to output. (Explore our dos and don’ts for prompting here.) No need to overthink it – even someone who has never used AI should quickly get to grips with submitting prompts.
We’d recommend prompts along the lines of: ‘Extract all of the data in this document and present it as a structured table.’
Remember, the more specific you are, the better the output quality.
ChatGPT will generate a text file that you can download instantly. Alternatively, click the ‘copy’ icon to copy the output and paste it wherever you need it.
Important note: Don’t forget to review the output. If anything is incorrect, prompt the LLM to correct itself (e.g. ‘This data point does not seem correct. Please generate another response containing the correct information’).
The same process applies to extracting data from PDFs – simply upload and prompt. Extracting structured data from PDFs is a particularly strong business use case, since so much business documentation – invoices, receipts, financial statements, contracts – is stored and shared as PDFs.
In theory, by connecting to an LLM through its API, you can automate this process and extract structured text from high volumes (i.e. millions) of PDFs.
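As a rough illustration, here is a minimal Python sketch of what such an automated connection might look like, assuming the openai and pypdf packages are installed and an API key is configured. The model name, folder path and invoice fields are placeholders for illustration only, not a recommendation for any particular setup:

```python
# A rough sketch of bulk PDF extraction via an LLM API, assuming the
# `openai` and `pypdf` packages are installed and OPENAI_API_KEY is set.
# The model name, folder path and field names are placeholders.
import json
from pathlib import Path

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract the invoice number, invoice date and total amount from the "
    "document below. Respond in JSON with the keys 'invoice_number', "
    "'invoice_date' and 'total_amount'.\n\n{text}"
)


def extract_fields(pdf_path: Path) -> dict:
    # Pull the raw text out of the PDF (text-based PDFs only;
    # scanned documents would need OCR first).
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)


for pdf in Path("invoices").glob("*.pdf"):
    print(pdf.name, extract_fields(pdf))
```

In practice, a pipeline like this would also need error handling, rate limiting and – as discussed below – a review of every extracted value before it feeds into downstream systems.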
The main drawback of using an LLM is that it isn’t specialised for unstructured data extraction. Accordingly, it will make mistakes along the way. That is a certainty.
Don’t let AI-generated errors ruin your holiday! (Source: Kateryna Stetsiuk)
Hallucinations are the most typical mistakes an LLM will make. An LLM can make simple transcription errors (e.g. copying the wrong digit from a blurry image), but the more common failure is a hallucination: generating plausible yet fictitious data.
As of March 2025, Vectara’s LLM leaderboard crowns Gemini’s newest model – Gemini 2.0 Flash – as the model generating the fewest hallucinations, with a rate of 0.7%. Across the top 25 best-performing LLMs, hallucination rates range between 0.7% and 2.6%.
So, the question is: can your business tolerate a proportion of errors? Put another way – can your business tolerate roughly one error (or more) per hundred data points? At a 1% error rate, extracting 100,000 data points would leave you with around 1,000 errors to find and fix.
Most likely, no (particularly if you work in finance, healthcare or law). Some say LLMs are useful when no particular ‘right’ answer exists. But when it comes to data accuracy, there is only one right answer – accurate data.
There are two alternatives to using LLMs to structure data – a specialised solution for extracting unstructured data or doing it yourself (i.e. manually).
If you’re dealing with a small volume of unstructured documents, you could consider the ‘old school’ method of structuring the data into a text file yourself.
The problem with manual data extraction is that, although humans don’t ‘hallucinate’ as AI does, they still make a roughly equivalent proportion of errors – most sources put the manual data entry error rate at around 1%. On top of that, manual data extraction from documents is incredibly time-consuming and tedious. For example, one of our clients reported that extracting data from a single financial statement took three hours.
A (high-quality) specialised data extraction solution often comes with an accuracy guarantee in the service level agreement (SLA). For instance, Evolution AI’s managed service offers complete accuracy across all financial documents – or your money back.
If you want to extract data from financial documents with complete accuracy, email hello@evolution.ai or book a demo with our team.
Here are a few reasons why Evolution AI’s unstructured data solution could be the right fit for you: