A hallucination denotes something that looks real but isn’t.
Likewise, an AI-generated hallucination is information that seems plausible yet is fictitious. For example:
The problem with hallucinations is that they blend in plausibly with the surrounding information. Consequently, users may find it difficult to separate them from the accurate content.
Different types of hallucinations exist. Let’s consider the following illustrative examples:
AI can also produce incoherent or irrelevant information, though arguably these errors are not hallucinations: rather than fabricating facts, they simply fail to address the prompt.
Have you ever used an AI virtual assistant (like ChatGPT, Gemini or Claude) – tools built on Large Language Models (LLMs)? Then you’ve likely encountered a hallucination without noticing, because generative AI hallucinates frequently.
ChatGPT’s initial version (GPT-3) hallucinated at a rate of around 3% – that is, for every 100 pieces of generated information, roughly three were fictitious.
Now, machine learning developers are creating innovative strategies for reducing the hallucination rate (more on this later). According to Vectara’s leaderboard, the best-performing AI model – the latest version of Gemini – generates errors at a rate of 0.7%.
So, with comparatively low error rates, casual users aren’t guaranteed to encounter a hallucination in any given interaction. Yet when one does appear, it can be deeply problematic.
Considering generative AI’s extraordinary capabilities, it may seem pedantic to nitpick the fidelity of its outputs. In other words, when AI can predict neuroscience study results with greater accuracy than humans and aid responses to earthquake damage, it might seem churlish to criticise its ~1% hallucination rate.
Yet even a 1% error rate can negatively affect many industries. Niklaus Wirth, the inventor of the Pascal programming language, emphasised the importance of precision: ‘In our profession, precision and perfection are not a dispensable luxury, but a simple necessity.’
In other words, accuracy is non-negotiable when navigating complex systems such as programming or financial modelling. Moreover, computer science is not the only discipline where perfection is paramount. Let’s explore three other professions.
Several horror stories exist of lawyers using AI virtual assistants. In these cases, the hallucinations were quickly identified, with disciplinary action ensuing. Case in point – earlier this year, a judge revoked a lawyer's admission to practice and fined three other attorneys $5,000 for using hallucinatory citations in a case against Walmart.
Law is an admin- and paperwork-heavy profession, so it’s easy to see why lawyers might be tempted to cut corners with generative AI. However, it’s important for AI users to understand what their model can and cannot do, so they can review its output effectively and identify any hallucinations.
An interesting and comprehensive article by The Register, titled ‘AI models hallucinate, and doctors are OK with that’, notes that AI excels at diagnosis prediction.
However, the article reflects a growing call in medicine for clearer regulatory and ethical guidelines to ensure full accountability for patient safety. This can’t come soon enough: Apple is reportedly capitalising on its track record of developing health tools by working on an AI doctor. In unregulated models, hallucinations could lead to poor medical advice, with devastating legal and moral consequences.
As in medicine and law, the consequences of seemingly minute financial errors can be catastrophic. For example, the Mizuho Securities typo cost the Japanese firm approximately £350 million, simply because someone entered a share order incorrectly (another example of why manual data extraction is not fit for purpose…).
We work closely with commercial lenders, for whom precision is essential when reviewing loan applications to reduce the risk of default. When the tools used to read financial documents introduce inaccuracies, real people and businesses can suffer.
More generally, hallucinations can lead to the following:
In particular, experts often advise businesses to choose specialised solutions with built-in safeguards against hallucinations, rather than relying on general-purpose LLMs. For example, if you need to extract data directly from documents, a specialised extraction solution is likely to serve you better than uploading them to a generative AI agent like ChatGPT, which is prone to hallucinating.
However, the fact that generative AI hallucinates shouldn’t deter new users from at least experimenting. In many use cases, the benefits of AI vastly outweigh the inconvenience caused by hallucinations. For instance, when using generative AI for creative tasks (e.g. brainstorming ideas), its versatility and speed matter more than factual accuracy. To know how to identify and react to hallucinations, it helps to understand why they occur in the first place.
Generative AI can hallucinate for the following reasons:
Generative AI models learn to make predictions by identifying patterns in their training data. If that data is biased or incomplete, the model can pick up on the wrong patterns, leading to hallucinations or inaccurate outputs.
Hallucinations in generative AI can also stem from the way text is generated. Decoding techniques such as temperature and top-k sampling deliberately introduce randomness for variety, which can lead the model to produce confident-sounding but incorrect (or entirely fabricated) content.
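To make this concrete, here is a minimal Python sketch of temperature sampling over a toy next-token distribution. The vocabulary and logits are invented purely for illustration (real models work over tens of thousands of tokens); the point is that raising the temperature flattens the distribution, so less likely – and potentially wrong – tokens are sampled far more often.

```python
import numpy as np

# Toy next-token distribution (invented for illustration). The model is most
# confident in the correct answer but assigns some probability to wrong ones.
vocab = ["1969", "1968", "1972", "unknown"]
logits = np.array([4.0, 1.5, 1.0, 0.5])

def sample_answers(logits, temperature, n=10_000, seed=0):
    """Sample n tokens after applying temperature scaling to the softmax."""
    rng = np.random.default_rng(seed)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    draws = rng.choice(len(vocab), size=n, p=probs)
    return {tok: round(float(np.mean(draws == i)), 3) for i, tok in enumerate(vocab)}

print(sample_answers(logits, temperature=0.2))  # almost always "1969"
print(sample_answers(logits, temperature=1.5))  # wrong answers appear far more often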
Hallucinations can also arise from the input context provided to a generative AI model. If the prompt is vague or contains incorrect information, the model can fill in gaps with false content.
Other than reviewing the output of generative AI models carefully, there are no convenient ways for users to recognise hallucinations.
Researchers are investigating the possibility of developing ‘hallucination-proof’ AI, or of building alternative AI models designed to flag hallucinations in generative AI. Such methods would make LLM outputs more reliable.
For example, in June 2024, Nature published a paper suggesting that semantic entropy could detect hallucinations. The method involves prompting a model several times with the same input, then comparing the meanings of its answers to measure the semantic entropy – how similar or different the responses are. A low score means the answers agree in meaning, suggesting the model is not hallucinating.
The main difficulty with an approach like semantic entropy is its computational cost: deploying it requires around ten times the processing power of a standard chatbot conversation.
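As a rough illustration (not the paper’s exact method), the sketch below groups several answers to the same prompt by meaning and computes the entropy over those groups. The `same_meaning` check here is a crude keyword comparison standing in for the bidirectional entailment model a real implementation would use, and the answers are hand-written rather than sampled from an LLM.

```python
import math

def semantic_entropy(answers, same_meaning):
    """Group answers into meaning clusters, then compute entropy over clusters."""
    clusters = []  # each cluster holds answers judged semantically equivalent
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Crude stand-in for a semantic-equivalence check (a real system would use a
# bidirectional entailment model, as the Nature paper does).
def same_meaning(a, b):
    return ("Paris" in a and "Paris" in b) or a == b

consistent = ["Paris", "It's Paris.", "Paris, France"]        # all agree
inconsistent = ["Paris", "Lyon", "Marseille", "It's Paris."]  # answers conflict

print(semantic_entropy(consistent, same_meaning))    # 0.0 -> likely not hallucinating
print(semantic_entropy(inconsistent, same_meaning))  # ~1.04 -> answers disagree
```

Because the method needs several full responses per question, the roughly tenfold cost mentioned above follows directly from the number of samples drawn.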
Nonetheless, there are several strategies for managing AI hallucinations.
So, you’re browsing a generative AI output and notice that it has inserted or contrived a plausible (yet false) detail. What do you do?
When constructing a prompt, consider asking the AI to include sources or references. That way, once the response is ready, you can quickly review the citations manually.
Case in point: when we interviewed the financial services firm Langcliffe, they were using LLMs to generate research on prospects, complete with citations that their experts could validate with one click.
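If you want a template to start from, something along these lines works; the wording and the example question below are purely illustrative, not a prescribed format.

```python
# Illustrative only: the question and the instruction wording are examples.
question = "What were the main causes of the 2008 financial crisis?"

prompt = (
    f"{question}\n\n"
    "For every factual claim, cite a numbered source with a full URL. "
    "If you cannot find a source for a claim, say so explicitly instead of guessing."
)

print(prompt)  # paste into your assistant of choice, then review the citations
```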
For outputs where accuracy is crucial, compare them to a trusted source. You may need to recalculate data or quickly check external sources.
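For numerical outputs, one cheap cross-check is to recompute any totals yourself rather than trusting the figure the model reports. A minimal sketch, using invented figures for a hypothetical invoice:

```python
# Hypothetical figures "extracted" by a generative AI model from an invoice.
extracted = {
    "line_items": [1250.00, 499.99, 75.50],
    "reported_total": 1835.49,
}

recomputed = round(sum(extracted["line_items"]), 2)
if abs(recomputed - extracted["reported_total"]) > 0.01:
    print(f"Possible hallucination: reported total {extracted['reported_total']}, "
          f"but the line items sum to {recomputed}")
else:
    print("Reported total matches the line items.")
```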
If the output includes sources or references, check that the links are valid and lead to the referenced information. Then, verify that the referenced information matches the generative AI’s output.
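A quick scripted check can weed out cited links that don’t resolve at all, although it can’t confirm that a live page actually supports the claim, so a manual read is still needed. This sketch uses only Python’s standard library; the example response text is invented.

```python
import re
import urllib.request
from urllib.error import URLError  # URLError/HTTPError are OSError subclasses

def check_links(text, timeout=5):
    """Extract URLs from an AI response and report which ones resolve."""
    urls = [u.rstrip(".,;") for u in re.findall(r"https?://[^\s)\]]+", text)]
    results = {}
    for url in urls:
        try:
            req = urllib.request.Request(url, method="HEAD",
                                         headers={"User-Agent": "link-checker"})
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[url] = resp.status  # e.g. 200 means the page exists
        except (OSError, ValueError) as exc:
            results[url] = f"failed: {exc}"
    return results

response = "The definition is taken from https://www.example.com/, as cited above."
print(check_links(response))
```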
Switching to a model with a lower hallucination rate may be a less effective strategy, as rates are broadly similar across leading generative AI models. Plus, the hallucination leaderboard constantly updates as developers release new models. Nonetheless, you can check the hallucination rate of your preferred model using Vectara’s leaderboard on GitHub.
Whether AI will (and should) continue to hallucinate remains controversial in AI circles. Take, for instance, Maria Sukhareva (an AI expert at Siemens), who says: ‘As soon as you make the model more deterministic, basically you force it to predict the most likely word. You greatly restrict hallucinations, but you will destroy the quality as the model will always generate the same text.’
In other words, by eliminating hallucinations, you also stifle AI’s ability to be creative. Whether AI should be creative at all – or leave that particular effort to humans – is also hotly contested. The use of AI in screenwriting, for example, was a central issue in the 2023 Writers Guild of America (WGA) strike.
In enterprise settings, as noted above, AI’s factuality is non-negotiable. Jakob Nielsen has suggested that hallucinations will phase out as generative AI models scale to around ten trillion parameters, which he predicts will happen in 2027. For users, that would mean less harmful and more reliable experiences.
Still, the demise of hallucinations is far from assured. Ulrik Stig Hansen of Encord describes hallucinations as a feature of AI models rather than a fixable ‘bug’. He points to management rather than prevention – a framing users may find helpful when approaching the impressive yet flawed performance of contemporary generative AI.
Generative AI is a (relatively) nascent technology, and its hallucinations reflect both its potential and the distance it has yet to travel. In the meantime, users of generative AI models will benefit from taking an over-cautious approach and reviewing sensitive outputs carefully. For now, at least, hallucinations are an inevitability.
Stay updated with Evolution AI’s insights by following us on LinkedIn.