A radiology study published in Nature Medicine has lessons that reach well beyond medicine. If you're building AI-powered document extraction for financial data, its findings should change how your team thinks about accuracy.
In March 2024, researchers from MIT and Harvard published a study in Nature Medicine that examined how AI assistance affected the diagnostic performance of 140 radiologists across 15 chest X-ray tasks. The top-line finding — AI improved average performance — was unsurprising. What came next was not.
The study found substantial heterogeneity in individual responses to AI assistance — some radiologists got significantly worse with AI help. The critical finding: "inaccurate AI predictions adversely affecting radiologist performance." AI errors actively degraded expert judgement.
Conventional predictors failed to identify who would be harmed. Years of experience, subspecialty training, familiarity with AI tools — none reliably predicted whether a radiologist would benefit or suffer. Lower-performing radiologists did not consistently benefit more, challenging the assumption that AI helps the least skilled the most. This mechanism — errors pulling expert performance downward — matters for anyone deploying AI in data workflows where mistakes are costly.
The radiology findings are a clinical demonstration of a well-documented phenomenon: automation bias. This is the tendency to over-rely on automated recommendations, even when they conflict with available evidence.
The 2026 International AI Safety Report, authored by over 100 experts and led by Turing Award winner Yoshua Bengio, highlighted a striking example: clinicians' tumour detection rates during colonoscopy dropped approximately 6 percentage points after several months of working with AI assistance. The report called this "a sign that people can lose skills when they rely on AI too much."
Now translate this to financial document processing. When an AI extraction system flags a field with high confidence but gets it wrong, human reviewers are primed to accept it. The consequences are specific: fabricated regulatory references — an AI confidently citing "IFRS 99 standard" when no such standard exists [4] — or incorrect numerical extractions from balance sheets. These errors don't stay contained. Extracted financial data feeds risk models, compliance checks, and automated reporting — systems that act on the data without a second human safety net.
The industry conversation around document extraction accuracy focuses heavily on headline numbers. Vendors quote figures like 95% or 99%. But these figures obscure the question that actually matters: what happens with the remaining errors?
When accuracy drops to 80%, two out of every ten data points extracted from a critical document are incorrect, and downstream AI recommendations become untrustworthy [2]. Even at 95% accuracy, one in twenty fields is wrong — and in a structured financial document with hundreds of fields, that means dozens of potential errors per document.
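The arithmetic is worth making concrete. A minimal sketch (the field counts are illustrative, not vendor benchmarks):

```python
def expected_errors(num_fields: int, accuracy: float) -> float:
    """Expected count of incorrectly extracted fields in one document,
    assuming each field is extracted correctly with probability `accuracy`."""
    return num_fields * (1.0 - accuracy)

# A structured financial document with 400 extracted fields:
print(round(expected_errors(400, 0.95)))  # 20 wrong fields on average
print(round(expected_errors(400, 0.99)))  # still 4 wrong fields on average
```

Even a headline figure of 99% leaves a handful of errors in every large document, and those are exactly the errors automation bias makes hardest to catch.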
Previously, human operators provided an implicit safety net, catching extraction mistakes before they propagated. Now, extracted data increasingly feeds automated systems directly. Manual data entry error rates of 1–5% [2] provide a baseline, but automation bias can push effective error rates higher when AI outputs are confidently wrong — because confident errors suppress the scrutiny that would otherwise catch them.
The lesson from the radiology study is about system architecture. Design for the error case, not just the happy path.
The first priority is confidence-aware outputs. Every extracted field should carry a calibrated confidence score — a graduated signal that directs human attention to where it matters most. A balance sheet total extracted at 99.8% confidence needs different treatment than a footnote reference at 72%.
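One way to make that concrete is to attach a calibrated confidence score to every field and band the output by where human attention is needed. A minimal sketch; the type, field names, and thresholds below are illustrative assumptions, not any particular product's API:

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # calibrated probability that `value` is correct

def review_priority(field: ExtractedField,
                    auto_accept: float = 0.99,
                    spot_check: float = 0.90) -> str:
    """Band a field by confidence so reviewers focus where it matters.
    Thresholds are illustrative and would be tuned per document type."""
    if field.confidence >= auto_accept:
        return "auto-accept"
    if field.confidence >= spot_check:
        return "spot-check"
    return "human-review"

print(review_priority(ExtractedField("total_assets", "1,204,000", 0.998)))  # auto-accept
print(review_priority(ExtractedField("footnote_ref", "Note 14", 0.72)))     # human-review
```

The point of the graduated bands is that a reviewer's time is a scarce resource: spending it uniformly across all fields is exactly what lets confident errors slip through.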
Extracted values should also be cross-referenced against known schemas, business rules, and regulatory formats before surfacing results. Financial documents contain internal consistency checks that extraction systems should exploit. A stated total that doesn't match the sum of its components is a validation failure the system should catch before a human ever sees it.
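A single such consistency check might look like this. This is a sketch under the assumption that the line items and their stated total have already been parsed as numbers; the tolerance value is illustrative:

```python
def total_is_consistent(line_items: list[float],
                        stated_total: float,
                        tolerance: float = 0.01) -> bool:
    """Flag a stated total that doesn't match the sum of its components.
    `tolerance` absorbs rounding in the source document (illustrative value)."""
    return abs(sum(line_items) - stated_total) <= tolerance

# Components of a subtotal vs. the stated figure:
print(total_is_consistent([120.0, 75.5, 4.5], 200.0))  # True
print(total_is_consistent([120.0, 75.5, 4.5], 210.0))  # False: validation failure
```

A failed check like the second call should block the value from flowing downstream, regardless of how confident the extraction model was in each individual component.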
When the model is uncertain, flag it explicitly rather than presenting a best-guess with false confidence. This is the direct antidote to automation bias: surface uncertainty to prevent bad data from propagating downstream. Route data that fails validation to human review rather than allowing it to corrupt downstream applications.
The Nature Medicine study found that experience alone didn't predict which radiologists would benefit from AI. Better system design, not better training, is the fix.
For teams building financial document extraction, the implication is clear: invest as much in how your system handles uncertainty as in how it handles the easy cases.
In financial data extraction, the highest benchmark accuracy matters less than how a system contains its errors. As financial documents grow more complex (ESG disclosures, multi-jurisdictional filings, new regulatory formats), the gap between headline accuracy and error containment will only widen.
Author: Martin Goodson
Martin is a former Oxford University scientific researcher and has led AI research at several organisations. He is a member of the advisory group for the University College London generative AI Hub. In 2019, he was elected Chair of the Data Science and AI Section of the Royal Statistical Society, the membership group representing professional data scientists in the UK. Martin is the CEO of the multiple award-winning data extraction firm Evolution AI. He also leads the London Machine Learning Meetup, the largest AI & machine learning community in Europe.