Small Models, Big Impact: How 1.5B Parameter LLMs Approached Human Performance on Data Extraction from Financial Statements

Dr Martin Goodson
March 12, 2026

GPT-4's trillion-plus parameters are not required to approach human analyst performance on structured financial tasks. Recent research from Northwestern University found that a 1.5 billion parameter model came within 4 F1 points of human analyst performance on an EPS-direction prediction task—without access to narrative context, company identity, or external market data.

The Northwestern Evidence: Small Models, Competitive Results

The Northwestern study tested 1.5 billion parameter language models against human financial analysts on EPS-direction prediction from anonymised historical financial statements. Qwen2.5-Coder, the best of the three 1.5B models tested, scored 68.44% F1—below both GPT-4 (73.28%) and the human analyst baseline (72.71%), but within striking range despite being roughly 1,000× smaller.

The numerical structure of financial statements—balance sheets, income statements, cash flow reports—contains sufficient signal for smaller models to produce competitive predictions, even without narrative context or broad world knowledge.

The study demonstrates that for structured financial analysis, the relationship between model size and performance isn't linear.

Why Size Doesn't Matter for Structured Financial Data

The Nature of the EPS-Direction Task

Predicting EPS direction from historical financial statements isn't open-ended analysis. The task uses highly structured numerical data—two years of balance sheet data and three years of income statement data—with consistent formatting and standardized accounting principles. It calls for recognising trends across reporting periods, interpreting liquidity and efficiency ratios, and spotting unusual account relationships—none of which requires cultural knowledge or open-ended reasoning.

These are fundamentally different from the broad reasoning tasks that motivate massive model architectures. A model doesn't need to understand cultural references or reason about abstract concepts to determine whether a company's current ratio indicates liquidity concerns.
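To make the liquidity example concrete, here is a minimal sketch of the kind of structured signal involved. This is illustrative only, not code from the study; the figures and function names are made up:

```python
def current_ratio(current_assets: float, current_liabilities: float) -> float:
    """Liquidity ratio: values below ~1.0 suggest short-term stress."""
    return current_assets / current_liabilities

# Two hypothetical reporting periods (values in $M).
prior = current_ratio(820.0, 410.0)   # 2.0
latest = current_ratio(640.0, 512.0)  # 1.25

# The trend matters as much as the level: a ratio falling across
# periods is the sort of pattern the task asks models to spot.
deteriorating = latest < prior
```

Nothing here requires world knowledge; the signal is entirely in the numbers and their year-over-year movement, which is the point of the Northwestern result.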

What Small Models Actually Need

A 1.5B parameter model has enough capacity to learn GAAP reporting structure, the accounting identity linking assets to liabilities and equity, and how those metrics shift year-over-year. It does not need conversational fluency or world knowledge to do this.
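The accounting identity mentioned above is simple to state, and checking it is a useful sanity test on extracted statement data. A minimal sketch with made-up figures:

```python
def balances(assets: float, liabilities: float, equity: float,
             tol: float = 0.01) -> bool:
    """Check the accounting identity: assets = liabilities + equity."""
    return abs(assets - (liabilities + equity)) <= tol

assert balances(1500.0, 900.0, 600.0)      # consistent statement
assert not balances(1500.0, 900.0, 550.0)  # inconsistent: a $50M gap
```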

On this task, chain-of-thought prompting over raw numerical data was sufficient. The models were given no narrative context, company identity, or external market data, and none was needed.

Architecture and Training: Building Your Specialized Model

A 1.5B parameter model optimised for one structured task has different infrastructure and evaluation requirements than a general-purpose API.

The study used instruction-tuned decoder-only models—Qwen2.5 variants and DeepSeek-R1-Distill—at 1.5B parameters, with default weights loaded from Hugging Face and no fine-tuning. All experiments ran on a single NVIDIA Quadro RTX 8000 (48GB VRAM). Despite never being trained on financial statements, the models still approached the human analyst baseline through chain-of-thought prompting alone.
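As a sketch of what chain-of-thought prompting over raw statement numbers might look like, the function below assembles a prompt from anonymised data. The wording, field names, and structure are assumptions for illustration, not the study's actual template:

```python
def build_eps_prompt(balance_sheets: dict, income_statements: dict) -> str:
    """Assemble a chain-of-thought prompt from anonymised statement data.

    Keys are fiscal years; values are dicts of line items. The prompt
    wording is illustrative, not the one used in the Northwestern study.
    """
    lines = ["You are given anonymised financial statements.", ""]
    for year, items in sorted(balance_sheets.items()):
        lines.append(f"Balance sheet, year {year}: " +
                     ", ".join(f"{k}={v}" for k, v in items.items()))
    for year, items in sorted(income_statements.items()):
        lines.append(f"Income statement, year {year}: " +
                     ", ".join(f"{k}={v}" for k, v in items.items()))
    lines += ["",
              "Think step by step about trends, ratios, and unusual "
              "account relationships, then answer with exactly one word: "
              "UP or DOWN for next year's EPS direction."]
    return "\n".join(lines)

prompt = build_eps_prompt(
    {2023: {"total_assets": 1500, "total_liabilities": 900}},
    {2023: {"revenue": 2100, "net_income": 140}},
)
# The resulting string would then be sent to a locally hosted model,
# e.g. via a text-generation pipeline loading
# "Qwen/Qwen2.5-Coder-1.5B-Instruct" from Hugging Face.
```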

The Economics of Deployment: Small vs. Large

Cost Analysis at Scale

At production volumes, self-hosting a 1.5B model on a single high-end GPU—the setup used in the Northwestern study—has a markedly lower per-unit cost than frontier model API calls, with the gap widening at high throughput. Sub-second inference also means the model can sit in the critical path of a decision rather than running as an overnight batch job.
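The break-even logic can be made explicit with some back-of-the-envelope arithmetic. All prices below are placeholder assumptions, not measured figures or quotes:

```python
# Illustrative break-even: self-hosted GPU vs. frontier-model API.
# Both figures are placeholder assumptions for the sketch.
gpu_cost_per_month = 1200.0  # amortised hardware + power, USD
api_cost_per_doc = 0.03      # frontier API cost per statement analysed

# Above this monthly volume, the fixed GPU cost wins per document.
break_even_docs = gpu_cost_per_month / api_cost_per_doc  # 40,000 docs

def per_doc_cost(docs_per_month: float) -> tuple[float, float]:
    """(self-hosted, API) cost per document at a given monthly volume."""
    return gpu_cost_per_month / docs_per_month, api_cost_per_doc
```

Under these assumptions, the self-hosted cost per document keeps falling with volume while the API cost stays flat, which is why the gap widens at high throughput.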

When to Choose Small

Specialized small models suit high-volume, repetitive tasks where per-unit cost matters and the document structure is predictable.

Large general-purpose models remain the better choice when the document type is unfamiliar, the analysis is open-ended, or the task requires external knowledge the small model was never trained on.

Implementation Roadmap for Finance Teams

A practical path for finance teams is to start narrow—pick one analysis task, establish a performance baseline, then test open-source instruction-tuned models (the Qwen2.5 variants and DeepSeek-R1-Distill used in the Northwestern study are a reasonable starting point) before investing in fine-tuning.
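Establishing the baseline means scoring predictions the same way the study does. A minimal F1 implementation for the binary UP/DOWN task is shown below; note that the study's exact averaging scheme is an assumption here:

```python
def f1_binary(y_true: list[str], y_pred: list[str],
              positive: str = "UP") -> float:
    """F1 for one class of the binary EPS-direction task."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth = ["UP", "UP", "DOWN", "DOWN"]
preds = ["UP", "DOWN", "DOWN", "DOWN"]
score = f1_binary(truth, preds)  # precision 1.0, recall 0.5 -> ~0.667
```

Scoring an open-source model and a human reviewer on the same held-out set with the same metric is what makes the "within 4 F1 points" comparison meaningful for your own data.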

For predicting EPS direction from anonymised balance sheet and income statement data, a 1.5B model without fine-tuning came within roughly 4 F1 points of the human analyst baseline. That is a narrow but concrete result. Organizations deploying AI at scale should evaluate task-specific small models before defaulting to frontier LLM APIs: the Northwestern evidence suggests that for this earnings-direction task, 1.5 billion parameters may be sufficient.


Author: Martin Goodson

Martin is a former Oxford University scientific researcher and has led AI research at several organisations. He is a member of the advisory group for the University College London generative AI Hub. In 2019, he was elected Chair of the Data Science and AI Section of the Royal Statistical Society, the membership group representing professional data scientists in the UK. Martin is the CEO of the multiple award-winning data extraction firm Evolution AI. He also leads the London Machine Learning Meetup, the largest AI & machine learning community in Europe.
