
Small Models, Big Impact: How 1.5B Parameter LLMs Approached Human Performance on EPS-Direction Prediction

Dr Martin Goodson
March 6, 2026

You don't need GPT-4's trillion-plus parameters to come close to human analyst performance. Recent research from Northwestern University found that a 1.5 billion parameter model came within 4 F1 points of human analyst performance on an EPS-direction prediction task—without access to narrative context, company identity, or external market data.

For finance teams, this matters: production financial analysis may not require frontier models.

The Northwestern Evidence: Small Models, Competitive Results

The Northwestern study tested 1.5 billion parameter language models against human financial analysts on EPS-direction prediction from anonymised historical financial statements. The results add to a growing body of evidence that larger models do not always outperform smaller ones on constrained, structured tasks: Qwen2.5-Coder, the best of the three 1.5B models tested, scored 68.44% F1—below both GPT-4 (73.28%) and the human analyst baseline (72.71%), but within striking range despite being roughly 1,000× smaller.

The numerical structure of financial statements (balance sheets, income statements, cash flow reports) carries enough signal for smaller, focused architectures to perform competitively without narrative context or broad world knowledge.

The study demonstrates that for structured financial analysis, the relationship between model size and performance isn't linear. Once you cross a threshold of capability for numerical reasoning and pattern recognition, additional parameters provide diminishing returns for these specific tasks.

Why Size Matters Less for Structured Financial Data

The Nature of the EPS-Direction Task

Predicting EPS direction from historical financial statements isn't open-ended analysis. The task uses highly structured numerical data—two years of balance sheet data and three years of income statement data—with consistent formatting and standardized accounting principles. Performing it well requires:

- Pattern recognition in numerical sequences: Identifying trends across reporting periods

- Ratio interpretation: Understanding liquidity ratios and efficiency indicators

- Anomaly detection: Spotting unusual relationships that suggest accounting irregularities or business model changes

- Relationship mapping: Understanding how changes in one account impact others

These are fundamentally different from the broad reasoning tasks that motivate massive model architectures. A model doesn't need to understand cultural references or reason about abstract concepts to determine whether a company's current ratio indicates liquidity concerns.
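The kind of numerical reasoning involved can be made concrete with a small sketch. The computations below are standard finance formulas; the balance sheet figures are hypothetical illustration data, not values from the study.

```python
# Sketch of the numerical reasoning the task requires: ratio
# interpretation and period-over-period trend detection.
# All figures are hypothetical illustration data.

def current_ratio(current_assets: float, current_liabilities: float) -> float:
    """Liquidity ratio: current assets / current liabilities."""
    return current_assets / current_liabilities

def trend(values: list[float]) -> str:
    """Classify the direction of a numerical sequence across periods."""
    if all(b > a for a, b in zip(values, values[1:])):
        return "rising"
    if all(b < a for a, b in zip(values, values[1:])):
        return "falling"
    return "mixed"

# Two years of balance sheet data (hypothetical, in $M)
ratios = [current_ratio(120.0, 80.0), current_ratio(95.0, 90.0)]
print(round(ratios[0], 2), round(ratios[1], 2))  # 1.5 1.06
print(trend(ratios))  # falling: liquidity is tightening
```

A declining current ratio across periods is exactly the sort of pattern a model, or an analyst, would weigh when judging whether next-period earnings are under pressure.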

What Small Models Actually Need

A 1.5B parameter model has enough capacity to learn:

- Financial statement structure and taxonomy: The consistent format of GAAP or IFRS reporting
- Mathematical relationships: How assets equal liabilities plus equity, how working capital flows
- Domain-specific patterns: What "normal" looks like across industries and company sizes
- Temporal reasoning: How metrics evolve quarter-over-quarter and year-over-year

What it doesn't need: general conversational ability or the vast factual knowledge encoded in models 100x larger. For this task, a model with no world knowledge or conversational ability still extracted the relevant numerical patterns—general-purpose capability was simply irrelevant to the prediction.
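One of the mathematical relationships listed above, the balance sheet identity, can be sketched as a simple consistency check. The function and tolerance below are illustrative assumptions, not part of the study.

```python
# Hypothetical sketch: the kind of mathematical relationship a small
# model must internalize -- the balance sheet identity
# assets = liabilities + equity.

def balances(assets: float, liabilities: float, equity: float,
             tol: float = 0.01) -> bool:
    """Check the accounting identity within a rounding tolerance."""
    return abs(assets - (liabilities + equity)) <= tol

print(balances(500.0, 300.0, 200.0))   # True
print(balances(500.0, 300.0, 180.0))   # False: statements don't reconcile
```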

Architecture and Training: Building Your Specialized Model

Implementing a small-scale financial analysis model requires different thinking than deploying a general-purpose LLM:

The Northwestern study used instruction-tuned decoder-only models in the 1.5B range—Qwen2.5 variants and DeepSeek-R1-Distill—suggesting standard causal language model architectures are sufficient for this class of task. Models in the 1-2B parameter range balance capability and efficiency.

The study used no fine-tuning—default model weights, default decoding parameters, loaded directly from Hugging Face. The implication is notable: the models were never trained on financial statements, yet still approached the human analyst baseline through chain-of-thought prompting alone.
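A zero-shot chain-of-thought setup of this kind might look like the sketch below. The prompt wording, field names, and output format are assumptions for illustration; the study's actual prompt is not reproduced here.

```python
# Hypothetical sketch of a zero-shot chain-of-thought prompt for
# EPS-direction prediction from anonymised statements. The wording
# and field names are assumptions, not the study's actual prompt.

def build_cot_prompt(statements: dict[str, dict[str, float]]) -> str:
    lines = ["You are given anonymised financial statements."]
    for period, fields in statements.items():
        lines.append(f"\n{period}:")
        for name, value in fields.items():
            lines.append(f"  {name}: {value:,.1f}")
    lines.append(
        "\nThink step by step about trends, ratios, and anomalies, "
        "then answer with exactly one word: UP or DOWN for the "
        "direction of next-period EPS."
    )
    return "\n".join(lines)

prompt = build_cot_prompt({
    "FY1 income statement": {"revenue": 1200.0, "net_income": 90.0},
    "FY2 income statement": {"revenue": 1350.0, "net_income": 120.0},
})

# With an instruction-tuned 1.5B model loaded from Hugging Face at
# default weights and default decoding, the prompt would then be fed
# to the model, e.g.:
#   from transformers import pipeline
#   generate = pipeline("text-generation", "Qwen/Qwen2.5-1.5B-Instruct")
#   answer = generate(prompt, max_new_tokens=512)
print(prompt)
```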

Financial data often benefits from specialized tokenization. Normalizing number formats and preserving numerical precision matters more than general text handling.
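What such normalization might look like is sketched below; this is an illustrative preprocessing function under assumed conventions (thousands separators, parenthesized negatives, fixed precision), not the study's actual pipeline.

```python
import re  # available for stricter validation; not required below

# Hypothetical sketch of numeric normalization before tokenization:
# strip thousands separators and currency symbols, unify
# parenthesized negatives, and emit a fixed decimal precision.

def normalize_number(raw: str, precision: int = 2) -> str:
    s = raw.strip().replace(",", "").replace("$", "")
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    value = float(s)
    if negative:
        value = -value
    return f"{value:.{precision}f}"

print(normalize_number("1,234.5"))   # 1234.50
print(normalize_number("(2,000)"))   # -2000.00
print(normalize_number("$99"))       # 99.00
```

Emitting every figure in one canonical format means the tokenizer sees consistent numeric strings, so the model never has to learn that "(2,000)" and "-2000.00" are the same quantity.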

All experiments ran on a single NVIDIA Quadro RTX 8000 GPU with 48GB of VRAM, confirming that 1.5B models are deployable on commodity hardware without distributed infrastructure.

The Economics of Deployment: Small vs. Large

Cost Analysis at Scale

At production volumes, the cost difference becomes significant. A 1.5B parameter model runs comfortably on a single high-end GPU—the Northwestern study ran all experiments on one NVIDIA Quadro RTX 8000—with inference times measured in milliseconds. Frontier model API calls introduce network latency and per-token costs that compound at production volumes.

At high volumes, self-hosted small model infrastructure costs are a fraction of large model API costs. The gap widens with document length and analysis depth.
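A back-of-the-envelope comparison makes the gap concrete. Every price and volume below is a hypothetical placeholder, not a quoted rate from any provider or from the study.

```python
# Back-of-the-envelope monthly cost comparison at production volume.
# Every figure below is a hypothetical placeholder, not a quoted rate.

API_COST_PER_1K_TOKENS = 0.01   # $ per 1K tokens (hypothetical)
TOKENS_PER_DOCUMENT = 4_000     # statements + CoT output (assumed)
DOCS_PER_MONTH = 500_000

GPU_MONTHLY_COST = 1_500.0      # amortized single-GPU server ($, assumed)

api_monthly = (DOCS_PER_MONTH * TOKENS_PER_DOCUMENT / 1_000
               * API_COST_PER_1K_TOKENS)
print(f"Frontier API: ${api_monthly:,.0f}/month")   # $20,000/month
print(f"Self-hosted:  ${GPU_MONTHLY_COST:,.0f}/month")
```

Under these assumed numbers the self-hosted GPU costs under a tenth of the API bill, and because API spend scales with tokens while the GPU cost is flat, longer documents and deeper chain-of-thought widen the gap further.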

Small models deliver sub-second inference. For automated underwriting or live portfolio monitoring, that means the model can sit in the critical path of a decision rather than running as an overnight batch job.

When to Choose Small

Specialized small models excel when:

- You have high-volume, repetitive analysis tasks where per-unit costs matter

- Data privacy concerns make external API calls problematic (financial data often can't leave your infrastructure)

- Latency requirements demand on-premise inference

- Task specificity means you don't need general capabilities

- You have sufficient domain expertise to curate training data and validate outputs

Large general-purpose models remain the better choice when the document type is unfamiliar, the analysis is open-ended, or the task requires external knowledge the small model was never trained on.

Implementation Roadmap for Finance Teams

A practical implementation path for finance teams:

1. Define Your Use Case: Start narrow. Pick one analysis task—say, automated liquidity assessment from balance sheets—before expanding scope.

2. Establish Baselines: Test your current process (human analysts or existing tools) against sample data. Define clear performance metrics: accuracy, false positive rates, processing time.

3. Evaluate Existing Small Models: Before training custom models, test open-source instruction-tuned options—Qwen2.5 variants and DeepSeek-R1-Distill performed well in the Northwestern study—against your baseline.

4. Fine-Tune Strategically: If off-the-shelf models underperform, collect 5,000-10,000 examples of your specific analysis task with expert labels. Fine-tune and re-evaluate.

5. Monitor for Drift: Financial markets and accounting standards evolve. Implement ongoing validation to catch when model performance degrades with unfamiliar document structures or market conditions.
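Steps 2 and 5 of the roadmap can be sketched as a minimal evaluation harness: score predictions against expert labels with F1 (the study's metric) and flag drift when recent performance slips below the established baseline. The drift margin and example labels are assumptions for illustration.

```python
# Sketch of roadmap steps 2 and 5: score predictions against expert
# labels with F1 (the study's metric) and flag drift when a recent
# window falls below an assumed margin under the baseline.

def f1_score(y_true: list[str], y_pred: list[str],
             positive: str = "UP") -> float:
    """Binary F1 for EPS-direction labels, computed from scratch."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def drifted(window_f1: float, baseline_f1: float,
            margin: float = 0.05) -> bool:
    """Flag drift if recent F1 drops more than `margin` below baseline."""
    return window_f1 < baseline_f1 - margin

# Hypothetical labels for a small evaluation window
truth = ["UP", "UP", "DOWN", "UP", "DOWN"]
preds = ["UP", "DOWN", "DOWN", "UP", "UP"]
score = f1_score(truth, preds)
print(round(score, 2))        # 0.67
print(drifted(score, 0.68))   # False: still within the margin
```

Run on a rolling window of recent documents, a check like this catches the degradation step 5 warns about before it propagates into downstream decisions.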

Takeaways

For predicting EPS direction from anonymised balance sheet and income statement data, a 1.5B model without fine-tuning came within roughly 4 F1 points of the human analyst baseline and 5 of GPT-4. That is a narrow but concrete result. Organizations deploying AI at scale should evaluate task-specific small models before defaulting to frontier LLM APIs; the Northwestern evidence suggests that for this earnings-direction task, 1.5 billion parameters may be sufficient.

Author: Martin Goodson

Martin is a former Oxford University scientific researcher and has led AI research at several organisations. In 2019, he was elected Chair of the Data Science and AI Section of the Royal Statistical Society, the membership group representing professional data scientists in the UK. Martin is the CEO of the multiple award-winning data extraction firm Evolution AI. He also leads the London Machine Learning Meetup, the largest AI & machine learning community in Europe. He is currently a member of the advisory board of the UCL Generative AI Hub.
