How to Extract Structured JSON from SEC Filings Using Agentic Workflows

Dr Martin Goodson
April 20, 2026

SEC filings are a goldmine of financial data — and a nightmare to parse programmatically. A single 10-K can run 80,000 words across dozens of sections, with tables that span pages and use merged cells, formatting that varies between filers, and section structures that shift from year to year. This tutorial walks through building an agentic pipeline that turns a raw 10-K into validated, structured JSON.

What Is an Agentic Workflow?

An agentic workflow is a process driven by an AI agent that can act autonomously toward a goal — rather than just following a fixed set of pre-programmed rules. You give the agent a goal (e.g., "extract all financial statements from this 10-K into structured JSON"), a set of tools (APIs, parsers, validation schemas), and constraints (output format, retry limits). The agent then decides what steps to take, evaluates intermediate results, adjusts its approach if needed, and continues until it reaches the goal.

A traditional pipeline executes a fixed script — same input, same output, and edge cases require explicit handling. An agentic workflow defines the objective and lets the agent decide how to get there. The agent plans actions dynamically, iterates through observe-act-revise loops, and recovers when something breaks.
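
The observe-act-revise loop can be sketched in a few lines. This is a minimal illustration, not a prescribed API: `extract` and `validate` are hypothetical stand-ins for a model call and a schema check, stubbed here with toy functions so the sketch runs.

```python
def run_agent(document, extract, validate, max_attempts=3):
    """Retry extraction, feeding validation errors into the next attempt."""
    feedback = None
    for _ in range(max_attempts):
        result = extract(document, feedback)   # act
        errors = validate(result)              # observe
        if not errors:
            return result                      # goal reached
        feedback = errors                      # revise: errors steer the retry
    return None                                # give up; flag for human review

# Toy stand-ins: an extractor that succeeds once it receives feedback.
def toy_extract(doc, feedback):
    return {"revenue": 100.0} if feedback else {"revenue": None}

def toy_validate(result):
    return [] if result["revenue"] is not None else ["revenue is missing"]
```

The point of the sketch is the `feedback` variable: the agent's next attempt is conditioned on what went wrong last time, which is exactly what a fixed script cannot do.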

SEC filing extraction is a natural fit for this pattern. Inputs vary widely between filers: table layouts differ, section boundaries shift, and formatting is unpredictable. A rule-based pipeline handles repetitive, predictable structures well, but fails silently on the ambiguity and variation inherent in real-world filings. An agentic pipeline recognises extraction failures, reasons about what went wrong, re-attempts with a different strategy, and flags what it still can't resolve. Each stage — fetch, classify, extract, validate — requires different capabilities, and the correction loop between stages catches errors that a linear pipeline would propagate silently.

Why SEC Filings Resist Traditional Extraction

The core problem is format inconsistency. Filings arrive as HTML-wrapped XBRL, embedded PDFs, or hybrid documents where tagging quality depends entirely on the filer. XBRL helps — but only for financial statements. The SEC does not currently require XBRL tagging for MD&A, risk factors, or most narrative sections, which means the richest qualitative data in a 10-K remains unstructured text.

Regex and rule-based pipelines work until they don't. Section boundaries shift between companies and filing years. Tables use merged cells and inline footnotes that defeat simple HTML parsers, especially when they span page breaks. At scale, every unanticipated edge case becomes a data quality hole.

Choosing Your Stack

Here's an opinionated stack that balances capability against cost.

For ingestion and parsing, use SEC-API or direct EDGAR full-text search to fetch filings. For PDF-heavy filings, tools like marker or docling convert documents to markdown with higher fidelity than generic PDF parsers — they preserve table structure and section hierarchy that tools like PyPDF lose.
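
Fetching directly from EDGAR is a couple of stdlib calls. One detail worth knowing: the SEC asks automated clients to identify themselves in the `User-Agent` header. The URL path and the contact string below are placeholders — substitute your own application name and the filing you want.

```python
import urllib.request

# The SEC expects automated clients to declare who they are;
# this value is a placeholder.
SEC_USER_AGENT = "ExampleApp contact@example.com"

def build_request(url, user_agent=SEC_USER_AGENT):
    """Build an EDGAR request carrying the required User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_filing(url):
    """Download one filing document (a URL under /Archives/edgar/...)."""
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```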

For the LLM backbone, use a capable large language model with a long context window (128k+ tokens) for extraction, and a smaller open-weight model for simpler classification steps to manage costs. At current pricing, processing a full 10-K typically runs well under $1 per filing in API costs.

For table-heavy pages, use a multimodal LLM. Render the table region as an image and pass it to a vision-language model, which can interpret spatial layouts that text-only parsers miss. Current benchmarks show that vision LLMs are improving on table recognition tasks, though traditional specialized models still hold an edge on complex structures (Zhou et al., 2024).

For agent orchestration, build your logic using a lightweight orchestration layer — this can be as simple as a Python script that manages task handoff, tool calls, and retry logic. The requirements are sequential task routing, tool integration, and feeding validation errors back into the agent's reasoning loop. A working agentic pipeline can be defined in under 50 lines of orchestration code.
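
To make "under 50 lines" concrete, here is one way such a layer might look: named stages run in order over shared state, and a stage failure is recorded rather than fatal. The stage names and state shape are illustrative choices, not a prescribed API.

```python
def run_pipeline(filing_text, stages):
    """Run named stages in sequence over shared state, logging failures."""
    state = {"raw": filing_text, "errors": []}
    for name, stage in stages:
        try:
            state = stage(state)
        except Exception as exc:
            state["errors"].append(f"{name}: {exc}")  # record and continue
    return state

# Toy stages standing in for classify and extract.
stages = [
    ("classify", lambda s: {**s, "section": "Item 8"}),
    ("extract",  lambda s: {**s, "data": {"revenue": 1000.0}}),
]
```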

For validation, define Pydantic schemas that enforce your target JSON structure. Schema validation is the single most effective guard against malformed output.
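
A minimal sketch of what such a schema looks like, assuming Pydantic v2 is installed. The model and field names are illustrative; the key behaviour is that well-formed output validates (with numeric strings coerced) while malformed output raises a structured error the agent can act on.

```python
from pydantic import BaseModel, ValidationError

class IncomeStatement(BaseModel):
    fiscal_year: int
    revenue: float
    net_income: float

# Well-formed LLM output validates; numeric strings are coerced to float.
stmt = IncomeStatement.model_validate(
    {"fiscal_year": 2023, "revenue": "1000000", "net_income": 250000}
)

# Malformed output raises a ValidationError with per-field detail.
try:
    IncomeStatement.model_validate(
        {"fiscal_year": 2023, "revenue": "n/a", "net_income": 0}
    )
except ValidationError as exc:
    failed_fields = [e["loc"] for e in exc.errors()]
```

It is that per-field error detail — not just a pass/fail flag — that makes the self-correction loop in Step 4 work.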

The Pipeline, Step by Step

Step 1: Ingest and Chunk the Filing

Fetch the filing from EDGAR, strip boilerplate headers and footers, then chunk semantically by section rather than by fixed token windows. The critical rule: never split mid-table. A table split across two chunks loses its header context and becomes uninterpretable to the extraction agent.
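
A simple sketch of section-aware chunking for markdown-converted filing text, under the assumption that sections open with "Item N" headings and table rows begin with a pipe. The heading pattern is an illustrative heuristic, but it shows the critical rule in code: never start a new chunk while the previous line is a table row.

```python
import re

ITEM_HEADING = re.compile(r"(?i)^item\s+\d+[a-z]?\b")

def chunk_by_section(text):
    """Split at 'Item N' headings, but never inside a markdown table."""
    chunks, current = [], []
    for line in text.splitlines():
        starts_section = bool(ITEM_HEADING.match(line.strip()))
        inside_table = bool(current) and current[-1].lstrip().startswith("|")
        if starts_section and current and not inside_table:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```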

Step 2: Agent-Driven Section Classification

The classification agent receives each chunk and determines which SEC section it belongs to — Item 1, 1A, 7, 8, and so on. Unlike a rule-based classifier that matches keywords against a fixed lookup table, the agent uses few-shot examples to reason about ambiguous boundaries. When it encounters a combined item or an amended filing (10-K/A) that breaks standard patterns, it can infer the correct classification from context rather than failing silently. The output is a section map — a JSON index that downstream steps consume.
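
One possible shape for that section map — the field names are illustrative, but the idea is one entry per chunk carrying the agent's label and a confidence score, indexable by downstream steps:

```python
# Illustrative section map emitted by the classification agent.
section_map = {
    "form": "10-K",
    "sections": [
        {"chunk_id": 0, "item": "1",  "title": "Business",     "confidence": 0.98},
        {"chunk_id": 1, "item": "1A", "title": "Risk Factors", "confidence": 0.94},
    ],
}

# Downstream extraction steps look up chunks by item number.
chunk_for_item = {s["item"]: s["chunk_id"] for s in section_map["sections"]}
```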

Step 3: Table Extraction with Vision Models

For pages with complex tables, render the table region as an image and pass it to a vision-language model. Instruct the model to output a markdown table, then parse that to JSON. This two-step approach (image to markdown to JSON) gives the model a structured intermediate representation to work from, reducing the chance of layout-related errors in the final output. For simpler HTML-based tables, text extraction is faster and cheaper — use vision models only where text-based methods fail.
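
The markdown-to-JSON half of that two-step approach is plain parsing. A sketch for simple pipe-delimited tables — it assumes one header row followed by a `|---|` separator, and expects the vision model to have already flattened merged cells into that form:

```python
def markdown_table_to_rows(md):
    """Parse a simple pipe-delimited markdown table into row dicts."""
    lines = [l for l in md.strip().splitlines() if l.strip().startswith("|")]
    cells = [[c.strip() for c in l.strip().strip("|").split("|")] for l in lines]
    header, body = cells[0], cells[2:]   # cells[1] is the |---| separator row
    return [dict(zip(header, row)) for row in body]
```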

Step 4: Schema Validation and Self-Correction

Schema validation with self-correction is the most impactful step in this pipeline. Define Pydantic models for your target structures — IncomeStatement, RiskFactor, RevenueSegment — and validate the LLM's output against them. When validation fails, the agent receives the specific validation errors, reasons about what went wrong, and re-attempts extraction with an adjusted approach. The agent adapts on its own — iterating through observe-act-revise cycles until the output validates. Cap retries at two: if the agent can't produce valid output after two correction passes, flag the field for human review.
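
The correction loop itself might look like the sketch below. `call_llm` and `validate` are hypothetical stand-ins for your model call and a Pydantic check; the model is stubbed here (failing once, then succeeding after seeing the errors) so the sketch runs.

```python
MAX_RETRIES = 2  # after two failed correction passes, escalate to a human

def extract_with_correction(chunk, call_llm, validate):
    """Re-prompt with the specific validation errors until output validates."""
    prompt = f"Extract the income statement as JSON:\n{chunk}"
    candidate = None
    for _ in range(MAX_RETRIES + 1):
        candidate = call_llm(prompt)
        errors = validate(candidate)
        if not errors:
            return {"data": candidate, "needs_review": False}
        # Feed the exact failure back into the next prompt.
        prompt += f"\nPrevious output failed validation: {errors}. Fix and retry."
    return {"data": candidate, "needs_review": True}

# Stubbed model: fails once, then corrects itself.
answers = iter([{"revenue": "n/a"}, {"revenue": 1000.0}])
result = extract_with_correction(
    "sample chunk",
    call_llm=lambda p: next(answers),
    validate=lambda c: [] if isinstance(c["revenue"], float)
                          else ["revenue: not a number"],
)
```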

Recent work validates this pattern. The PARSE framework (Shrimal et al., 2025) combines schema optimization with reflection-based extraction and reports a 92% reduction in extraction errors within the first retry, with extraction accuracy improvements of up to 64.7% over baselines on web data extraction benchmarks.

Step 5: Assemble and Handle Edge Cases

The final output is a JSON document combining metadata, classified sections, and extracted financials. Build in explicit handling for edge cases: foreign private issuers filing 20-Fs, amended filings (10-K/A), non-standard fiscal years, and multi-segment filers. Log a confidence score per extracted field — it's the cheapest way to prioritise human review where it matters most.
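
A sketch of the assembly step, with illustrative field names and an assumed 0.8 review threshold: combine the pipeline's outputs and queue the least-confident fields first, so reviewers spend their time where it matters.

```python
def assemble(metadata, section_map, financials, confidence):
    """Combine pipeline outputs; queue low-confidence fields for review."""
    return {
        "metadata": metadata,
        "sections": section_map,
        "financials": financials,
        "field_confidence": confidence,
        # Fields below the (illustrative) 0.8 threshold, least confident first.
        "review_queue": sorted(
            (f for f, c in confidence.items() if c < 0.8), key=confidence.get
        ),
    }

doc = assemble(
    {"form": "10-K", "fiscal_year": 2023},
    [{"item": "8", "title": "Financial Statements"}],
    {"revenue": 1000.0, "net_income": 250.0},
    {"revenue": 0.97, "net_income": 0.61},
)
```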

What This Doesn't Solve Yet

This pipeline handles single-filing extraction. Cross-filing temporal analysis — comparing revenue segments year-over-year across multiple 10-Ks — requires an additional reconciliation layer. Real-time streaming of new filings is an engineering problem, not an extraction one. And nothing here constitutes legal or compliance advice: this is a data extraction tool, not a regulatory opinion.

The self-correction loop is what makes this work at scale: a filing that defeats a static parser gets a second pass with an adjusted extraction strategy, rather than silently producing garbage.

Interested in fast, accurate data extraction from financial statements without the hassle? Financial Statements AI has everything you need. Sign up here for a free trial.

Author: Martin Goodson is a former Oxford University scientific researcher and has led AI research at several organisations. He is a member of the advisory group for the University College London generative AI Hub. In 2019, he was elected Chair of the Data Science and AI Section of the Royal Statistical Society, the membership group representing professional data scientists in the UK. Martin is the CEO of the multiple award-winning data extraction firm Evolution AI. He also leads the London Machine Learning Meetup, the largest AI & machine learning community in Europe.
