FinanceBench: Financial QA Benchmark
- FinanceBench is a comprehensive benchmark for financial question answering, built from over 10,000 verified QA triplets grounded in U.S. public filings.
- It integrates detailed annotations across extraction, numerical, and logical reasoning tasks to assess LLM performance with real evidence.
- The benchmark highlights LLM challenges including hallucinations and high refusal rates, emphasizing the need for robust retrieval systems and precise prompt strategies.
FinanceBench is an open test suite designed to evaluate LLMs on financial question answering (QA) tasks that require retrieving, interpreting, and reasoning over public company filings. Its core aim is to serve as a high-fidelity diagnostic for LLM capabilities in domains where accuracy, citation to primary evidence, and robustness to hallucination are essential. By focusing on ecologically valid QA scenarios rooted in real disclosures from listed firms, FinanceBench sets a practical minimum performance standard for LLM deployment in enterprise finance settings (Islam et al., 2023).
1. Dataset Scope and Structure
FinanceBench comprises 10,231 question–answer–evidence triplets centered on 40 U.S.-listed companies spanning nine of eleven GICS sectors, with all evidence drawn exclusively from public filings. The source documents include:
- 270 10-Ks (75%)
- 27 10-Qs
- 29 8-Ks
- 29 earnings releases
- 5 annual reports
Questions are free-text, directly answerable from specific filings, and always reference a precise evidence string—extracted verbatim from the relevant PDF filing and indexed by source page number. This design enforces traceability and answer accountability.
The benchmark covers a diverse set of company sizes (market capitalizations from ~$1.8B to over $2.7T), ensuring broad applicability across the U.S. equity landscape.
2. Question Taxonomy and Generation Methods
FinanceBench questions fall into three major categories:
| Category | Questions | Description |
|---|---|---|
| Information extraction | 2,493 | Direct value or span lookups |
| Numerical reasoning | 5,897 | Ratios, arithmetic, comparisons |
| Logical reasoning | 518 | Qualitative or multi-step logic |
- Information extraction: Require retrieval of a specific value or text span from the document (e.g., “What is net income in FY2022?”).
- Numerical reasoning: Require calculation or synthesis (e.g., margins, multi-year growth rates), further subtyped into "single-statement" and "multi-statement" questions depending on whether cross-statement data blending is called for (a short worked arithmetic example follows this list).
- Logical reasoning: Demand qualitative, judgmental, or multi-hop inference.
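To make the numerical-reasoning category concrete, the short sketch below works through two representative calculations (a gross margin and a year-over-year revenue growth rate) of the kind such questions require. The figures are invented for illustration and do not come from any FinanceBench filing.

```python
# Illustrative arithmetic only -- the figures below are made up, not taken
# from any filing in FinanceBench.
revenue_fy2021 = 51_000.0          # USD millions
revenue_fy2022 = 57_500.0          # USD millions
cost_of_revenue_fy2022 = 33_400.0  # USD millions

# Single-statement numerical reasoning: gross margin from the income statement.
gross_margin = (revenue_fy2022 - cost_of_revenue_fy2022) / revenue_fy2022
print(f"FY2022 gross margin: {gross_margin:.1%}")              # ~41.9%

# Multi-year reasoning: one-year revenue growth rate.
revenue_growth = revenue_fy2022 / revenue_fy2021 - 1
print(f"FY2021->FY2022 revenue growth: {revenue_growth:.1%}")  # ~12.7%
```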
Generation protocols include:
- 25 domain-relevant standardized question templates, posed for 37 companies (925 questions)
- 1,323 analyst-authored, company-specific novel questions
- 7,983 programmatically templated metric-driven questions derived from 18 core metrics across 8 years/statement sections
Each entry encodes (i) the full question, (ii) a gold answer (numeric/textual/boolean), and (iii) a supporting evidence string with location metadata. Questions are structured to be clear-cut and unambiguous, with explicit instructions when necessary (e.g., “Answer in USD millions”).
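A minimal sketch of how one such triplet could be represented in code is shown below; the field names are illustrative and do not necessarily match the exact keys used in the released files.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class FinanceBenchEntry:
    """One question-answer-evidence triplet (illustrative field names)."""
    company: str                      # e.g., "3M"
    doc_name: str                     # source filing identifier, e.g., a FY2022 10-K
    question_type: str                # "information extraction" | "numerical reasoning" | "logical reasoning"
    question: str                     # free-text question, possibly with unit instructions
    answer: Union[str, float, bool]   # gold answer: numeric, textual, or boolean
    evidence_text: str                # verbatim string extracted from the source PDF
    evidence_page: int                # page number of the evidence in the filing
```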
3. Annotation, Quality Assurance, and Evidence Protocol
Data annotation was performed by 20 finance-qualified annotators, following documented onboarding and screening tests. Key quality control measures:
- Specialist annotators extracted base metrics at scale (>2,300 extractions by individuals)
- Senior analyst sampled ~10% of annotated items for review, with regular calibration
- Correction and feedback cycles ensured conformance and reduced annotation drift
- Evidence strings are always extracted verbatim from source PDFs, providing precise citation and reducing ambiguity
No inter-annotator agreement statistics (e.g., Cohen's κ) are reported, but spot checks and multi-pass review are performed across 10% of domain-relevant and novel entries.
4. Data Access, Format, and Usage
FinanceBench is released as a test-suite benchmark—i.e., not for direct model training, but as an evaluation bed. The full dataset and a 150-case, human-annotated evaluation subset are available under a permissive MIT-style license at https://github.com/patronus-ai/financebench.
- Each entry includes: question, gold answer, evidence string, and page reference
- File format: structured JSON records with supporting documentation ("datasheet"); a loading sketch follows this list
- Splits: The dataset is structured as a testbed; a balanced human-evaluation sample of 150 questions is provided spanning all categories and generation modes
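Assuming the released records are JSON Lines with fields along the lines of those sketched above (the exact file name and keys should be checked against the repository's datasheet), loading and inspecting the data might look like this:

```python
import json
from collections import Counter

# Assumed file name -- verify against the repository before use.
PATH = "financebench_open_source.jsonl"

entries = []
with open(PATH, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            entries.append(json.loads(line))

print(f"Loaded {len(entries)} entries")
# Assumed key name for the question category; adjust to the actual schema.
print(Counter(e.get("question_type", "unknown") for e in entries))
```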
5. Evaluation Protocols and Model Baselines
Evaluation uses full-answer correctness (accuracy) as the primary metric; incorrect-answer and failure-to-answer (“refusal”) rates are tracked separately.
No extraction-style span-F1 or partial-overlap metrics are used; only full-answer correctness counts.
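Under this protocol, scoring reduces to classifying each model output as correct, incorrect, or a refusal. The sketch below assumes a correctness judgment for each answer is already available and uses a naive string heuristic for refusal detection; both are simplifying assumptions, not the benchmark's official grading procedure.

```python
from typing import Iterable

# Heuristic refusal markers -- an assumption, not the official rule.
REFUSAL_MARKERS = ("i don't know", "cannot answer", "unable to determine",
                   "not enough information")

def is_refusal(answer: str) -> bool:
    a = answer.lower()
    return (not a.strip()) or any(m in a for m in REFUSAL_MARKERS)

def score(outputs: Iterable[str], judged_correct: Iterable[bool]) -> dict:
    """Return correct / incorrect / refusal rates over a set of graded answers."""
    n = correct = refused = 0
    for out, ok in zip(outputs, judged_correct):
        n += 1
        if is_refusal(out):
            refused += 1
        elif ok:
            correct += 1
    incorrect = n - correct - refused
    return {"correct": correct / n, "incorrect": incorrect / n, "refusal": refused / n}
```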
Baseline configurations include:
- Closed-book LLMs (no retrieval)
- Oracle (full evidence pages provided directly)
- Single-document vector store per filing (Chroma + OpenAI embeddings; a retrieval sketch follows this list)
- Shared vector store over all filings (less practical for latency and context reasons)
- Long-context models (e.g., GPT-4-Turbo with a 128k context window, Claude 2 with a 100k context window)
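As an illustration of the per-filing vector-store setup, the sketch below chunks one filing, embeds the chunks with the OpenAI embeddings API, and stores them in a Chroma collection keyed to that document. The chunk size and embedding model name are assumptions, not the paper's exact configuration.

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()           # reads OPENAI_API_KEY from the environment
chroma_client = chromadb.Client()  # in-memory Chroma instance

def embed(texts):
    # Embedding model choice is an assumption; the baseline used OpenAI embeddings with Chroma.
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def index_filing(doc_id: str, filing_text: str, chunk_size: int = 2000):
    """Build one vector store per filing, as in the per-file baseline."""
    chunks = [filing_text[i:i + chunk_size] for i in range(0, len(filing_text), chunk_size)]
    coll = chroma_client.create_collection(name=doc_id)
    coll.add(ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
             documents=chunks,
             embeddings=embed(chunks))
    return coll

def retrieve(coll, question: str, k: int = 5):
    hits = coll.query(query_embeddings=embed([question]), n_results=k)
    return hits["documents"][0]    # top-k chunks for the question
```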
Performance across these baselines reveals:
| Configuration | Correct | Incorrect | Failures |
|---|---|---|---|
| Closed-book (GPT-4T) | 9% | 3% | 88% |
| Shared vector store | 19% | 13% | 68% |
| Per-file store | 50% | 11% | 39% |
| Long-context (GPT-4T) | 79% | 17% | 4% |
| Oracle (GPT-4T) | 85% | 15% | 0% |
Full results on 8 “realistic” configurations show overall: Correct 47%, Incorrect 26%, Failures 27%.
6. Model Limitations and Failure Analysis
FinanceBench’s design exposes core weaknesses in current LLM capabilities:
- Hallucinations: Models frequently generate plausible but factually incorrect or unsubstantiated answers, even when prompted to provide evidence from filings
- High refusal rates: Most model configurations simply abstain (“refuse to answer”) when retrieval fails, especially in closed-book or shared-vector-store settings
- Latency and cost bottlenecks: Configurations relying on per-document or long-context retrieval are computationally prohibitive for enterprise environments and cannot scale to larger filings
- Prompt-ordering impact: Placing context before the question in “long context” prompts can swing GPT-4-Turbo accuracy from 25% to 78%, underscoring acute context management sensitivity
No models tested demonstrate full robustness against these failure modes—factually incorrect answers and high nonanswer rates undermine enterprise suitability.
7. Recommendations and Broader Implications
FinanceBench enables quantitative benchmarking and diagnosis of LLM QA capabilities in financial domains. Deployment recommendations include:
- Always augment LLMs with a reliable, low-latency retrieval system (vector store, chunked search, or long-context if feasible)
- Structure prompts with “context first” placement to minimize information loss (a prompt sketch illustrating this, together with a low temperature setting, follows this list)
- Set low generation temperature (e.g., 0.01) to suppress randomness in critical applications
- Institute refusal detection protocols and human-in-the-loop review for high-stakes cases
- Model triangulation: Use outputs from multiple models and retrieval strategies to cross-validate answers
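A minimal sketch of the context-first prompt structure combined with a near-zero temperature, using the OpenAI chat API, is given below. The prompt wording and model name are assumptions rather than the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()

def answer_with_context_first(question: str, evidence_chunks: list[str]) -> str:
    # Context is placed BEFORE the question, the ordering reported to perform best.
    context = "\n\n".join(evidence_chunks)
    prompt = (
        "You are a financial analyst. Use only the filing excerpts below.\n\n"
        f"Filing excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "If the excerpts do not contain the answer, say you cannot answer."
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",   # assumed model name
        temperature=0.01,      # low temperature to suppress randomness
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```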
FinanceBench limitations include its exclusive focus on single-firm, single-turn QA over public filings (i.e., not supporting multi-firm/multi-turn dialogues or private data). Complex, multi-step analytical or generative tasks are not covered.
A plausible implication is that while FinanceBench sets a credible baseline for factual LLM performance in standard financial QA, further benchmarks will be required to cover more open-ended, interleaved, or multi-filing real-world use cases. The inclusion of the dataset in broader financial NLP benchmarks (e.g., FinMTEB (Tang et al., 16 Feb 2025)) reflects its growing influence on the evaluation protocols for retrieval, classification, and embedding quality in finance-specific language processing pipelines.