FinanceBench: Financial QA Benchmark
- FinanceBench is a comprehensive benchmark for financial question answering, built from over 10,000 verified QA triplets grounded in U.S. public filings.
- It integrates detailed annotations across extraction, numerical, and logical reasoning tasks to assess LLM performance with real evidence.
- The benchmark highlights LLM challenges including hallucinations and high refusal rates, emphasizing the need for robust retrieval systems and precise prompt strategies.
FinanceBench is an open test suite designed to evaluate LLMs on financial question answering (QA) tasks that require retrieving, interpreting, and reasoning over public company filings. Its core aim is to serve as a high-fidelity diagnostic for LLM capabilities in domains where accuracy, citation to primary evidence, and robustness to hallucination are essential. By focusing on ecologically valid QA scenarios rooted in real disclosures from listed firms, FinanceBench sets a practical minimum performance standard for LLM deployment in enterprise finance settings (Islam et al., 2023).
1. Dataset Scope and Structure
FinanceBench comprises 10,231 question–answer–evidence triplets centered on 40 U.S.-listed companies spanning nine of eleven GICS sectors, with all evidence drawn exclusively from public filings. The source documents include:
- 270 10-Ks (75%)
- 27 10-Qs
- 29 8-Ks
- 29 earnings releases
- 5 annual reports
Questions are free-text, directly answerable from specific filings, and always reference a precise evidence string—extracted verbatim from the relevant PDF filing and indexed by source page number. This design enforces traceability and answer accountability.
The benchmark covers a diverse set of company sizes (market capitalizations from ~$1.8B to over $2.7T), ensuring broad applicability across the U.S. equity landscape.
2. Question Taxonomy and Generation Methods
FinanceBench questions fall into three major categories:
| Category | Questions | Description |
|---|---|---|
| Information extraction | 2,493 | Direct value or span lookups |
| Numerical reasoning | 5,897 | Ratios, arithmetic, comparisons |
| Logical reasoning | 518 | Qualitative or multi-step logic |
- Information extraction: Require retrieval of a specific value or text span from the document (e.g., “What is net income in FY2022?”).
- Numerical reasoning: Require calculation or synthesis (e.g., margins, multi-year growth rates), further subtyped into "single-statement" and "multi-statement" questions depending on whether cross-statement data blending is called for (a short worked arithmetic example follows this list).
- Logical reasoning: Demand qualitative, judgmental, or multi-hop inference.
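To make the numerical-reasoning category concrete, the short sketch below works through two representative calculations (a gross margin and a year-over-year revenue growth rate) of the kind such questions require. The figures are invented for illustration and do not come from any FinanceBench filing.

```python
# Illustrative arithmetic only -- the figures below are made up, not taken
# from any filing in FinanceBench.
revenue_fy2021 = 51_000.0          # USD millions
revenue_fy2022 = 57_500.0          # USD millions
cost_of_revenue_fy2022 = 33_400.0  # USD millions

# Single-statement numerical reasoning: gross margin from the income statement.
gross_margin = (revenue_fy2022 - cost_of_revenue_fy2022) / revenue_fy2022
print(f"FY2022 gross margin: {gross_margin:.1%}")              # ~41.9%

# Multi-year reasoning: one-year revenue growth rate.
revenue_growth = revenue_fy2022 / revenue_fy2021 - 1
print(f"FY2021->FY2022 revenue growth: {revenue_growth:.1%}")  # ~12.7%
```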
Generation protocols include:
- 25 domain-relevant standardized question templates, posed for 37 companies (925 questions)
- 1,323 analyst-authored, company-specific novel questions
- 7,983 programmatically templated metric-driven questions derived from 18 core metrics across 8 years/statement sections
Each entry encodes (i) the full question, (ii) a gold answer (numeric/textual/boolean), and (iii) a supporting evidence string with location metadata. Questions are structured to be clear-cut and unambiguous, with explicit instructions when necessary (e.g., “Answer in USD millions”).
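A minimal sketch of how one such triplet could be represented in code is shown below; the field names are illustrative and do not necessarily match the exact keys used in the released files.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class FinanceBenchEntry:
    """One question-answer-evidence triplet (illustrative field names)."""
    company: str                      # e.g., "3M"
    doc_name: str                     # source filing identifier, e.g., a FY2022 10-K
    question_type: str                # "information extraction" | "numerical reasoning" | "logical reasoning"
    question: str                     # free-text question, possibly with unit instructions
    answer: Union[str, float, bool]   # gold answer: numeric, textual, or boolean
    evidence_text: str                # verbatim string extracted from the source PDF
    evidence_page: int                # page number of the evidence in the filing
```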
3. Annotation, Quality Assurance, and Evidence Protocol
Data annotation was performed by 20 finance-qualified annotators, following documented onboarding and screening tests. Key quality control measures:
- Specialist annotators extracted base metrics at scale (>2,300 extractions by individuals)
- Senior analyst sampled ~10% of annotated items for review, with regular calibration
- Correction and feedback cycles ensured conformance and reduced annotation drift
- Evidence strings are always extracted verbatim from source PDFs, providing precise citation and reducing ambiguity
No inter-annotator agreement statistics (e.g., Cohen's κ) are reported, but spot checks and multi-pass review are performed across 10% of domain-relevant and novel entries.
4. Data Access, Format, and Usage
FinanceBench is released as a test-suite benchmark—i.e., not for direct model training, but as an evaluation bed. The full dataset and a 150-case, human-annotated evaluation subset are available under a permissive MIT-style license at https://github.com/patronus-ai/financebench.
- Each entry includes: question, gold answer, evidence string, and page reference
- File format: structured JSON records with supporting documentation ("datasheet"); a loading sketch follows this list
- Splits: The dataset is structured as a testbed; a balanced human-evaluation sample of 150 questions is provided spanning all categories and generation modes
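Assuming the released records are JSON Lines with fields along the lines of those sketched above (the exact file name and keys should be checked against the repository's datasheet), loading and inspecting the data might look like this:

```python
import json
from collections import Counter

# Assumed file name -- verify against the repository before use.
PATH = "financebench_open_source.jsonl"

entries = []
with open(PATH, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            entries.append(json.loads(line))

print(f"Loaded {len(entries)} entries")
# Assumed key name for the question category; adjust to the actual schema.
print(Counter(e.get("question_type", "unknown") for e in entries))
```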
5. Evaluation Protocols and Model Baselines
Evaluation uses full-answer correctness (accuracy) as the primary metric; incorrect-answer and failure-to-answer (“refusal”) rates are tracked separately.
No extraction-style span-F1 or partial-overlap metrics are used; only full-answer correctness counts.
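Under this protocol, scoring reduces to classifying each model output as correct, incorrect, or a refusal. The sketch below assumes a correctness judgment for each answer is already available and uses a naive string heuristic for refusal detection; both are simplifying assumptions, not the benchmark's official grading procedure.

```python
from typing import Iterable

# Heuristic refusal markers -- an assumption, not the official rule.
REFUSAL_MARKERS = ("i don't know", "cannot answer", "unable to determine",
                   "not enough information")

def is_refusal(answer: str) -> bool:
    a = answer.lower()
    return (not a.strip()) or any(m in a for m in REFUSAL_MARKERS)

def score(outputs: Iterable[str], judged_correct: Iterable[bool]) -> dict:
    """Return correct / incorrect / refusal rates over a set of graded answers."""
    n = correct = refused = 0
    for out, ok in zip(outputs, judged_correct):
        n += 1
        if is_refusal(out):
            refused += 1
        elif ok:
            correct += 1
    incorrect = n - correct - refused
    return {"correct": correct / n, "incorrect": incorrect / n, "refusal": refused / n}
```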
Baseline configurations include:
- Closed-book LLMs (no retrieval)
- Oracle (full evidence pages provided directly)
- Single-document vector store per filing (Chroma + OpenAI embeddings; a retrieval sketch follows this list)
- Shared vector store over all filings (less practical for latency and context reasons)
- Long-context models (e.g., GPT-4-Turbo with a 128k context window, Claude 2 with a 100k context window)
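As an illustration of the per-filing vector-store setup, the sketch below chunks one filing, embeds the chunks with the OpenAI embeddings API, and stores them in a Chroma collection keyed to that document. The chunk size and embedding model name are assumptions, not the paper's exact configuration.

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()           # reads OPENAI_API_KEY from the environment
chroma_client = chromadb.Client()  # in-memory Chroma instance

def embed(texts):
    # Embedding model choice is an assumption; the baseline used OpenAI embeddings with Chroma.
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def index_filing(doc_id: str, filing_text: str, chunk_size: int = 2000):
    """Build one vector store per filing, as in the per-file baseline."""
    chunks = [filing_text[i:i + chunk_size] for i in range(0, len(filing_text), chunk_size)]
    coll = chroma_client.create_collection(name=doc_id)
    coll.add(ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
             documents=chunks,
             embeddings=embed(chunks))
    return coll

def retrieve(coll, question: str, k: int = 5):
    hits = coll.query(query_embeddings=embed([question]), n_results=k)
    return hits["documents"][0]    # top-k chunks for the question
```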
Performance across these baselines reveals:
| Configuration | Correct | Incorrect | Failures |
|---|---|---|---|
| Closed-book (GPT-4T) | 9% | 3% | 88% |
| Shared vector store | 19% | 13% | 68% |
| Per-file store | 50% | 11% | 39% |
| Long-context (GPT-4T) | 79% | 17% | 4% |
| Oracle (GPT-4T) | 85% | 15% | 0% |
Full results on 8 “realistic” configurations show overall: Correct 47%, Incorrect 26%, Failures 27%.
6. Model Limitations and Failure Analysis
FinanceBench’s design exposes core weaknesses in current LLM capabilities:
- Hallucinations: Models frequently generate plausible but factually incorrect or unsubstantiated answers, even when prompted to provide evidence from filings
- High refusal rates: Most model configurations simply abstain (“refuse to answer”) when retrieval fails, especially in closed-book or shared-vector-store settings
- Latency and cost bottlenecks: Configurations relying on per-document or long-context retrieval are computationally prohibitive for enterprise environments and cannot scale to larger filings
- Prompt-ordering impact: Placing context before the question in “long context” prompts can swing GPT-4-Turbo accuracy from 25% to 78%, underscoring acute context management sensitivity
No models tested demonstrate full robustness against these failure modes—factually incorrect answers and high nonanswer rates undermine enterprise suitability.
7. Recommendations and Broader Implications
FinanceBench enables quantitative benchmarking and diagnosis of LLM QA capabilities in financial domains. Deployment recommendations include:
- Always augment LLMs with a reliable, low-latency retrieval system (vector store, chunked search, or long-context if feasible)
- Structure prompts with “context first” placement to minimize information loss (a prompt sketch illustrating this, together with a low temperature setting, follows this list)
- Set low generation temperature (e.g., 0.01) to suppress randomness in critical applications
- Institute refusal detection protocols and human-in-the-loop review for high-stakes cases
- Model triangulation: Use outputs from multiple models and retrieval strategies to cross-validate answers
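A minimal sketch of the context-first prompt structure combined with a near-zero temperature, using the OpenAI chat API, is given below. The prompt wording and model name are assumptions rather than the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()

def answer_with_context_first(question: str, evidence_chunks: list[str]) -> str:
    # Context is placed BEFORE the question, the ordering reported to perform best.
    context = "\n\n".join(evidence_chunks)
    prompt = (
        "You are a financial analyst. Use only the filing excerpts below.\n\n"
        f"Filing excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "If the excerpts do not contain the answer, say you cannot answer."
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",   # assumed model name
        temperature=0.01,      # low temperature to suppress randomness
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```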
FinanceBench limitations include its exclusive focus on single-firm, single-turn QA over public filings (i.e., not supporting multi-firm/multi-turn dialogues or private data). Complex, multi-step analytical or generative tasks are not covered.
A plausible implication is that while FinanceBench sets a credible baseline for factual LLM performance in standard financial QA, further benchmarks will be required to cover more open-ended, interleaved, or multi-filing real-world use cases. The inclusion of the dataset in broader financial NLP benchmarks (e.g., FinMTEB (Tang et al., 16 Feb 2025)) reflects its growing influence on the evaluation protocols for retrieval, classification, and embedding quality in finance-specific language processing pipelines.