FinanceBench: Financial QA Benchmark
- FinanceBench is a large-scale, ecologically valid QA benchmark comprising 10,231 expert-annotated question-answer-evidence triplets from U.S. SEC filings.
- It features diverse question types—information lookup, numerical reasoning, and logical inference—mirroring real-world financial analysis workflows.
- The benchmark drives advances in retrieval strategies and document intelligence through metadata augmentation, embedding evaluations, and iterative reasoning techniques.
FinanceBench is a large-scale, ecologically valid benchmark for evaluating the question-answering capabilities of LLMs and retrieval-augmented generation (RAG) systems over real-world financial documents. Introduced by Islam et al. (2023), FinanceBench comprises expert-annotated questions, gold-standard answers, and evidence strings linked to U.S. SEC filings. It has emerged as a principal testbed for both open-domain and domain-adapted financial QA models, driving methodological advances in information retrieval, LLM instruction tuning, neurosymbolic hybrid architectures, and vision-enhanced document intelligence systems.
1. Dataset Definition, Construction, and Scope
FinanceBench was designed to provide a rigorous, transparent, and challenging “open-book” QA standard for financial documents (Islam et al., 2023). Its construction involved the curation and annotation of 10,231 question/answer/evidence triplets spanning 361 public company filings (10-Ks, 10-Qs, 8-Ks, earnings releases) from 2015–2023, covering 40 major U.S.-listed firms. Each entry includes:
- A human-generated question (e.g., “What is Boeing’s FY2022 cost of goods sold (in USD millions)?”)
- A gold-standard answer string
- An evidence span—typically a precise sentence or two from the filing, with an associated page number
- (Optionally) a free-text justification for chain-of-thought or multi-step calculation
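A representative record under this schema might be expressed as follows; this is an illustrative sketch, with field names that are not the dataset's official column names and placeholder values rather than real filing figures.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FinanceBenchRecord:
    """One question/answer/evidence triplet (illustrative field names)."""
    question: str                         # human-generated question
    answer: str                           # gold-standard answer string
    evidence_text: str                    # supporting sentence(s) from the filing
    evidence_page: int                    # page number of the evidence in the source document
    doc_name: str                         # source filing, e.g. a 10-K
    justification: Optional[str] = None   # optional free-text reasoning / calculation

# Shaped after the Boeing question quoted above; values are placeholders, not verified figures.
example = FinanceBenchRecord(
    question="What is Boeing's FY2022 cost of goods sold (in USD millions)?",
    answer="<gold numeric answer, USD millions>",        # placeholder, not the real figure
    evidence_text="<excerpt from the income statement>",
    evidence_page=58,                                    # illustrative page number
    doc_name="BOEING_2022_10K",
)
```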
Question types are drawn from authentic financial analysis workflows:
- Information extraction (lookup)
- Numerical reasoning (calculating financial/reporting metrics)
- Logical inference (fact synthesis, trend detection, or compliance assessment)
The question taxonomy includes: domain-relevant hand-crafted questions, realistic analyst-generated items, and metrics-generated queries reflecting both extractive and multi-step arithmetic patterns. Approximately 28% of questions demand only simple data extraction, 66% require numerical calculation, and 6% involve explicit logical reasoning. Around 85% of numerical questions are directly answerable from a single statement, while the remainder require aggregation across multiple report sections.
A commonly used open-source mini-benchmark is the 150-case subset, which is stratified by question type (50 each: domain-relevant, novel generated, metrics-generated) and is heavily referenced in contemporary QA research (Islam et al., 2023, Anderson et al., 2024, Dadopoulos et al., 28 Oct 2025).
2. Evaluation Protocols, Metrics, and Judging Paradigms
FinanceBench adopts a rigorous evaluation protocol. In most settings, the QA system is given either the full relevant document, a set of retrieved pages or chunks, or, in the oracle condition, the gold evidence pages (Islam et al., 2023). The output is compared against the gold-standard answer using deterministic or LLM-graded criteria.
Primary metrics:
- Accuracy: for N_total questions, Accuracy = N_correct / N_total, where N_correct is the number of responses judged correct. A response is "correct" only if it matches the gold answer in both values and units, with minor tolerance for rounding (see the grading sketch at the end of this section).
- For IR/RAG settings (a computational sketch follows this list):
- Precision@k and Recall@k over retrieved chunks/pages.
- MRR@k: Mean reciprocal rank of the first relevant chunk.
- DCG@k, NDCG@k: Discounted/Normalized Cumulative Gain for ranking retrieval candidates.
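These ranking metrics admit a short reference implementation. The sketch below assumes binary relevance labels over a ranked list of retrieved chunks and is not tied to any particular paper's evaluation code; averaging the per-query reciprocal rank across questions gives MRR@k.

```python
import math
from typing import List

def precision_at_k(relevant: List[int], k: int) -> float:
    """relevant[i] = 1 if the i-th ranked chunk is gold evidence, else 0."""
    return sum(relevant[:k]) / k if k else 0.0

def recall_at_k(relevant: List[int], k: int, n_gold: int) -> float:
    """Fraction of the n_gold evidence chunks recovered in the top k."""
    return sum(relevant[:k]) / n_gold if n_gold else 0.0

def reciprocal_rank_at_k(relevant: List[int], k: int) -> float:
    """Reciprocal rank of the first relevant chunk within the top k (0 if none)."""
    for rank, rel in enumerate(relevant[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevant: List[int], k: int) -> float:
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevant[:k], start=1))

def ndcg_at_k(relevant: List[int], k: int, n_gold: int) -> float:
    """DCG normalized by the ideal DCG (all gold chunks ranked first)."""
    ideal = dcg_at_k([1] * min(n_gold, k), k)
    return dcg_at_k(relevant, k) / ideal if ideal else 0.0

# Example: gold evidence appears at ranks 2 and 5 among 10 retrieved chunks.
rels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(recall_at_k(rels, 5, n_gold=2), reciprocal_rank_at_k(rels, 5), ndcg_at_k(rels, 5, n_gold=2))
```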
Advanced Judging:
- Several studies (notably those with LLM-based judges) employ a 1–10 scale, where three independent judge scores per answer are averaged and reported as percentages (Rajani et al., 2024).
Other metrics in research practice:
- Consistency (frequency of repeatable/correct outputs) (Luong et al., 2024)
- Faithfulness and Hallucination rates (claims supported by retrieved evidence) (Dadopoulos et al., 28 Oct 2025)
- ROUGE-L and cosine similarity (semantic matching, especially in fine-tuning studies or RAG-enhanced QA) (Zhang et al., 2024)
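Answer grading itself can be approximated by a deterministic check plus judge-score aggregation. The sketch below illustrates the two criteria described above, exact matching with rounding tolerance and averaging three 1–10 judge scores into a percentage; the tolerance threshold is chosen for illustration, and unit checking is omitted for brevity.

```python
import re
from statistics import mean
from typing import List, Optional

def parse_number(text: str) -> Optional[float]:
    """Pull the first numeric value out of an answer string such as '$4,392 million'."""
    m = re.search(r"-?\$?\d[\d,]*\.?\d*", text)
    return float(m.group().replace("$", "").replace(",", "")) if m else None

def is_correct(prediction: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Numeric answers must agree within a small rounding tolerance (illustrative 1%);
    non-numeric answers fall back to normalized string equality. Unit checks omitted."""
    p, g = parse_number(prediction), parse_number(gold)
    if p is not None and g is not None:
        return abs(p - g) <= rel_tol * max(abs(g), 1e-9)
    return prediction.strip().lower() == gold.strip().lower()

def judge_score_pct(scores_1_to_10: List[float]) -> float:
    """Average independent LLM-judge scores on a 1-10 scale and report as a percentage."""
    return mean(scores_1_to_10) / 10.0 * 100.0

print(is_correct("$4,390 million", "4,392"))   # True within 1% tolerance
print(judge_score_pct([8, 7, 9]))              # 80.0
```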
3. Retrieval-Augmented Generation (RAG) and Embedding Evaluations
FinanceBench is a core benchmark in the evaluation of RAG pipelines and finance-adapted embedding models. Its challenging nature stems from the sparsity and cross-referencing of supporting evidence in lengthy, multi-modal filings:
- Vector store evaluations: Systems are benchmarked under "shared vector store" (single global index for all filings) and "single vector store" (per-document index) modalities (Islam et al., 2023, Anderson et al., 2024).
- Embedding models: OpenAI ada-002, text-embedding-3-small/large, BGE, Multilingual-E5, and domain-finetuned bi-encoders are commonly evaluated (Anderson et al., 2024, Brenner et al., 8 Dec 2025).
- Domain-adapted embeddings: BAM (finetuned on 14.3M finance query-passage pairs), LLM-distilled bi-encoders, and hybrid retrieval (dense + sparse or metadata-augmented) approaches all show substantial accuracy or recall boosts—e.g., BAM yields an 8 percentage-point lift over ada-002 (55% vs. 47%) (Anderson et al., 2024), and LLM-distilled retrievers increase NDCG in three of four FinanceBench document classes (Brenner et al., 8 Dec 2025).
- Pre-/post-retrieval enhancements: Query rewriting, chunk metadata augmentation, pre-filtering using LLM-generated file summaries, and post-retrieval re-ranking (including cross-encoder and metadata rerankers) are shown to improve both retrieval fidelity and answer support (Dadopoulos et al., 28 Oct 2025).
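A minimal sketch of chunk metadata augmentation in this spirit: LLM-generated metadata (here company, fiscal period, and report section) is prepended to each chunk's text before embedding, so the retriever can match on context that the raw chunk omits. The `embed` callable is a stand-in for whichever embedding model a pipeline uses, not a specific API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str]  # e.g. {"company": "BOEING", "period": "FY2022", "section": "Income Statement"}

def augment_chunk(chunk: Chunk) -> str:
    """Prepend key-value metadata to the chunk text before embedding."""
    header = " | ".join(f"{k}: {v}" for k, v in sorted(chunk.metadata.items()))
    return f"{header}\n{chunk.text}"

def index_chunks(
    chunks: Sequence[Chunk],
    embed: Callable[[List[str]], List[List[float]]],  # stand-in embedding function
) -> List[List[float]]:
    """Embed metadata-augmented chunk strings; the vectors can then be stored in any vector index."""
    return embed([augment_chunk(c) for c in chunks])
```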
A comparison table of representative FinanceBench IR/RAG results:
| Model/Method | Accuracy or F1 (%) | NDCG | Comments |
|---|---|---|---|
| OpenAI ada-002 (shared vector store) | 47 (accuracy) | — | Baseline dual-encoder |
| BAM Embeddings (domain-adapted) | 55 (accuracy) | — | +8 pp over ada-002 |
| Metadata-RAG, best rerank + metadata | 44.4 (F1) | — | +35% relative over naive RAG |
| LLM-distilled retriever | — | 0.60 | Best on 10-Qs; +0.08 NDCG over base |
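The pre- and post-retrieval enhancements summarized above can be combined into a simple query-time pipeline. The sketch below assumes generic `dense_search`, `sparse_search`, and `cross_encoder_score` callables rather than any specific library, the fusion weight is illustrative, and candidate scores are assumed to be normalized to comparable ranges.

```python
from typing import Callable, Dict, List, Tuple

def hybrid_retrieve(
    query: str,
    dense_search: Callable[[str, int], List[Tuple[str, float]]],   # (chunk_id, normalized score)
    sparse_search: Callable[[str, int], List[Tuple[str, float]]],  # e.g. normalized BM25 scores
    k: int = 50,
    alpha: float = 0.5,  # illustrative dense/sparse mixing weight
) -> List[str]:
    """Merge dense and sparse candidate lists with a weighted score sum."""
    scores: Dict[str, float] = {}
    for cid, s in dense_search(query, k):
        scores[cid] = scores.get(cid, 0.0) + alpha * s
    for cid, s in sparse_search(query, k):
        scores[cid] = scores.get(cid, 0.0) + (1.0 - alpha) * s
    return sorted(scores, key=scores.get, reverse=True)

def rerank(
    query: str,
    candidate_ids: List[str],
    get_text: Callable[[str], str],
    cross_encoder_score: Callable[[str, str], float],  # stand-in cross-encoder
    top_k: int = 10,
) -> List[str]:
    """Re-score the merged candidates with a cross-encoder and keep the top_k."""
    scored = [(cid, cross_encoder_score(query, get_text(cid))) for cid in candidate_ids]
    return [cid for cid, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]]
```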
4. Advanced Architectures: LLM Fine-tuning, Instruction Tuning, and Hybrid Methods
Recent work demonstrates that end-to-end QA accuracy on FinanceBench is most significantly advanced by targeted fine-tuning of both the retrieval mechanism and the LLM generator (Nguyen et al., 2024):
- Supervised fine-tuning: Even with 10–50 example pairs from FinanceBench, fine-tuning both the LLM and the embedding model substantially boosts ROUGE-L and semantic similarity (e.g., LLaMA-2 ROUGE-L: 0.35 → 0.68; cosine similarity: 0.52 → 0.80) (Zhang et al., 2024).
- Instruction tuning vs. full model fine-tuning: Domain-specific, instruction-tuned LLMs (e.g., KodeX-70Bv0.1) now exceed GPT-4 on FinanceBench in LLM-as-a-judge scoring (79.7% vs. 77.8%) (Rajani et al., 2024).
- Neurosymbolic/hybrid approaches: DANA (Luong et al., 2024) uses domain-aware neurosymbolic task plans, deterministic symbolic operators, and a Knowledge Store to achieve 94.3% overall accuracy and near-perfect consistency across all FinanceBench question types. DANA constructs explicit hierarchical task plans for extraction, computation, and judgment, eliminating the hallucinations and arithmetic slip-ups endemic to pure neural approaches (a minimal extract-then-compute sketch follows this list).
- Iterative reasoning modules (e.g., OODA): When layered on top of fully fine-tuned RAG, iterative reasoning yields dramatic accuracy gains—85% vs. 37–59% for best non-iterative setups (Nguyen et al., 2024).
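To make the neurosymbolic idea concrete, the sketch below separates extraction from arithmetic: values are pulled from evidence (here by a stub), and the metric is computed by ordinary Python rather than by the LLM, which is the property that removes arithmetic slip-ups. The plan structure, operator names, and numbers are illustrative, not DANA's actual API or data.

```python
from typing import Callable, Dict

# Stand-in for an extraction step (retrieval plus LLM or rule-based parsing would go here).
ExtractFn = Callable[[str, str], float]  # (document_id, line_item) -> value in USD millions

def gross_margin_plan(doc_id: str, extract: ExtractFn) -> Dict[str, float]:
    """A tiny hierarchical plan: extract two line items, then compute deterministically."""
    revenue = extract(doc_id, "total revenue")
    cogs = extract(doc_id, "cost of goods sold")
    gross_profit = revenue - cogs                       # symbolic arithmetic, not LLM arithmetic
    return {
        "revenue": revenue,
        "cogs": cogs,
        "gross_profit": gross_profit,
        "gross_margin_pct": 100.0 * gross_profit / revenue,
    }

# Example with hard-coded extracted values (illustrative numbers only).
fake_extract: ExtractFn = lambda doc, item: {"total revenue": 100_000.0,
                                             "cost of goods sold": 80_000.0}[item]
print(gross_margin_plan("EXAMPLE_10K", fake_extract))   # gross_margin_pct == 20.0
```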
5. Document Processing Pipelines: Metadata, Multimodal, and Vision-Enhanced Methods
State-of-the-art document QA systems for FinanceBench increasingly exploit rich pre-processing, metadata fusion, and direct vision-based indexing:
- Metadata-augmented chunking: Embedding not just text, but also LLM-generated metadata (entities, clusters, nuggets) into chunk representations improves retrieval F1 (e.g., 44.4% vs. 32.9% baseline) (Dadopoulos et al., 28 Oct 2025).
- Hybrid and multi-path retrieval: Combining BM25, dense (BGE-M3), metadata, and hypothetical-answer-based paths achieves 92.51% recall at top-K on comparable expert QA (Wang et al., 20 Apr 2025).
- Vision-enhanced retrieval: VisionRAG (Roy et al., 26 Nov 2025) indexes whole page images using a pyramid of semantic vectors—page-level, section header, atomic fact, and visual hotspot artifacts—constructed via a vision-LLM (e.g., GPT-4o). Fusing retrieval signals via Reciprocal Rank Fusion, VisionRAG achieves 80.5% Accuracy@10, surpassing text-based vector store baselines by 30+ percentage points.
| System | Accuracy (%) | Comments |
|---|---|---|
| GPT-4 Turbo (shared vector store) | 19.0 | Single text vector store over all filings |
| GPT-4 Turbo (single vector store) | 50.0 | Per-filing text vector store |
| Claude 2 (long context) | 76.0 | Full filing in prompt |
| GPT-4 Turbo (oracle pages) | 85.0 | Gold (human-annotated) evidence pages |
| VisionRAG (pyramid, OCR-less) | 80.5 | Accuracy@10; lean vision-artifact index, K=10 pages |
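Reciprocal Rank Fusion, which VisionRAG uses to merge its page-, section-, fact-, and hotspot-level retrieval signals, has a simple standard form. The sketch below uses the common k = 60 smoothing constant, which is an assumption rather than the paper's reported setting.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of page/document IDs.

    Each list contributes 1 / (k + rank) per item; items ranked highly by
    several signals accumulate the largest fused scores.
    """
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Example: three retrieval signals (e.g. page-level, atomic-fact, visual-hotspot rankings).
print(reciprocal_rank_fusion([
    ["p12", "p7", "p3"],
    ["p7", "p12", "p9"],
    ["p7", "p3", "p12"],
]))  # "p7" and "p12" rise to the top because multiple signals agree on them
```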
6. Impact, Limitations, and Future Research Directions
FinanceBench is now established as the definitive QA benchmark for retrieval and question answering on SEC filings. Studies reveal:
- Domain-adapted retrieval is critical: Most performance gains in FinanceBench QA are due to specialized retriever/embedding models (Anderson et al., 2024, Nguyen et al., 2024), with further boosts from metadata/contextual enrichment (Dadopoulos et al., 28 Oct 2025).
- Long-context LLMs vs. vector/RAG hybrids: Feeding entire filings enables higher accuracy (e.g., GPT-4 79%), but with impractical latency and token costs for enterprise scale; hybrid RAG architectures are preferred for real-world deployment (Islam et al., 2023).
- Hallucinations and error propagation remain central challenges: Even SOTA LLMs exhibit logical, arithmetic, and retrieval slip-ups, with rare multi-hop or cross-file evidence needs still causing brittleness (Luong et al., 2024, Wang et al., 20 Apr 2025).
- Open, domain-anchored evaluation encourages methodological rigor: transparent annotation, precise evaluation criteria, and challenging evidence-centered retrieval foster community progress and reproducibility (Islam et al., 2023, Dadopoulos et al., 28 Oct 2025).
Current limitations include:
- Sensitivity to chunking and retrieval granularity
- Need for continual retriever adaptation as new jargon/reporting standards emerge
- Incomplete support for multi-modal and table/figure evidence in most text-centric systems
Active research addresses:
- Improved cross-modal and co-reference handling
- Online (incremental) retriever adaptation for continually updated filings
- More efficient cross-encoder and distillation methods for retrieval relevance
- Advanced multi-turn and agentic reasoning scaffolds integrated with retrieval and verification
FinanceBench thus remains central in driving both methodological and practical advances in production-grade financial QA, retrieval, and document intelligence (Islam et al., 2023, Nguyen et al., 2024, Luong et al., 2024, Anderson et al., 2024, Dadopoulos et al., 28 Oct 2025, Roy et al., 26 Nov 2025, Wang et al., 20 Apr 2025).