T^2-RAGBench: Multi-Modal Financial RAG Evaluation
- T^2-RAGBench is a benchmark designed to evaluate RAG systems on financial documents that combine narrative text and tabular data.
- It rigorously separates retrieval from numerical reasoning, enabling precise diagnosis of error sources in complex multi-modal settings.
- Empirical results reveal significant performance gaps, with hybrid retrieval approaches outperforming pure dense methods in realistic scenarios.
T^2-RAGBench is a benchmark specifically designed for evaluating Retrieval-Augmented Generation (RAG) methods in the challenging context of multi-modal financial documents that combine textual and tabular information. The benchmark targets tasks requiring both accurate retrieval from large corpora of text-and-table documents and complex numerical reasoning, providing a standard for assessing the capabilities of RAG systems in practical, domain-specific scenarios.
1. Objectives and Benchmark Structure
T^2-RAGBench was constructed to address a critical gap in RAG evaluation methodology. Unlike most QA datasets that operate under “Oracle-context” assumptions, where the exact supporting context is already supplied, T^2-RAGBench requires that systems locate the correct, relevant context from a large heterogeneous corpus before performing reasoning. This mirrors the actual workflow in production RAG deployments. The main tasks are divided as follows:
- Retrieval: Given a question $q$ and the corpus $\mathcal{D}$, a retrieval function $R$ selects the top-$k$ most relevant contexts $c_1, \dots, c_k \in \mathcal{D}$, each comprising both narrative text and tables.
- Reasoning/Generation: A generation model $G$ then consumes the retrieved contexts and the question $q$ to produce a numerical answer $\hat{a}$.
This structure explicitly separates retrieval and reasoning errors, enabling clean attribution of bottlenecks in end-to-end RAG systems.
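A minimal sketch of this two-stage protocol is shown below; the function names and signatures (`retrieve`, `generate`, `answer_question`) are illustrative assumptions, not part of the benchmark release.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Context:
    doc_id: str
    text: str    # narrative passages
    table: str   # serialized table content

def answer_question(
    question: str,
    corpus: List[Context],
    retrieve: Callable[[str, List[Context], int], List[Context]],  # retrieval function R
    generate: Callable[[str, List[Context]], str],                 # generation model G
    k: int = 3,
) -> str:
    """Two-stage RAG: retrieve top-k text-and-table contexts, then reason over them."""
    top_k = retrieve(question, corpus, k)   # retrieval errors originate here
    return generate(question, top_k)        # reasoning errors originate here
```

Keeping the two stages behind separate callables is what allows retrieval and reasoning errors to be attributed independently, as the benchmark intends.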
2. Dataset Composition and Context Reformulation
T^2-RAGBench consists of 32,908 question–context–answer (QCA) triples sourced from four established financial QA datasets: FinQA, ConvFinQA, TAT-DQA, and a filtered subset of VQAonBD. The total corpus includes 9,095 real-world financial documents.
Each example in T^2-RAGBench is carefully reformulated by:
- Constructing context-independent questions using large-scale LLMs (e.g., Llama 3.3-70B) so that each question has a single unique answer given the correct context, resolving the ambiguities typical of context-dependent questions in prior datasets.
- Appending specific metadata (e.g., company name, fiscal year, sector) to the question, increasing its length by approximately 38% on average and tightly localizing the information need within the data.
- Ensuring that each context is a realistic blend of financial narrative text and tables, with an average context length of 785.8 tokens.
This data preprocessing enforces a rigorous evaluation setting and mitigates confounding variables arising from question ambiguity or under-specified contexts.
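To make the reformulation concrete, the following is a minimal sketch of the metadata-augmentation step, assuming a hypothetical `llm_rewrite` callable (e.g., a wrapper around Llama 3.3-70B); the prompt wording and helper names are illustrative, not the authors' exact pipeline.

```python
def make_context_independent(question: str, metadata: dict, llm_rewrite) -> str:
    """Rewrite a context-dependent question so that it has a single unique answer
    once the correct document is retrieved. Illustrative only: the prompt below is
    an assumption, not the benchmark authors' exact instruction."""
    meta_str = ", ".join(f"{key}: {value}" for key, value in metadata.items())
    prompt = (
        "Rewrite the question so it is unambiguous without its source document, "
        f"embedding this metadata ({meta_str}). Question: {question}"
    )
    return llm_rewrite(prompt)

# Example with hypothetical metadata drawn from a filing:
# make_context_independent(
#     "What was the change in operating income?",
#     {"company": "Acme Corp", "fiscal year": 2019},
#     llm_rewrite=my_llm,   # any LLM completion function
# )
```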
3. Evaluation Methodology and Metrics
Evaluation with T-RAGBench is structured in two principal phases: retrieval and numerical reasoning. Each stage is equipped with mathematically formalized metrics:
Retrieval Evaluation
For each sample $i$, the “true rank” $r_i$ is defined as the position of the ground-truth context $c_i^{*}$ in the retrieved list $c_{i,1}, \dots, c_{i,k}$:

$$ r_i = \min \{\, j : c_{i,j} = c_i^{*} \,\}, $$

treated as $\infty$ if the gold context is not retrieved. Performance is primarily measured by Mean Reciprocal Rank at $k$ (MRR@$k$):

$$ \mathrm{MRR@}k = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbb{1}\left[r_i \le k\right]}{r_i}, $$

where $\mathbb{1}[\cdot]$ is the indicator function.
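A small reference implementation of MRR@$k$ under this definition; the function name and the convention of `None` for a missing gold context are illustrative assumptions.

```python
from typing import List, Optional

def mrr_at_k(true_ranks: List[Optional[int]], k: int) -> float:
    """Mean Reciprocal Rank at k.

    true_ranks[i] is the 1-indexed position of the gold context for sample i
    in the retrieved list, or None if it was not retrieved at all.
    Samples whose gold context appears beyond rank k contribute 0.
    """
    total = 0.0
    for r in true_ranks:
        if r is not None and r <= k:
            total += 1.0 / r
    return total / len(true_ranks)

# Example: gold context at ranks 1, 4, and not retrieved -> MRR@3 = (1 + 0 + 0) / 3
assert abs(mrr_at_k([1, 4, None], k=3) - 1 / 3) < 1e-9
```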
Numerical Reasoning and Answer Matching
Subsequent to retrieval, the reasoning component generates the answer $\hat{a}$. For predominantly numeric answers, exact matching is relaxed using the “Number Match” criterion, which tolerates small relative or absolute deviations. Let $\hat{a}$ denote the predicted value and $a$ the gold value; the match condition is

$$ \frac{|\hat{a} - a|}{\max\!\big(|\hat{a}|,\, |a|\big)} \le \tau, $$

or both $|\hat{a}|$ and $|a|$ fall below a tiny threshold $\epsilon$, mitigating false negatives due to rounding or unit-scale errors.
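A sketch of this relaxed matching rule in Python; the default tolerances `rel_tol` and `eps` are placeholders, not the benchmark's official values.

```python
def number_match(pred: float, gold: float, rel_tol: float = 1e-2, eps: float = 1e-9) -> bool:
    """Relaxed numeric matching: accept small relative deviations, and treat two
    near-zero values as equal. Tolerance defaults are illustrative placeholders."""
    if abs(pred) < eps and abs(gold) < eps:
        return True
    return abs(pred - gold) <= rel_tol * max(abs(pred), abs(gold))

# Example: 3.142 vs. 3.14 differ by ~0.06%, well within a 1% relative tolerance.
assert number_match(3.142, 3.14)
assert not number_match(3.5, 3.14)
```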
4. Empirical Findings and Comparative Analysis
Comprehensive benchmarking with state-of-the-art LLMs and retrieval strategies yields several notable observations:
- Oracle-Context Upper Bound: When the correct supporting context is given, SOTA LLMs commonly achieve Number Match scores exceeding 70%, confirming high intrinsic numerical reasoning on these tasks.
- End-to-End RAG Performance: In the full RAG scenario (retrieval plus generation), performance degrades significantly. Base-RAG systems exhibit average MRR@3 below 40%, highlighting the difficulty of precisely retrieving heterogeneous text-and-table contexts from large corpora.
- Retrieval Strategy Ranking: Hybrid BM25, which linearly combines sparse (BM25) and dense (vector-based) retrieval scores, systematically outperforms alternatives including Base-RAG (dense only), reranker-based methods, and approaches leveraging hypothetical document generation (HyDE). The hybrid approach demonstrates superior recall due to the complementary strengths of lexical and semantic similarity (a score-fusion sketch follows this list).
- Subdomain Variation: Approaches that summarize or condense long contexts improve retrieval and answer accuracy for text-rich tasks (FinQA, ConvFinQA) but may harm performance where granular tabular evidence is required (VQAonBD).
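Below is a minimal sketch of the linear score-fusion idea behind Hybrid BM25, assuming the `rank_bm25` package for the sparse scorer and any dense text encoder passed in as `embed`; the weight `alpha` and the min-max normalization are illustrative choices, not the benchmark's exact configuration.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # sparse lexical scorer (assumed dependency)

def hybrid_scores(query: str, docs: list, embed, alpha: float = 0.5) -> np.ndarray:
    """Linear fusion of sparse (BM25) and dense (cosine) relevance scores.

    `embed` is any function mapping a string to a vector (e.g., an E5-style encoder);
    `alpha` weights the two signals and is a placeholder, not a tuned value.
    """
    # --- sparse scores ---
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = bm25.get_scores(query.lower().split())

    # --- dense scores (cosine similarity) ---
    q = np.asarray(embed(query), dtype=float)
    D = np.stack([np.asarray(embed(d), dtype=float) for d in docs])
    dense = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-9)

    # min-max normalize each signal so the linear combination is balanced
    def norm(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return alpha * norm(sparse) + (1 - alpha) * norm(dense)

# Ranking: np.argsort(-hybrid_scores(q, corpus, embed))[:k] gives the top-k contexts.
```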
5. Ablation Studies: Embedding Models and Corpus Size
The benchmark includes ablation studies to isolate key variables affecting RAG performance:
- Embedding Model Selection: When evaluating open- and closed-source embeddings, the best open model (Multilingual E5-Instruct) achieves approximately 29.4% Recall@1 (R@1) and 38.6% MRR@5. Closed-source options (e.g., OpenAI Text-Embedding-3 Large) perform somewhat better but still struggle in absolute terms.
- Corpus Scaling Effects: Retrieval accuracy declines sharply as the number of candidate contexts surpasses 3,000. This effect persists across retrieval methods, underscoring the inherent scaling challenge in mixed-modal document corpora.
| Retrieval Method | R@1 (%) | MRR@5 (%) | Notes |
|---|---|---|---|
| Hybrid BM25 | Highest | Highest | Sparse + dense score fusion |
| Base-RAG (dense only) | Lower | Lower | Pure dense retrieval |
| Reranker-based | Intermediate | Intermediate | Post-retrieval reranking |
| Multilingual E5-Instruct | 29.4 | 38.6 | Best open-source embedding |
6. Practical Significance and Availability
T^2-RAGBench is distinguished by its emphasis on context-independence, modular evaluation, and the realistic challenge of retrieving financial documents whose complex evidence is distributed across text and tables. This benchmark enables:
- Disentangling errors in retrieval from reasoning, vital for diagnosing performance bottlenecks.
- Benchmarking document retrievers and generators under a single, standardized protocol, facilitating fair comparison and ablation.
- Stimulating further research in multi-modal retrieval, robust cross-modal embedding, and advanced fusion strategies for text-and-table data.
All resources, including code and data, are released for public access at https://anonymous.4open.science/r/g4kmu-paper-D5F8/README.md.
7. Relation to Broader RAG Benchmarking and Open Challenges
T^2-RAGBench complements existing RAG benchmarks such as RAGBench (focused on diverse industry domains with explainable evaluation metrics) (Friel et al., 25 Jun 2024) and mmRAG (emphasizing modular, multi-modal benchmarking including knowledge graphs and tables) (Xu et al., 16 May 2025). In contrast, T^2-RAGBench is tailored to the text-and-table financial QA domain, foregrounds numerical reasoning, and systematically eliminates Oracle-style context advantages.
Persistent challenges identified include the need for improved cross-modal embeddings capable of reconciling textual and tabular modalities, retrieval algorithms with stable performance as corpus sizes scale, and generation methods that robustly integrate evidence from retrieved heterogeneous contexts. The significant performance gap observed between the Oracle-context and retrieval-dependent settings underscores that retrieval remains the limiting factor in complex, multi-modal RAG systems, motivating continued algorithmic innovation and more granular, context-aware evaluation.