T^2-RAGBench: Multi-Modal Financial RAG Evaluation

Updated 8 September 2025
  • T^2-RAGBench is a benchmark designed to evaluate RAG systems on financial documents that combine narrative text and tabular data.
  • It rigorously separates retrieval from numerical reasoning, enabling precise diagnosis of error sources in complex multi-modal settings.
  • Empirical results reveal significant performance gaps, with hybrid retrieval approaches outperforming pure dense methods in realistic scenarios.

T^2-RAGBench is a benchmark specifically designed for evaluating Retrieval-Augmented Generation (RAG) methods in the challenging context of multi-modal financial documents that combine textual and tabular information. The benchmark targets tasks requiring both accurate retrieval from large corpora of text-and-table documents and complex numerical reasoning, providing a standard for assessing the capabilities of RAG systems in practical, domain-specific scenarios.

1. Objectives and Benchmark Structure

T^2-RAGBench was constructed to address a critical gap in RAG evaluation methodology. Unlike most QA datasets, which operate under an “Oracle-context” assumption (the exact supporting context is already supplied), T^2-RAGBench requires that systems locate the correct, relevant context from a large heterogeneous corpus before performing reasoning. This mirrors the actual workflow in production RAG deployments. The main tasks are divided as follows:

  • Retrieval: A retrieval function $f$ selects the top-$n$ most relevant context entities $\{C_1^*, C_2^*, \ldots, C_n^*\}$, each comprising both narrative text and tables, given a question $Q$ and the corpus $\mathcal{C}$.
  • Reasoning/Generation: A generation model $M$ then consumes the retrieved contexts and the question to produce a numerical answer $A^*$.

This structure explicitly separates retrieval and reasoning errors, enabling clean attribution of bottlenecks in end-to-end RAG systems.
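
This separation can be made concrete with a small driver loop. The sketch below is illustrative only: `retrieve` and `generate` are hypothetical callables standing in for the retrieval function $f$ and the generation model $M$, not part of the benchmark's released code.

```python
from typing import Callable, List, Tuple

def run_rag(question: str,
            corpus: List[str],
            retrieve: Callable[[str, List[str], int], List[str]],
            generate: Callable[[str, List[str]], str],
            n: int = 3) -> Tuple[List[str], str]:
    """Run retrieval and generation as separate stages so their errors can be
    scored independently (retrieval via MRR@k, answers via Number Match)."""
    contexts = retrieve(question, corpus, n)   # retrieval stage: f(Q, C) -> {C*_1, ..., C*_n}
    answer = generate(question, contexts)      # reasoning stage:  M(Q, {C*}) -> A*
    return contexts, answer
```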

2. Dataset Composition and Context Reformulation

T^2-RAGBench consists of 32,908 question–context–answer (QCA) triples sourced from four established financial QA datasets: FinQA, ConvFinQA, TAT-DQA, and a filtered subset of VQAonBD. The total corpus includes 9,095 real-world financial documents.

Each example in T^2-RAGBench is carefully reformulated by:

  • Constructing context-independent questions with large LLMs (e.g., Llama 3.3 70B), so that each question has a single unique answer given the correct context, resolving the ambiguities common in the context-dependent questions of prior datasets.
  • Appending specific metadata (e.g., company name, fiscal year, sector) to the question, increasing its length by approximately 38% on average and tightly localizing the information need within the data.
  • Ensuring that each context is a realistic blend of financial narrative text and tables, with an average context length of 785.8 tokens.

This data preprocessing enforces a rigorous evaluation setting and mitigates confounding variables arising from question ambiguity or under-specified contexts.
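
As a rough illustration of the reformulation step, the following sketch prepends entity metadata to an otherwise context-dependent question. The helper, field names, and example values are hypothetical and not taken from the dataset.

```python
def make_context_independent(question: str, company: str, fiscal_year: int) -> str:
    """Fold identifying metadata into the question so it has a single answer
    without requiring its source document as context (illustrative only)."""
    return f"For {company} in fiscal year {fiscal_year}, {question[0].lower()}{question[1:]}"

# Hypothetical example values:
print(make_context_independent("What was the change in operating expenses?",
                               "ExampleCorp", 2019))
# -> For ExampleCorp in fiscal year 2019, what was the change in operating expenses?
```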

3. Evaluation Methodology and Metrics

Evaluation with T^2-RAGBench is structured in two principal phases: retrieval and numerical reasoning. Each stage is equipped with mathematically formalized metrics:

Retrieval Evaluation

For each sample $i$, the “true rank” $r_i$ is defined as the smallest position $k$ at which the ground-truth context $C_i$ appears in the retrieved list:

$$r_i = \min\{k \mid C^*_k = C_i\}$$

Performance is primarily measured by Mean Reciprocal Rank at $k$ (MRR@$k$):

$$\operatorname{MRR@}k = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r_i} \cdot \mathbf{1}(r_i \leq k)$$

where $\mathbf{1}(\cdot)$ is the indicator function.
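
A direct implementation of this metric, assuming the true rank of each sample has already been determined (with never-retrieved contexts assigned an infinite rank), might look as follows.

```python
import math
from typing import List

def mrr_at_k(true_ranks: List[float], k: int) -> float:
    """Mean Reciprocal Rank at k: a sample contributes 1/r_i only if its
    ground-truth context appears within the top-k retrieved results."""
    return sum(1.0 / r for r in true_ranks if r <= k) / len(true_ranks)

# Example: ground-truth contexts found at ranks 1, 3, and 7; the third sample
# falls outside the top 5 and contributes nothing.
print(mrr_at_k([1, 3, 7], k=5))          # (1/1 + 1/3 + 0) / 3 ≈ 0.444
print(mrr_at_k([2, math.inf, 1], k=5))   # (1/2 + 0 + 1) / 3 = 0.5
```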

Numerical Reasoning and Answer Matching

Subsequent to retrieval, the reasoning component generates the answer $A^*$. For predominantly numeric answers, exact matching is relaxed using the “Number Match” criterion, which tolerates small relative or absolute deviations. Let $a^* = |A^*|$ and $a = |A|$ denote the absolute values of the predicted and gold answers; the match condition is

$$\left|q - 1\right| < \varepsilon \quad \text{with} \quad q = \frac{a^*}{a}\cdot 10^{-\operatorname{round}(\log_{10}(a^*/a))},$$

or both values fall below a tiny threshold $\varepsilon$, which mitigates false negatives due to rounding or unit-scale errors.
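
A sketch of this criterion in code, assuming a tolerance of $\varepsilon = 0.01$ (the benchmark's exact epsilon and edge-case handling may differ):

```python
import math

def number_match(pred: float, gold: float, eps: float = 0.01) -> bool:
    """Relaxed numeric comparison: rescale the prediction/gold ratio by the
    nearest power of ten so unit or scale mismatches (thousands vs. millions,
    percent vs. fraction) are forgiven, then require |q - 1| < eps."""
    a_star, a = abs(pred), abs(gold)
    if a_star < eps and a < eps:      # both values effectively zero
        return True
    if a == 0.0 or a_star == 0.0:     # only one value is zero: no match
        return False
    ratio = a_star / a
    q = ratio * 10 ** (-round(math.log10(ratio)))
    return abs(q - 1.0) < eps

print(number_match(1230.0, 1.23))  # True: differs only by a power of ten
print(number_match(1.25, 1.23))    # False: ~1.6% relative error exceeds eps
```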

4. Empirical Findings and Comparative Analysis

Comprehensive benchmarking with state-of-the-art LLMs and retrieval strategies yields several notable observations:

  • Oracle-Context Upper Bound: When the correct supporting context is given, SOTA LLMs commonly achieve Number Match scores exceeding 70%, confirming high intrinsic numerical reasoning on these tasks.
  • End-to-End RAG Performance: In the full RAG scenario (retrieval plus generation), performance degrades significantly. Base-RAG systems exhibit average MRR@3 below 40%, highlighting the difficulty of precisely retrieving heterogeneous text-and-table contexts from large corpora.
  • Retrieval Strategy Ranking: Hybrid BM25, which linearly combines sparse (BM25) and dense (vector-based) retrieval scores, systematically outperforms alternatives including Base-RAG (dense only), reranker-based methods, and approaches leveraging hypothetical document generation (HyDE); a score-fusion sketch follows this list. The hybrid approach demonstrates superior recall due to the complementary strengths of lexical and semantic similarity.
  • Subdomain Variation: Approaches that summarize or condense long contexts improve retrieval and answer accuracy for text-rich tasks (FinQA, ConvFinQA) but may harm performance where granular tabular evidence is required (VQAonBD).
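
The hybrid strategy can be approximated by min-max normalizing the two score lists and taking a weighted sum. The weight $\alpha = 0.5$ and the normalization below are assumptions for illustration, not the benchmark's exact configuration.

```python
from typing import Dict, List

def hybrid_rank(sparse: Dict[str, float],
                dense: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank document IDs by a weighted sum of min-max-normalized BM25 (sparse)
    and embedding-similarity (dense) scores."""
    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    s_norm, d_norm = normalize(sparse), normalize(dense)
    fused = {doc: alpha * s_norm.get(doc, 0.0) + (1 - alpha) * d_norm.get(doc, 0.0)
             for doc in set(sparse) | set(dense)}
    return sorted(fused, key=fused.get, reverse=True)

# Lexical and semantic retrieval disagree on the top document; fusion balances both.
print(hybrid_rank({"doc_a": 12.0, "doc_b": 3.0, "doc_c": 1.0},
                  {"doc_b": 0.90, "doc_c": 0.80, "doc_a": 0.10}))
# -> ['doc_b', 'doc_a', 'doc_c']
```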

5. Ablation Studies: Embedding Models and Corpus Size

The benchmark includes ablation studies to isolate key variables affecting RAG performance:

  • Embedding Model Selection: When evaluating open- and closed-source embeddings, the best open model (Multilingual E5-Instruct) achieves approximately 29.4% Recall@1 (R@1) and 38.6% MRR@5. Closed-source options (e.g., OpenAI Text-Embedding-3 Large) perform somewhat better but still struggle in absolute terms.
  • Corpus Scaling Effects: Retrieval accuracy declines sharply as the number of candidate contexts surpasses 3,000. This effect persists across retrieval methods, underscoring the inherent scaling challenge in mixed-modal document corpora.

| Retrieval Method | R@1 (%) | MRR@5 (%) | Notes |
|---|---|---|---|
| Hybrid BM25 | Highest | Highest | Sparse + dense fusion |
| Base-RAG (dense) | Lower | Lower | Pure dense retrieval |
| Reranker-based | Intermediate | Intermediate | Post-process reranking |
| Multilingual E5-Instruct | 29.4 | 38.6 | Best open-source embedding |

6. Practical Significance and Availability

T^2-RAGBench is distinguished by its emphasis on context-independence, modular evaluation, and the realistic challenge of retrieving financial documents whose complex evidence is distributed between text and tables. This benchmark enables:

  • Disentangling errors in retrieval from reasoning, vital for diagnosing performance bottlenecks.
  • Benchmarking document retrievers and generators under a single, standardized protocol, facilitating fair comparison and ablation.
  • Stimulating further research in multi-modal retrieval, robust cross-modal embedding, and advanced fusion strategies for text-and-table data.

All resources, including code and data, are released for public access at https://anonymous.4open.science/r/g4kmu-paper-D5F8/README.md.

7. Relation to Broader RAG Benchmarking and Open Challenges

T^2-RAGBench complements existing RAG benchmarks such as RAGBench (focused on diverse industry domains with explainable evaluation metrics) (Friel et al., 25 Jun 2024) and mmRAG (emphasizing modular, multi-modal benchmarking including knowledge graphs and tables) (Xu et al., 16 May 2025). In contrast, T^2-RAGBench is tailored to the text-and-table financial QA domain, foregrounds numerical reasoning, and systematically eliminates Oracle-style context advantages.

Persistent challenges identified include the need for improved cross-modal embeddings capable of reconciling textual and tabular modalities, retrieval algorithms with stable performance as corpus sizes scale, and generation methods that robustly integrate evidence from retrieved heterogeneous contexts. The significant performance gap observed between the Oracle-context and retrieval-dependent settings underscores that retrieval remains the limiting factor in complex, multi-modal RAG systems, motivating continued algorithmic innovation and more granular, context-aware evaluation.
