TQA-Bench: Multi-Table QA Benchmark
- TQA-Bench is a benchmark suite that systematically evaluates large language models on complex multi-table question answering tasks.
- It employs real-world datasets and innovative sampling methodologies to overcome the limitations of single-table evaluations.
- The benchmark integrates symbolic extensions for reliable answer verification, highlighting performance gaps in compositional reasoning over long contexts.
TQA-Bench is a benchmark and evaluation suite created to systematically assess the capabilities of LLMs in handling multi-table question answering (TQA) over relational databases. It is specifically designed to address limitations of prior table-QA benchmarks, which focus primarily on single-table settings and fail to capture the complex reasoning required for realistic, multi-table contexts in domains such as finance, healthcare, and e-commerce. TQA-Bench introduces datasets, methodology, and evaluation protocols that enable robust, fine-grained measurement of LLM performance on tasks that require scalable context handling, multi-step joins, aggregation, and symbolic reasoning (Qiu et al., 29 Nov 2024).
1. Motivation: The Need for Multi-Table QA Benchmarks
Prevailing Table-QA benchmarks, such as WikiTableQuestions and FeTaQA, feature single tables of modest size and do not require reasoning over heterogeneous schemas, multiple joins, or large-scale contexts. They are therefore inadequate for real-world scenarios, where operational databases may contain millions of rows distributed across 2–6 interconnected tables linked by foreign keys. Additionally, much of the content in widely used benchmarks has already been exposed to LLMs during pretraining, confounding generalization analysis through memorization or data leakage. Furthermore, fixed question sets in prior work tend to promote answer-pattern memorization rather than genuine compositional reasoning.
TQA-Bench systematically addresses these deficits by incorporating:
- Multiple, heterogeneous tables per instance
- Large-scale, sampled contexts ranging from 8K to 64K tokens
- Complex relational operators (multi-way joins, aggregation, correlation)
- New symbolic extensions for precise, robust evaluation
2. Benchmark Construction and Data Sources
TQA-Bench leverages public, real-world datasets including:
- WorldBank: Two-table and large biodiversity datasets (up to 6×10⁵ rows)
- DataGov: Water Quality and Food Facility Inspection (up to 1.6×10⁶ rows per table, 2–4 tables each)
- BIRD: Seven enterprise-style, multi-table databases (2–6 tables) with referential integrity, derived from Text2SQL benchmarks after removing cycles and broken foreign key constraints
Sampling utilizes a topological sort of tables according to their foreign key relationships, followed by probabilistic row sampling rooted at source tables. The context size (token count after Markdown serialization) is precisely controlled via binary search to meet target window sizes (8K, 16K, 32K, 64K tokens). The sampling probability for any table subset $S$ is given by

$$P(S) = \prod_{T_i \in S} \frac{n_i}{N_i},$$

where $n_i$ is the number of sampled rows and $N_i$ is the original number of rows for table $T_i$. Serialization is performed in Markdown, which has empirically demonstrated superior performance for LLMs over CSV formats.
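The following sketch is a minimal illustration of this pipeline under simplifying assumptions, not the authors' implementation: tables are subsampled at a common fraction in foreign-key topological order, serialized to Markdown, and the fraction is binary-searched until the token count lands on the target window. The helpers serialize_markdown, count_tokens, and sample_context are hypothetical, and foreign-key-consistent filtering of child rows is omitted for brevity.

```python
import random

def serialize_markdown(name, header, rows):
    """Render one sampled table as a Markdown pipe table."""
    lines = [f"### {name}",
             "| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)

def count_tokens(text):
    # Stand-in tokenizer; in practice use the target model's tokenizer.
    return len(text.split())

def sample_context(tables, topo_order, target_tokens, tol=0.05, seed=0):
    """tables: {name: (header, rows)}; topo_order: a foreign-key-respecting table order."""
    rng = random.Random(seed)
    lo, hi, best = 0.0, 1.0, ""
    for _ in range(20):  # binary search over the row-sampling fraction
        frac = (lo + hi) / 2
        parts = []
        for name in topo_order:
            header, rows = tables[name]
            k = max(1, round(frac * len(rows)))
            parts.append(serialize_markdown(name, header, rng.sample(rows, k)))
        ctx = "\n\n".join(parts)
        n = count_tokens(ctx)
        if abs(n - target_tokens) <= tol * target_tokens:
            return ctx
        if n > target_tokens:
            hi = frac          # context too large: sample fewer rows
        else:
            best, lo = ctx, frac
    return best
```

In the benchmark itself, rows are drawn probabilistically starting from the source (parent) tables so that sampled child rows retain valid foreign-key references; the uniform fraction above is only a stand-in for that rooted procedure.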
3. Symbolic Extensions and Ground-Truth Generation
To ensure reliable, non-heuristic answer verification, TQA-Bench integrates symbolic operations within its question templates. For each instantiation, template variables (such as { AIRLINE_CODE }) are complemented by embedded scripts (Python or algebraic expressions) that compute the gold answer over the sampled tables. During inference, models see only the natural-language rendition (typically a multiple-choice question), while evaluation is conducted via the symbolic template’s computed answer.
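As a rough illustration of this mechanism (the names QuestionTemplate and avg_delay_for_airline are hypothetical, not drawn from the benchmark), a template can pair natural-language text containing placeholders with a Python function that computes the gold answer over whichever rows were sampled:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]

@dataclass
class QuestionTemplate:
    text: str                                                   # question with placeholders
    gold_fn: Callable[[Dict[str, List[Row]], Dict[str, str]], float]  # computes the gold answer

def avg_delay_for_airline(tables, variables):
    """Join flights to airlines on the foreign key and average the delay column."""
    code = variables["AIRLINE_CODE"]
    ids = {r["airline_id"] for r in tables["airlines"] if r["code"] == code}
    delays = [r["delay"] for r in tables["flights"] if r["airline_id"] in ids]
    return sum(delays) / len(delays)

template = QuestionTemplate(
    text="What is the average arrival delay (minutes) for { AIRLINE_CODE }?",
    gold_fn=avg_delay_for_airline,
)

# At generation time, the gold answer is computed on the *sampled* tables,
# so it remains correct regardless of which rows were drawn.
tables = {
    "airlines": [{"airline_id": 1, "code": "UA"}, {"airline_id": 2, "code": "DL"}],
    "flights": [{"airline_id": 1, "delay": 10}, {"airline_id": 1, "delay": 20},
                {"airline_id": 2, "delay": 5}],
}
gold = template.gold_fn(tables, {"AIRLINE_CODE": "UA"})        # -> 15.0
question = template.text.replace("{ AIRLINE_CODE }", "UA")     # shown to the model as MCQ
```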
The benchmark thus covers:
- Simple lookups (entity focus, selection via PK/FK)
- Aggregations (count, sum, average)
- Composite calculations (e.g., differences of means, correlation coefficients)

Symbolically defined answers ensure robustness against data permutations and allow systematic difficulty scaling by varying a question's arithmetic and relational complexity.
4. Evaluation Protocol and Task Taxonomy
Tasks are classified into three major categories and seven subcategories, including:
- Lookup
  - EL (Entity Lookup; selection via primary/foreign keys)
- Aggregation
  - CNT (Counting)
  - SUM (Summation)
  - AVG (Averaging)
- Complex Calculation
  - CS (Composite subtraction, e.g., difference of averages)
  - COR (Correlation coefficient computation)
Each question is presented as a four-option multiple-choice query. Reported metrics include:
- Exact Match (EM): the fraction of questions for which the selected option matches the symbolically computed gold answer
- Subcategory- and difficulty-stratified accuracy
Each sampled instance (database plus tables) generates up to ~98 questions, enabling distributional analysis as context length and schema complexity increase.
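A minimal sketch of this scoring scheme, assuming each model response has already been parsed into an option letter (the field names pred, gold, and subcategory are illustrative):

```python
from collections import defaultdict

def exact_match(results):
    """results: list of dicts with 'pred', 'gold', and 'subcategory' keys."""
    return sum(r["pred"] == r["gold"] for r in results) / len(results)

def stratified_accuracy(results):
    """EM broken down by task subcategory (EL, CNT, SUM, AVG, CS, COR, ...)."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["subcategory"]].append(r)
    return {sub: exact_match(rs) for sub, rs in buckets.items()}

results = [
    {"pred": "B", "gold": "B", "subcategory": "CNT"},
    {"pred": "A", "gold": "C", "subcategory": "COR"},
]
print(exact_match(results))          # 0.5
print(stratified_accuracy(results))  # {'CNT': 1.0, 'COR': 0.0}
```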
5. Experimental Setup and Model Coverage
TQA-Bench evaluates open- and closed-source LLMs across a wide range of sizes (roughly 2B to 72B parameters), including prominent foundation models:
- GPT-4o and GPT-4o-mini (proprietary, 128K-token context window)
- Qwen2.5 (3B/7B/14B/72B), Llama3.1 (8B/70B), Baichuan2 (7B/13B), GLM-4-9B, Mistral-7B, Vicuna (7B/13B), Gemma2 (2B/9B/27B), DeepSeek-V2 (MoE 15.7B)
- Table-specialized models (TableLlama, TableGPT2)
All models receive the same prompt template, with no model-specific tuning. Chat-style models are included for comparison, but generally perform worse due to non-adherence to the MCQ format.
6. Results, Insights, and Limitations
TQA-Bench reveals systematic trends:
- Markdown serialization is preferable to CSV for LLM ingestion (e.g., GPT-4o: 78.7% EM with Markdown vs. 72.96% with CSV at 8K tokens)
- Instruction-tuned "Instruct" models scale with parameter count, but gains attenuate as context window increases:
- Llama3.1-70B: 62.9% EM (8K) → 47.9% (64K)
- Qwen2.5-14B: 59.4% (8K) → 41.3% (64K)
- GPT-4o: 78.7% (8K) → 63.4% (64K)
- Chat-only models (Vicuna, Baichuan-Chat, GLM-Chat) underperform (<25% EM) due to MCQ format noncompliance
- Table-specialized LLMs (TableLlama, TableGPT2) do not generalize well to multi-table unseen structures
- Task subcategory matters: accuracy decays from lookup (EL) and counting (CNT) to complex calculations (COR), which drop below 20% EM for open-source models at large context sizes
Batch-level variance is significant, but increasing the number of symbolic question templates yields robust, stable benchmarking distributions.
7. Implications, Recommendations, and Future Directions
TQA-Bench demonstrates that even state-of-the-art LLMs, including GPT-4o, remain challenged by reasoning over long, multi-table relational contexts—particularly for compositional tasks requiring symbolic aggregation or correlation. Instruction-tuning and scaling help, but significant gaps persist.
The authors recommend:
- Integrating symbolic and program-aided decomposition (SQL/Python intermediate execution) to close arithmetic and aggregation reliability gaps; a minimal sketch follows this list
- Table-centric pretraining with large, relational corpora and realistic foreign key structures
- Retrieval-augmented prompting and chain-of-thought mechanisms to reduce context overload
- Explicit evaluation on schemas with cyclic or recursive foreign keys, and continued research on unifying text-to-SQL planning with end-to-end Table-QA
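As a hedged sketch of the first recommendation above, and not a component of TQA-Bench itself, a program-aided pipeline can ask the model for an intermediate SQL query and execute it against the sampled tables, replacing in-context arithmetic with exact computation; ask_llm and answer_with_sql are hypothetical placeholders:

```python
import sqlite3

def ask_llm(prompt: str) -> str:
    # Placeholder model call; returns a canned query so the sketch runs
    # end-to-end on the toy database below.
    return "SELECT AVG(delay) FROM flights WHERE airline_id = 1"

def answer_with_sql(question: str, conn: sqlite3.Connection):
    """Ask the model for SQL, execute it, and return the computed value.
    Mapping that value onto an MCQ option letter is then a simple lookup."""
    sql = ask_llm(f"Write one SQLite query that answers:\n{question}\nSQL:")
    return conn.execute(sql).fetchone()[0]

# Toy usage on an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (airline_id INTEGER, delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", [(1, 10.0), (1, 20.0), (2, 5.0)])
print(answer_with_sql("What is the average delay for airline 1?", conn))  # 15.0
```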
Open questions include efficient alignment of LLM attention across ≥100K token windows, leveraging mixture-of-experts or retrieval-augmented architectures for scalable relational data handling, and designing robust QA pipelines that blend formal symbolic computation with natural language understanding.
TQA-Bench thus fills a critical gap in the evaluation of LLMs for advanced enterprise-grade table reasoning and lays the groundwork for new research in LLM-based data management systems (Qiu et al., 29 Nov 2024).