TQA-Bench: Multi-Table QA Benchmark
- TQA-Bench is a benchmark suite that systematically evaluates large language models on complex multi-table question answering tasks.
- It employs real-world datasets and innovative sampling methodologies to overcome the limitations of single-table evaluations.
- The benchmark integrates symbolic extensions for reliable answer verification, highlighting performance gaps in compositional reasoning over long contexts.
TQA-Bench is a benchmark and evaluation suite created to systematically assess the capabilities of LLMs in handling multi-table question answering (TQA) over relational databases. It is specifically designed to address limitations of prior table-QA benchmarks, which focus primarily on single-table settings and fail to capture the complex reasoning required for realistic, multi-table contexts in domains such as finance, healthcare, and e-commerce. TQA-Bench introduces datasets, methodology, and evaluation protocols that enable robust, fine-grained measurement of LLM performance on tasks that require scalable context handling, multi-step joins, aggregation, and symbolic reasoning (Qiu et al., 29 Nov 2024).
1. Motivation: The Need for Multi-Table QA Benchmarks
Prevailing Table-QA benchmarks, such as WikiTableQuestions and FeTaQA, feature single tables of modest size and do not require reasoning over heterogeneous schemas, multiple joins, or large-scale contexts. They are therefore inadequate for real-world scenarios, where operational databases may contain millions of rows distributed across 2–6 interconnected tables linked by foreign keys. Additionally, much of the content in widely used benchmarks has already been exposed to LLMs during pretraining, confounding generalization analysis through memorization or data leakage. Furthermore, fixed question sets in prior work tend to promote answer-pattern memorization rather than genuine compositional reasoning.
TQA-Bench systematically addresses these deficits by incorporating:
- Multiple, heterogeneous tables per instance
- Large-scale, sampled contexts ranging from 8K to 64K tokens
- Complex relational operators (multi-way joins, aggregation, correlation)
- New symbolic extensions for precise, robust evaluation
2. Benchmark Construction and Data Sources
TQA-Bench leverages public, real-world datasets including:
- WorldBank: Two-table and large biodiversity datasets (up to 6×10⁵ rows)
- DataGov: Water Quality and Food Facility Inspection (up to 1.6×10⁶ rows per table, 2–4 tables each)
- BIRD: Seven enterprise-style, multi-table databases (2–6 tables) with referential integrity, derived from Text2SQL benchmarks after removing cycles and broken foreign key constraints
Sampling utilizes a topological sort of tables according to their foreign key relationships, followed by probabilistic row sampling rooted at source tables. The context size (token count after Markdown serialization) is precisely controlled via binary search to meet target window sizes (8K, 16K, 32K, 64K tokens). The sampling probability for any table subset $S$ is given by

$$P(S) = \prod_{T_i \in S} \frac{n_i}{N_i},$$

where $n_i$ is the number of sampled rows and $N_i$ is the original number of rows for table $T_i$. Serialization is performed in Markdown, which has empirically demonstrated superior performance for LLMs over CSV formats.
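The following sketch is a minimal illustration of this pipeline under simplifying assumptions, not the authors' implementation: tables are subsampled at a common fraction in foreign-key topological order, serialized to Markdown, and the fraction is binary-searched until the token count lands on the target window. The helpers serialize_markdown, count_tokens, and sample_context are hypothetical, and foreign-key-consistent filtering of child rows is omitted for brevity.

```python
import random

def serialize_markdown(name, header, rows):
    """Render one sampled table as a Markdown pipe table."""
    lines = [f"### {name}",
             "| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)

def count_tokens(text):
    # Stand-in tokenizer; in practice use the target model's tokenizer.
    return len(text.split())

def sample_context(tables, topo_order, target_tokens, tol=0.05, seed=0):
    """tables: {name: (header, rows)}; topo_order: a foreign-key-respecting table order."""
    rng = random.Random(seed)
    lo, hi, best = 0.0, 1.0, ""
    for _ in range(20):  # binary search over the row-sampling fraction
        frac = (lo + hi) / 2
        parts = []
        for name in topo_order:
            header, rows = tables[name]
            k = max(1, round(frac * len(rows)))
            parts.append(serialize_markdown(name, header, rng.sample(rows, k)))
        ctx = "\n\n".join(parts)
        n = count_tokens(ctx)
        if abs(n - target_tokens) <= tol * target_tokens:
            return ctx
        if n > target_tokens:
            hi = frac          # context too large: sample fewer rows
        else:
            best, lo = ctx, frac
    return best
```

In the benchmark itself, rows are drawn probabilistically starting from the source (parent) tables so that sampled child rows retain valid foreign-key references; the uniform fraction above is only a stand-in for that rooted procedure.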
3. Symbolic Extensions and Ground-Truth Generation
To ensure reliable, non-heuristic answer verification, TQA-Bench integrates symbolic operations within its question templates. For each instantiation, template variables (such as { AIRLINE_CODE }) are complemented by embedded scripts (Python or algebraic expressions) that compute the gold answer over the sampled tables. During inference, models see only the natural-language rendition (typically a multiple-choice question), while evaluation is conducted via the symbolic template’s computed answer.
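As a rough illustration of this mechanism (the names QuestionTemplate and avg_delay_for_airline are hypothetical, not drawn from the benchmark), a template can pair natural-language text containing placeholders with a Python function that computes the gold answer over whichever rows were sampled:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]

@dataclass
class QuestionTemplate:
    text: str                                                   # question with placeholders
    gold_fn: Callable[[Dict[str, List[Row]], Dict[str, str]], float]  # computes the gold answer

def avg_delay_for_airline(tables, variables):
    """Join flights to airlines on the foreign key and average the delay column."""
    code = variables["AIRLINE_CODE"]
    ids = {r["airline_id"] for r in tables["airlines"] if r["code"] == code}
    delays = [r["delay"] for r in tables["flights"] if r["airline_id"] in ids]
    return sum(delays) / len(delays)

template = QuestionTemplate(
    text="What is the average arrival delay (minutes) for { AIRLINE_CODE }?",
    gold_fn=avg_delay_for_airline,
)

# At generation time, the gold answer is computed on the *sampled* tables,
# so it remains correct regardless of which rows were drawn.
tables = {
    "airlines": [{"airline_id": 1, "code": "UA"}, {"airline_id": 2, "code": "DL"}],
    "flights": [{"airline_id": 1, "delay": 10}, {"airline_id": 1, "delay": 20},
                {"airline_id": 2, "delay": 5}],
}
gold = template.gold_fn(tables, {"AIRLINE_CODE": "UA"})        # -> 15.0
question = template.text.replace("{ AIRLINE_CODE }", "UA")     # shown to the model as MCQ
```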
The benchmark thus covers:
- Simple lookups (entity focus, selection via PK/FK)
- Aggregations (count, sum, average)
- Composite calculations (e.g., differences of means, correlation coefficients)

Symbolically defined answers ensure robustness against data permutations and allow systematic difficulty scaling by varying a question's arithmetic and relational complexity.
4. Evaluation Protocol and Task Taxonomy
Tasks are classified into three major categories and seven subcategories, including:
- Lookup
  - EL (Entity Lookup; selection via primary/foreign keys)
- Aggregation
  - CNT (Counting)
  - SUM (Summation)
  - AVG (Averaging)
- Complex Calculation
  - CS (Composite subtraction, e.g., difference of averages)
  - COR (Correlation coefficient computation)
Each question is presented as a four-option multiple-choice query. Reported metrics include:
- Exact Match (EM): the fraction of questions for which the selected option matches the symbolically computed gold answer
- Subcategory- and difficulty-stratified accuracy
Each sampled instance (database plus tables) generates up to ~98 questions, enabling distributional analysis as context length and schema complexity increase.
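A minimal sketch of this scoring scheme, assuming each model response has already been parsed into an option letter (the field names pred, gold, and subcategory are illustrative):

```python
from collections import defaultdict

def exact_match(results):
    """results: list of dicts with 'pred', 'gold', and 'subcategory' keys."""
    return sum(r["pred"] == r["gold"] for r in results) / len(results)

def stratified_accuracy(results):
    """EM broken down by task subcategory (EL, CNT, SUM, AVG, CS, COR, ...)."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["subcategory"]].append(r)
    return {sub: exact_match(rs) for sub, rs in buckets.items()}

results = [
    {"pred": "B", "gold": "B", "subcategory": "CNT"},
    {"pred": "A", "gold": "C", "subcategory": "COR"},
]
print(exact_match(results))          # 0.5
print(stratified_accuracy(results))  # {'CNT': 1.0, 'COR': 0.0}
```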
5. Experimental Setup and Model Coverage
TQA-Bench evaluates open- and closed-source LLMs across a wide range of sizes (roughly 2B to 72B parameters), including prominent foundation models:
- GPT-4o and GPT-4o-mini (proprietary, 128K-token context window)
- Qwen2.5 (3B/7B/14B/72B), Llama3.1 (8B/70B), Baichuan2 (7B/13B), GLM-4-9B, Mistral-7B, Vicuna (7B/13B), Gemma2 (2B/9B/27B), DeepSeek-V2 (MoE 15.7B)
- Table-specialized models (TableLlama, TableGPT2)
All models receive the same prompt template, with no model-specific tuning. Chat-style models are included for comparison, but generally perform worse due to non-adherence to the MCQ format.
6. Results, Insights, and Limitations
TQA-Bench reveals systematic trends:
- Markdown serialization is preferable to CSV for LLM ingestion (e.g., GPT-4o: 78.7% EM with Markdown vs. 72.96% with CSV at 8K tokens)
- Instruction-tuned "Instruct" models scale with parameter count, but gains attenuate as context window increases:
- Llama3.1-70B: 62.9% EM (8K) → 47.9% (64K)
- Qwen2.5-14B: 59.4% (8K) → 41.3% (64K)
- GPT-4o: 78.7% (8K) → 63.4% (64K)
- Chat-only models (Vicuna, Baichuan-Chat, GLM-Chat) underperform (<25% EM) due to MCQ format noncompliance
- Table-specialized LLMs (TableLlama, TableGPT2) do not generalize well to multi-table unseen structures
- Task subcategory matters: accuracy decays from lookup (EL) and counting (CNT) to complex calculations (COR), which drop below 20% EM for open-source models at large context sizes
Batch-level variance is significant, but increasing the number of symbolic question templates yields robust, stable benchmarking distributions.
7. Implications, Recommendations, and Future Directions
TQA-Bench demonstrates that even state-of-the-art LLMs, including GPT-4o, remain challenged by reasoning over long, multi-table relational contexts—particularly for compositional tasks requiring symbolic aggregation or correlation. Instruction-tuning and scaling help, but significant gaps persist.
The authors recommend:
- Integrating symbolic and program-aided decomposition (SQL/Python intermediate execution) to close arithmetic and aggregation reliability gaps; a minimal sketch follows this list
- Table-centric pretraining with large, relational corpora and realistic foreign key structures
- Retrieval-augmented prompting and chain-of-thought mechanisms to reduce context overload
- Explicit evaluation on schemas with cyclic or recursive foreign keys, and continued research on unifying text-to-SQL planning with end-to-end Table-QA
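As a hedged sketch of the first recommendation above, and not a component of TQA-Bench itself, a program-aided pipeline can ask the model for an intermediate SQL query and execute it against the sampled tables, replacing in-context arithmetic with exact computation; ask_llm and answer_with_sql are hypothetical placeholders:

```python
import sqlite3

def ask_llm(prompt: str) -> str:
    # Placeholder model call; returns a canned query so the sketch runs
    # end-to-end on the toy database below.
    return "SELECT AVG(delay) FROM flights WHERE airline_id = 1"

def answer_with_sql(question: str, conn: sqlite3.Connection):
    """Ask the model for SQL, execute it, and return the computed value.
    Mapping that value onto an MCQ option letter is then a simple lookup."""
    sql = ask_llm(f"Write one SQLite query that answers:\n{question}\nSQL:")
    return conn.execute(sql).fetchone()[0]

# Toy usage on an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (airline_id INTEGER, delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", [(1, 10.0), (1, 20.0), (2, 5.0)])
print(answer_with_sql("What is the average delay for airline 1?", conn))  # 15.0
```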
Open questions include efficient alignment of LLM attention across ≥100K token windows, leveraging mixture-of-experts or retrieval-augmented architectures for scalable relational data handling, and designing robust QA pipelines that blend formal symbolic computation with natural language understanding.
TQA-Bench thus fills a critical gap in the evaluation of LLMs for advanced enterprise-grade table reasoning and lays the groundwork for new research in LLM-based data management systems (Qiu et al., 29 Nov 2024).