A Comprehensive Evaluation of Table Question Answering in Industrial Scenarios
The paper "TableBench: A Comprehensive and Complex Benchmark for Table Question Answering" addresses the significant challenges faced by LLMs when tasked with processing tabular data in industrial settings. Despite notable advances in LLMs like GPT-3.5 and GPT-4, their practical application in table question answering (TableQA) often falls short due to the intricate reasoning required in real-world scenarios. This paper introduces TableBench, a carefully curated and complex benchmark designed to bridge the gap between academic-focused and industry-applicable evaluation frameworks.
Contributions and Methodology
The authors present several key contributions:
- Development of TableBench: A meticulously annotated benchmark comprising 886 samples spanning 18 fields within four major categories of TableQA tasks: fact checking, numerical reasoning, data analysis, and visualization.
- Introduction of TableInstruct: A substantial TableQA instruction corpus covering multiple reasoning methodologies, including Textual Chain-of-Thought (TCoT), Symbolic Chain-of-Thought (SCoT), and Program-of-Thought (PoT); a brief sketch of the PoT style follows this list.
- Performance Evaluation: An extensive evaluation of over 30 proprietary and open-source LLMs on TableBench, accompanied by a leaderboard that tracks model capabilities across the aforementioned TableQA tasks.
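To make the reasoning styles concrete, below is a minimal sketch of the Program-of-Thought idea: rather than answering in free text, the model emits executable code over the table, which a harness then runs to obtain the final answer. The table, question, and pandas code are illustrative assumptions, not material from TableBench.

```python
# Minimal PoT sketch (illustrative only): the model's "answer" is executable
# code over the table rather than free-form text.
import pandas as pd

# Hypothetical table, as it might be serialized into the prompt.
table = pd.DataFrame(
    {
        "year": [2019, 2020, 2021],
        "revenue_musd": [120.5, 98.3, 143.7],
        "region": ["EMEA", "EMEA", "APAC"],
    }
)

question = "By how much did revenue change from 2019 to 2021?"

# A PoT-style response is code like the following, which the evaluation
# harness would execute to obtain the final answer.
rev_2019 = table.loc[table["year"] == 2019, "revenue_musd"].iloc[0]
rev_2021 = table.loc[table["year"] == 2021, "revenue_musd"].iloc[0]
answer = rev_2021 - rev_2019
print(f"{answer:.1f}")  # 23.2
```

In contrast, a TCoT response would spell out the same steps in natural language, and an SCoT response would express them as intermediate symbolic operations before stating the answer.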
The paper categorizes the task complexity based on the required reasoning steps and employs a rigorous annotation workflow to construct TableBench. The benchmark's tasks demand multi-step reasoning, highlighting the complexities encountered in real-world applications.
Results and Observations
Experimental results underscore that, despite recent improvements, a significant gap remains between LLM performance and human capabilities on TableQA tasks. Notably, even advanced models such as GPT-4 achieve only modest performance relative to human evaluators, particularly on questions that require nuanced computational reasoning and information synthesis.
The research demonstrates that fine-tuning open-source models on the TableInstruct corpus yields substantial performance gains. TableLLM, fine-tuned from open-source base models on this corpus, reaches performance comparable to GPT-3.5, validating the efficacy of the proposed instruction corpus.
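As a rough illustration of what instruction-tuning data for TableQA can look like, the snippet below packs a table, question, and chain-of-thought answer into a single prompt/response record. The field names and prompt template are assumptions for illustration and do not reflect the actual TableInstruct schema.

```python
# Hypothetical sketch of assembling one TableQA instruction-tuning record.
# The schema and prompt wording are assumptions, not the TableInstruct format.
import json

def make_sft_record(table_markdown: str, question: str, reasoning: str, answer: str) -> dict:
    """Pack one table, question, and chain-of-thought answer into a prompt/response pair."""
    prompt = (
        "You are given a table. Answer the question, reasoning step by step.\n\n"
        f"Table:\n{table_markdown}\n\nQuestion: {question}"
    )
    response = f"{reasoning}\nFinal answer: {answer}"
    return {"prompt": prompt, "response": response}

record = make_sft_record(
    table_markdown="| year | revenue |\n|------|---------|\n| 2019 | 120.5 |\n| 2021 | 143.7 |",
    question="Did revenue grow between 2019 and 2021?",
    reasoning="Revenue was 120.5 in 2019 and 143.7 in 2021, so it increased.",
    answer="Yes",
)
print(json.dumps(record, indent=2))
```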
Implications and Future Directions
The insights gained from constructing and evaluating TableBench have direct implications both for the practical deployment of LLMs in industrial settings and for the theoretical understanding of LLM capabilities. The paper argues for the development of reasoning-enhanced LLMs that can handle more diverse and complex tabular data tasks.
The authors identify areas for future research, including improving the rate at which generated code in PoT scenarios executes successfully (a simple way to measure this is sketched below) and addressing the reasoning complexities associated with larger and more complex tables. These findings emphasize the need for stronger instruction-following capabilities and a deeper comprehension of tabular nuances in LLMs.
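As a minimal sketch of how the executable-code success rate mentioned above could be measured, the helper below runs each generated PoT snippet in an isolated namespace and counts the fraction that finish without raising. The function name and the use of exec() are illustrative assumptions, not the paper's evaluation harness.

```python
# Assumed sketch (not the paper's harness): estimate the fraction of generated
# PoT snippets that execute without raising an exception.
def executable_rate(snippets: list[str]) -> float:
    """Return the fraction of code snippets that run to completion."""
    successes = 0
    for code in snippets:
        try:
            exec(code, {}, {})  # fresh, isolated namespace for each snippet
            successes += 1
        except Exception:
            # Syntax errors, missing columns, and runtime failures all count
            # as non-executable outputs.
            pass
    return successes / len(snippets) if snippets else 0.0

# One well-formed snippet and one that raises a NameError.
print(executable_rate(["x = 1 + 1", "y = undefined_name + 1"]))  # 0.5
```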
Conclusion
In conclusion, the paper presents TableBench as a crucial step in advancing LLMs in TableQA tasks reflective of industrial needs. It provides a robust platform for evaluating model limitations and progress, thereby fostering further advancements in LLM architecture and training strategies. While the present models exhibit significant potential, continued refinement and innovation are necessary to achieve parity with human reasoning and comprehension in tabular data interpretation.