How well do LLMs reason over tabular data, really?
(2505.07453v2)
Published 12 May 2025 in cs.AI
Abstract: LLMs excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM's performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBleu and BERT-score. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
This paper, "How well do LLMs reason over tabular data, really?" (Wolff et al., 12 May 2025), investigates the true tabular reasoning capabilities of LLMs under realistic conditions. It highlights that previous evaluations often rely on flawed metrics or simplistic setups, failing to capture how LLMs perform on tabular data with common issues like missing values or duplicates.
The authors make several key contributions:
Critique of Existing Metrics: The paper demonstrates that standard free-form text metrics such as SacreBleu and BERT-score are unreliable for evaluating LLMs on tabular reasoning tasks: ground-truth answers are often short values or snippets, while LLMs produce longer, explanatory text. As a result, the score distributions for correct and incorrect answers overlap substantially, and the metrics cannot distinguish between them.
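A minimal sketch of this mismatch, assuming the sacrebleu and bert-score Python packages (the question, answers, and values below are illustrative, not taken from the benchmark): scoring a one-token ground truth against a verbose but correct response and against a terse but wrong one yields scores that do not cleanly separate the two.

```python
# Illustrative only: why n-gram and embedding metrics struggle when the
# ground truth is a short value but the model answers in full sentences.
import sacrebleu
from bert_score import score as bert_score

ground_truth = "42"                                   # short ground-truth value
correct_verbose = "Based on the table, the total number of orders in 2021 is 42."
wrong_terse = "41"                                    # incorrect but similarly short

for answer in (correct_verbose, wrong_terse):
    bleu = sacrebleu.sentence_bleu(answer, [ground_truth]).score
    _, _, f1 = bert_score([answer], [ground_truth], lang="en")
    print(f"{answer!r}: BLEU={bleu:.1f}, BERTScore-F1={f1.item():.3f}")
# Neither metric reliably ranks the correct verbose answer above the wrong terse one.
```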
Validation of LLM-as-a-Judge: To address the evaluation problem, the authors propose and validate using an LLM as an evaluator (LLM-as-a-judge). They show that a capable LLM (Qwen2.5 32B) can reliably judge the correctness of open-ended responses against ground-truth answers, reaching over 95% agreement with human annotations. This enables a more realistic evaluation of LLMs' free-form reasoning output.
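A minimal sketch of such a judging step (the prompt wording, the call_judge placeholder, and the verdict parsing are assumptions for illustration, not the paper's exact setup; the paper uses Qwen2.5 32B as the judge model):

```python
# Sketch of an LLM-as-a-judge check: the judge sees the question, the
# ground-truth answer, and the model's free-form response, and returns a verdict.
JUDGE_PROMPT = """You are grading an answer to a question about a table.
Question: {question}
Ground-truth answer: {ground_truth}
Model answer: {model_answer}
Does the model answer convey the same result as the ground truth?
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, ground_truth: str, model_answer: str, call_judge) -> bool:
    """call_judge is any function that sends a prompt to the judge model
    (e.g. a locally served Qwen2.5 32B) and returns its text completion."""
    prompt = JUDGE_PROMPT.format(
        question=question, ground_truth=ground_truth, model_answer=model_answer
    )
    verdict = call_judge(prompt).strip().upper()
    return "CORRECT" in verdict and "INCORRECT" not in verdict

# Example (hypothetical): judge("Which store had the highest revenue?", "Store B",
#                               "The highest revenue was recorded by Store B.", my_judge_fn)
```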
Extended Benchmark with Real-World Variations: Building on the TQA-Bench dataset (Qiu et al., 2024), the authors create variations of the tabular inputs to reflect common characteristics found in practice (a code sketch of these perturbations follows the list):
Missing Values: Data points needed for the answer are removed, and the ground truth is recalculated with the missing values treated as 0. Evaluation assesses both accuracy and whether the LLM acknowledges the missing data.
Duplicate Entities: Duplicate rows are introduced. Evaluation assesses accuracy (ignoring duplicates) and acknowledgment of duplicates.
Structural Variations: Column order is shuffled to test robustness to changes in presentation. Evaluation checks whether accuracy remains consistent under the shuffling.
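A sketch of how such perturbations can be applied to a pandas table (the column names, sampling fractions, and seeds are illustrative assumptions; the paper's exact procedure may differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def inject_missing(df: pd.DataFrame, column: str, frac: float = 0.2) -> pd.DataFrame:
    """Blank out a fraction of the values the query depends on; the ground
    truth is then recomputed with the missing values treated as 0."""
    out = df.copy()
    idx = out.sample(frac=frac, random_state=0).index
    out.loc[idx, column] = np.nan
    return out

def inject_duplicates(df: pd.DataFrame, frac: float = 0.2) -> pd.DataFrame:
    """Append duplicated rows; the ground truth ignores the duplicates."""
    dupes = df.sample(frac=frac, random_state=0)
    return pd.concat([df, dupes], ignore_index=True)

def shuffle_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Permute column order to test robustness to presentation changes."""
    return df[rng.permutation(df.columns)]

# Example: recompute the ground truth for a sum query after injecting missing values.
table = pd.DataFrame({"store": ["A", "B", "C", "D"], "revenue": [10, 20, 30, 40]})
perturbed = inject_missing(table, "revenue")
ground_truth = perturbed["revenue"].fillna(0).sum()   # missing treated as 0
```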
Evaluation of LLMs on Downscaled Data: The paper evaluates several LLMs (Llama 3.1 8B, Mistral 7B, Qwen2.5 7B, DeepSeek-R1 7B, and GPT-4o-mini) on downscaled versions of the TQA-Bench tables (1K to 8K tokens; a serialization sketch follows this list). Using the LLM-as-a-judge, they find:
A significant performance gap compared to previous multiple-choice evaluations on TQA-Bench. LLMs perform much worse when evaluated on open-ended answers.
Performance generally decreases as table size increases, particularly for complex calculations like averages and subtractions.
GPT-4o-mini consistently performs better than the other tested models and is more robust to increasing table size.
Basic tasks like entity lookups are generally easier, while complex aggregations and calculations are more challenging.
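The downscaling can be pictured as serializing a table and dropping rows until it fits a target token budget; the following is a rough sketch under that assumption (the paper inherits TQA-Bench's scaling setup, so the exact serialization and sampling may differ; tiktoken is used here only as a convenient tokenizer):

```python
import pandas as pd
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def downscale_table(df: pd.DataFrame, token_budget: int) -> str:
    """Serialize the table as markdown and drop trailing rows until the
    serialized text fits within the token budget (e.g. 1K to 8K tokens)."""
    n = len(df)
    text = df.to_markdown(index=False)
    while n > 1 and len(enc.encode(text)) > token_budget:
        n -= 1
        text = df.head(n).to_markdown(index=False)
    return text
```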
Robustness Analysis: Evaluating LLMs against the realistic variations, the paper reveals:
Missing Values: The effect on accuracy is mixed and depends on the task and model: some models drop in performance, while others improve slightly (possibly because they correctly refuse to answer when data is missing). Models acknowledge missing values only around 44-51% of the time, which is not reliable enough for practical use.
Duplicate Entities: Duplicates significantly impact accuracy across tasks, and LLMs are less likely to acknowledge duplicates (around 6-27% of the time) compared to missing values.
Structural Variations: Column shuffling has a relatively small impact on performance compared to missing values or duplicates, suggesting LLMs are less sensitive to column order presentation.
Practical Implications:
The findings indicate that while LLMs show promise in tabular reasoning, they are not yet robust enough for reliable use in real-world scenarios where data quality issues such as missing values and duplicates are common. Developers should be aware that LLM performance can degrade significantly under such variations, and that the models often fail to explicitly acknowledge these issues in their output, which can lead to misinterpretation or incorrect downstream actions. The paper stresses the importance of improving LLMs' ability to handle and communicate uncertainty arising from imperfect tabular data, as well as the need for robust evaluation methods such as LLM-as-a-judge to assess their true capabilities. The results highlight the need for further research into LLM architectures and training strategies aimed specifically at enhancing robustness in tabular data reasoning.