This paper, "How well do LLMs reason over tabular data, really?" (Wolff et al., 12 May 2025 ), investigates the true tabular reasoning capabilities of LLMs under realistic conditions. It highlights that previous evaluations often rely on flawed metrics or simplistic setups, failing to capture how LLMs perform on tabular data with common issues like missing values or duplicates.
The authors make several key contributions:
- Critique of Existing Metrics: The paper demonstrates that standard free-form text metrics such as SacreBLEU and BERTScore are unreliable for evaluating LLMs on tabular reasoning tasks. Ground-truth answers are typically short values or snippets, while LLMs produce longer, explanatory text, so the score distributions for correct and incorrect answers overlap heavily and the metrics cannot tell the two apart (a toy illustration follows below).
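To illustrate the failure mode, here is a toy sketch (not from the paper) that scores a correct and an incorrect free-form answer against a short ground-truth value with the sacrebleu package; the question, answers, and values are invented for illustration.

```python
# Toy illustration: sentence-level SacreBLEU of long, explanatory answers
# against a short ground-truth value. Example answers are invented.
import sacrebleu

ground_truth = "42"  # typical short ground-truth snippet
correct_answer = ("Summing the 'orders' column for 2021 gives a total of 42, "
                  "so the answer is 42.")
wrong_answer = ("Summing the 'orders' column for 2021 gives a total of 57, "
                "so the answer is 57.")

for label, answer in [("correct", correct_answer), ("wrong", wrong_answer)]:
    bleu = sacrebleu.sentence_bleu(answer, [ground_truth])
    print(f"{label}: BLEU = {bleu.score:.2f}")

# Both scores collapse toward zero because the reference is a single token,
# so n-gram overlap cannot separate the correct answer from the wrong one.
```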
- Validation of LLM-as-a-Judge: To address the evaluation problem, the authors propose and validate using an LLM as an evaluator (LLM-as-a-judge). They show that a capable LLM (Qwen2.5 32B) can reliably judge the correctness of open-ended responses against ground-truth answers, agreeing with human annotations in over 95% of cases. This allows for a more realistic evaluation of LLMs' free-form reasoning output; a minimal sketch of such a judging step follows below.
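The prompt wording and the call_judge callable below (standing in for whatever chat-completion client queries the judge model) are assumptions for illustration, not the authors' actual prompt or code.

```python
# Minimal LLM-as-a-judge sketch; the prompt and client are illustrative only.
from typing import Callable

JUDGE_PROMPT = """You are grading a model's answer to a question about a table.
Question: {question}
Ground-truth answer: {ground_truth}
Model answer: {model_answer}
Is the model answer consistent with the ground truth (allowing rounding and
rephrasing)? Reply with exactly one word: CORRECT or INCORRECT."""


def judge_answer(question: str, ground_truth: str, model_answer: str,
                 call_judge: Callable[[str], str]) -> bool:
    """Return True if the judge model deems the free-form answer correct.

    call_judge is a placeholder for any chat-completion client, e.g. one
    backed by a Qwen2.5 32B endpoint as used in the paper.
    """
    prompt = JUDGE_PROMPT.format(question=question,
                                 ground_truth=ground_truth,
                                 model_answer=model_answer)
    verdict = call_judge(prompt).strip().upper()
    return verdict.startswith("CORRECT")
```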
- Extended Benchmark with Real-World Variations: Building on the TQA-Bench dataset [qiu2024tqa], the authors create variations of the tabular inputs that reflect common characteristics found in practice (a toy pandas sketch of these perturbations appears after this list):
- Missing Values: Relevant data points needed for the answer are removed, and the ground truth is recalculated (treating missing as 0). Evaluation assesses both accuracy and whether the LLM acknowledges the missing data.
- Duplicate Entities: Duplicate rows are introduced. Evaluation assesses accuracy (with the ground truth computed on the de-duplicated table) and whether the LLM acknowledges the duplicates.
- Structural Variations: Column order is shuffled to test robustness to presentation changes. Evaluation checks whether accuracy stays consistent under the reordering.
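As referenced above, the following is a toy pandas sketch of the three perturbation types; the table contents, sampling choices, and the fillna(0) recomputation follow the paper's description but are otherwise assumptions, not the authors' implementation.

```python
# Toy perturbations of a tabular input; details are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({"entity": ["a", "b", "c", "d"],
                   "value": [10.0, 20.0, 30.0, 40.0]})

# 1) Missing values: blank out some cells the question depends on, then
#    recompute the ground truth treating missing entries as 0.
df_missing = df.copy()
drop_idx = df_missing.sample(frac=0.25, random_state=0).index
df_missing.loc[drop_idx, "value"] = np.nan
ground_truth_sum = df_missing["value"].fillna(0).sum()

# 2) Duplicate entities: append copies of some rows; the reference answer is
#    still computed on the de-duplicated table.
df_dupes = pd.concat([df, df.sample(n=2, random_state=0)], ignore_index=True)
ground_truth_dedup_sum = df_dupes.drop_duplicates()["value"].sum()

# 3) Structural variation: shuffle only the column order; cell values and the
#    ground truth are unchanged.
rng = np.random.default_rng(0)
df_shuffled = df[list(rng.permutation(df.columns))]
```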
- Evaluation of LLMs on Downscaled Data: The paper evaluates several LLMs (Llama3.1 8B, Mistral 7B, Qwen2.5 7B, DeepSeek-R1 7B, and GPT-4o-mini) on downscaled versions of the TQA-Bench tables (1K to 8K tokens; a rough downscaling sketch follows this list). Using the LLM-as-a-judge evaluation, they find:
- A significant performance gap compared to previous multiple-choice evaluations on TQA-Bench. LLMs perform much worse when evaluated on open-ended answers.
- Performance generally decreases as table size increases, particularly for complex calculations like averages and subtractions.
- GPT-4o-mini consistently performs better than the other tested models and is more robust to increasing table size.
- Basic tasks like entity lookups are generally easier, while complex aggregations and calculations are more challenging.
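For context on the downscaling setup, below is a rough sketch of trimming a table so that its serialisation fits an approximate token budget; the markdown serialisation and the four-characters-per-token heuristic are assumptions for illustration, not the authors' procedure.

```python
# Rough sketch: keep the largest row-prefix of a table whose serialisation
# fits an approximate token budget (e.g. 1K-8K tokens).
import pandas as pd


def downscale_table(df: pd.DataFrame, token_budget: int) -> pd.DataFrame:
    """Return the largest row-prefix whose markdown form fits the budget."""
    kept = df.head(0)
    for n_rows in range(1, len(df) + 1):
        candidate = df.head(n_rows)
        # Crude heuristic: ~4 characters per token; a real tokenizer could
        # be substituted here.
        approx_tokens = len(candidate.to_markdown(index=False)) // 4
        if approx_tokens > token_budget:
            break
        kept = candidate
    return kept
```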
- Robustness Analysis: Evaluating LLMs against the realistic variations, the paper reveals:
- Missing Values: LLMs show mixed accuracy impacts depending on the task and model. While some models drop in performance, others show slight improvement (possibly due to correctly refusing to answer when data is missing). Models often acknowledge missing values (around 44-51% of the time) but not reliably enough for practical use.
- Duplicate Entities: Duplicates significantly impact accuracy across tasks, and LLMs are less likely to acknowledge duplicates (around 6-27% of the time) compared to missing values.
- Structural Variations: Column shuffling has a relatively small impact on performance compared to missing values or duplicates, suggesting LLMs are comparatively insensitive to column ordering.
Practical Implications:
The findings indicate that while LLMs show promise in tabular reasoning, they are currently not robust enough for reliable application in real-world scenarios where data quality issues (missing values, duplicates) are common. Developers should be aware that LLMs' performance can degrade significantly in the presence of such variations, and they often fail to explicitly acknowledge these issues in their output, which could lead to misinterpretation or incorrect downstream actions. The paper stresses the importance of improving LLMs' ability to handle and communicate uncertainty arising from imperfect tabular data, as well as the need for robust evaluation methods like LLM-as-a-judge for assessing their true capabilities. The results highlight the necessity for further research and development into LLM architectures and training strategies specifically aimed at enhancing robustness for tabular data reasoning.