Are LMs performing genuine data-aware reasoning on tabular data?

Determine whether large language models deployed as autonomous data science agents over tabular data are engaging in genuine data-aware reasoning—i.e., detecting, reasoning over, and appropriately handling data artifacts in the provided tables—rather than merely repeating templated analyses that do not depend on the actual dataset’s state and structure.

Background

The paper motivates RADAR by questioning whether LLMs truly reason over the data itself or simply apply canned analysis patterns. Data-aware reasoning, as defined here, involves recognizing and addressing artifacts such as missing values, outliers, formatting inconsistencies, and logical contradictions that are pervasive in real-world tabular datasets.

RADAR introduces programmatic perturbations and objective answer functions specifically to evaluate whether models actively identify and correct such artifacts, enabling a controlled test of whether model conclusions are grounded in the data rather than in templated responses.
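To make the evaluation idea concrete, here is a minimal illustrative sketch, not RADAR's actual code: a programmatic perturbation that injects missing values into a clean table, paired with an objective answer function that defines the ground-truth result. The table, column names, and the helpers `perturb_missing` and `answer_fn` are assumptions introduced purely for illustration.

```python
# Minimal sketch (hypothetical, not from the RADAR paper): inject a data
# artifact into a clean table and compute an objective ground-truth answer.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A tiny clean table; columns and values are illustrative only.
clean = pd.DataFrame({
    "region": ["N", "S", "E", "W", "N", "S"],
    "sales": [100.0, 120.0, 90.0, 110.0, 105.0, 115.0],
})

def perturb_missing(df: pd.DataFrame, column: str, frac: float = 0.3) -> pd.DataFrame:
    """Programmatic perturbation: replace a random fraction of `column` with NaN."""
    out = df.copy()
    idx = rng.choice(out.index, size=max(1, int(frac * len(out))), replace=False)
    out.loc[idx, column] = np.nan
    return out

def answer_fn(df: pd.DataFrame) -> float:
    """Objective answer function: mean sales after dropping missing rows."""
    return float(df["sales"].dropna().mean())

perturbed = perturb_missing(clean, "sales")
print("ground truth on clean table:    ", answer_fn(clean))
print("ground truth on perturbed table:", answer_fn(perturbed))
```

Because the answer function is recomputed on the perturbed table, a model that ignores the injected artifact and reports a templated result will miss the ground truth, whereas a model that detects and handles the missing values can recover it.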

References

It remains unclear whether they are merely repeating templated analyses or engaging in genuine data-aware reasoning—making decisions based on the actual state and structure of the dataset, much like an experienced data scientist would (see the paper's teaser figure).

RADAR: Benchmarking Language Models on Imperfect Tabular Data (2506.08249 - Gu et al., 9 Jun 2025) in Section 1: Introduction