Are LMs performing genuine data-aware reasoning on tabular data?
Determine whether large language models deployed as autonomous data science agents over tabular data engage in genuine data-aware reasoning, i.e., detecting, reasoning over, and appropriately handling data artifacts in the provided tables, rather than merely repeating templated analyses that do not depend on the dataset's actual state and structure.
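For concreteness, a minimal sketch in pandas of the distinction the question draws: a templated analysis computes summary statistics regardless of the table's contents, while a data-aware analysis first detects an artifact and handles it before aggregating. The sentinel-value artifact and the handling shown here are illustrative assumptions, not the paper's benchmark taxonomy.

```python
import pandas as pd
import numpy as np

# Illustrative table with an assumed artifact: the sentinel value -999
# standing in for missing temperature readings (chosen for illustration only).
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "temp_c": [21.5, -999, 19.0, -999],
})

# Templated analysis: aggregate as-is, ignoring the dataset's actual state.
templated_mean = df["temp_c"].mean()          # skewed by the sentinel values

# Data-aware analysis: detect the artifact first, then handle it explicitly.
cleaned = df["temp_c"].replace(-999, np.nan)  # treat sentinels as missing
data_aware_mean = cleaned.mean()              # mean over valid readings only

print(f"templated: {templated_mean:.2f}, data-aware: {data_aware_mean:.2f}")
```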
References
It remains unclear whether they are merely repeating templated analyses or engaging in genuine data-aware reasoning, making decisions based on the actual state and structure of the dataset, much like an experienced data scientist would (see the teaser figure in the paper).
— RADAR: Benchmarking Language Models on Imperfect Tabular Data (Gu et al., arXiv:2506.08249, 9 Jun 2025), Section 1: Introduction