Scaling behavior of tabular reasoning with increasing input complexity

Characterize how language model performance on tabular reasoning scales as input complexity increases, including the effects of larger tables and longer contexts on reasoning accuracy.

Background

The authors note that existing table QA benchmarks typically use small, clean tables and do not systematically vary table size, leaving key aspects of long-context reasoning unassessed. Real-world tables often span hundreds or thousands of rows and require models to handle extended context while detecting and mitigating data artifacts.

RADAR is designed to vary table size and dimensionality while holding task semantics constant, enabling investigation of scaling effects; the paper highlights that prior work has left this scaling behavior unresolved.
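The protocol this implies can be sketched generically: hold the question and its answer semantics fixed while sweeping the table's row and column counts, then measure accuracy at each size. The Python sketch below is a hypothetical illustration under assumed names, not RADAR's actual harness; in particular, query_model, the synthetic table generator, and the fixed aggregation question are all assumptions introduced here.

    # Minimal sketch (assumed setup, not RADAR's harness): probe how accuracy
    # on a fixed aggregation question changes as the same table grows.
    import csv
    import io
    import random

    def make_table(n_rows: int, n_cols: int, seed: int = 0) -> list[dict]:
        """Synthesize a numeric table; the task's semantics stay constant."""
        rng = random.Random(seed)
        cols = [f"col_{i}" for i in range(n_cols)]
        return [{c: rng.randint(0, 100) for c in cols} for _ in range(n_rows)]

    def serialize(rows: list[dict]) -> str:
        """Render the table as CSV text for the model's context window."""
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()

    def query_model(prompt: str) -> str:
        """Hypothetical model call; replace with a real LLM client."""
        raise NotImplementedError

    def scaling_sweep(sizes=(10, 100, 1000), n_cols=5, trials=20):
        """Hold the question fixed while varying table size; report accuracy."""
        question = "What is the sum of col_0? Answer with a number only."
        for n_rows in sizes:
            correct = 0
            for t in range(trials):
                rows = make_table(n_rows, n_cols, seed=t)
                gold = sum(r["col_0"] for r in rows)
                answer = query_model(serialize(rows) + "\n" + question)
                correct += answer.strip() == str(gold)
            print(f"{n_rows:>5} rows: accuracy = {correct / trials:.2f}")

Because the ground-truth answer is computed directly from the generated table, any accuracy drop at larger sizes isolates the effect of input scale rather than task difficulty.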

References

"Moreover, they often overlook key factors such as table size, leaving open questions about how tabular reasoning scales with increasing input complexity."

RADAR: Benchmarking Language Models on Imperfect Tabular Data (arXiv:2506.08249, Gu et al., 9 Jun 2025), Section 2: Background and Related Work