Quantifying reasoning versus retrieval in agent performance
Determine how much of the observed performance improvement of large language model–based AI research agents on MLE-bench Kaggle tasks is attributable to genuine reasoning, as opposed to latent data retrieval of publicly available solutions encountered during pre-training.
References
While performance gains scale with increased compute, it remains difficult to determine how much of the improvement stems from genuine reasoning versus latent data retrieval.
— AIRA_2: Overcoming Bottlenecks in AI Research Agents
(2603.26499 - Hambardzumyan et al., 27 Mar 2026) in Section 7 (Limitations), Data Contamination