Quantifying reasoning versus retrieval in agent performance

Determine the extent to which the observed performance improvements of large language model–based AI research agents on MLE-bench Kaggle tasks are attributable to genuine reasoning, as opposed to latent data retrieval, that is, recall of publicly available solutions encountered during pre-training.

Background

The paper reports strong gains for an LLM-driven research agent on MLE-bench but cautions that many winning Kaggle solutions are publicly available and may have been seen during model pre-training. As compute and search iterations increase, there is a risk that the agent’s improvements arise from recalling memorized patterns instead of generating novel insights.
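
One way to make "latent data retrieval" measurable is a direct memorization probe: compare each agent-generated solution against the corpus of public Kaggle solutions and flag suspiciously high textual overlap. The sketch below is illustrative, not the paper's procedure; the file paths, the 8-gram window, and the 0.5 flagging threshold are all assumptions chosen for the example.

```python
# Minimal memorization probe (illustrative): flag agent solutions whose token
# n-grams overlap heavily with any publicly available Kaggle solution.
# All paths, the n-gram size, and the cutoff below are hypothetical.

from pathlib import Path


def ngrams(tokens: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    return len(cand & ref) / len(cand) if cand else 0.0


# Hypothetical layout: one agent solution vs. a directory of public write-ups.
agent_solution = Path("agent_solution.py").read_text()
scores = {
    p.name: overlap_score(agent_solution, p.read_text())
    for p in Path("public_solutions").glob("*.py")
}
flagged = sorted(name for name, s in scores.items() if s > 0.5)  # illustrative cutoff
print(f"max overlap: {max(scores.values(), default=0.0):.2f}; flagged: {flagged}")
```

Token n-gram overlap is a coarse proxy: it catches near-verbatim recall but misses paraphrased or refactored reuse, for which embedding similarity or execution-trace comparison would be better suited.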

The authors note that this attribution problem persists despite careful evaluation design and suggest future assessment on closed or private benchmarks to isolate true research capability. Resolving this uncertainty is necessary to accurately measure genuine reasoning-driven progress in AI research agents.
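
The suggestion of closed benchmarks also admits a quantitative framing: run the same agent on tasks whose winning solutions are public and on tasks with no public solution, and treat the score gap as an estimate of the retrieval contribution. The sketch below is a minimal version of that analysis under assumed inputs; the score lists are placeholders, not results from the paper.

```python
# Minimal contamination-gap estimate (illustrative): compare agent scores on
# tasks with public solutions vs. tasks with none, with a bootstrap CI on the
# mean difference. The score lists are placeholders, not results from the paper.

import random


def bootstrap_gap(public: list[float], private: list[float],
                  iters: int = 10_000, seed: int = 0) -> tuple[float, float, float]:
    """Point estimate and 95% bootstrap CI for mean(public) - mean(private)."""
    rng = random.Random(seed)
    point = sum(public) / len(public) - sum(private) / len(private)
    gaps = sorted(
        sum(rng.choices(public, k=len(public))) / len(public)
        - sum(rng.choices(private, k=len(private))) / len(private)
        for _ in range(iters)
    )
    return point, gaps[int(0.025 * iters)], gaps[int(0.975 * iters)]


# Hypothetical per-task scores on the two splits.
public_scores = [0.81, 0.77, 0.92, 0.85, 0.74]
private_scores = [0.63, 0.70, 0.58, 0.66, 0.61]
gap, lo, hi = bootstrap_gap(public_scores, private_scores)
print(f"estimated retrieval-attributable gap: {gap:.3f} (95% CI [{lo:.3f}, {hi:.3f}])")
```

A gap whose confidence interval excludes zero would suggest that part of the measured gain tracks solution availability rather than reasoning ability; matching task difficulty across the two splits is the main confound this toy version ignores.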

References

"While performance gains scale with increased compute, it remains difficult to determine how much of the improvement stems from genuine reasoning versus latent data retrieval."

AIRA_2: Overcoming Bottlenecks in AI Research Agents (arXiv:2603.26499, Hambardzumyan et al., 27 Mar 2026), Section 7 (Limitations), Data Contamination.