Disentangle benchmark-format alignment from genuine research capability
Determine how much of the reported performance gain on the MLGym benchmark comes from better alignment with the SWE-agent/MLGym execution format (starter code structure, evaluation scripts, submission conventions, and turn-based reasoning-action loops) and how much reflects genuine improvement in machine learning research capability.
References
However, the structural scaffold (SWE-agent interaction format, turn-based reasoning-action loops) is shared by design, and we cannot fully disentangle format familiarity from substantive skill improvement with MLGym evaluation alone.
— AI Scientist via Synthetic Task Scaling
(2603.17216 - Cai et al., 17 Mar 2026) in Discussion – Benchmark-format alignment vs. general capability