Disentangle benchmark-format alignment from genuine research capability

Ascertain the extent to which the reported performance gains on the MLGym benchmark arise from improved alignment to the SWE-agent/MLGym execution format (including starter-code structure, evaluation scripts, submission conventions, and the turn-based reasoning–action loop) rather than from genuine improvements in machine learning research capability.

Background

The paper notes that while the synthetic tasks are diverse in content and grounded in real HuggingFace datasets, they share the same structural scaffold as MLGym and the SWE-agent interaction format. As a result, stronger performance on MLGym may partially reflect familiarity with the benchmark’s execution format rather than broader scientific or engineering skill.
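One way to probe this confound without leaving MLGym would be a format-perturbation ablation: re-run the agent on the same tasks after shuffling only the surface conventions (file names, submission command, prompt style) and measure the score drop. The sketch below is a minimal illustration under assumed interfaces; the task fields and the run_agent callable are hypothetical stand-ins, not part of the real MLGym API.

```python
import random
from statistics import mean

def perturb_format(task: dict, rng: random.Random) -> dict:
    """Return a copy of `task` with surface conventions shuffled while the
    underlying ML problem is untouched. The field names here are
    hypothetical, not the actual MLGym task schema."""
    t = dict(task)
    t["starter_file"] = rng.choice(["main.py", "solution.py", "run.py"])
    t["submit_cmd"] = rng.choice(["validate", "submit", "evaluate"])
    t["prompt_style"] = rng.choice(["terse", "verbose", "markdown"])
    return t

def format_sensitivity(run_agent, tasks, seed=0):
    """Mean score drop when only the execution format changes.

    `run_agent(task) -> float` is an assumed callable that runs the agent
    on one task and returns its benchmark score. A large drop suggests the
    gains were format-specific; a small drop suggests they transfer."""
    rng = random.Random(seed)
    canonical = mean(run_agent(t) for t in tasks)
    perturbed = mean(run_agent(perturb_format(t, rng)) for t in tasks)
    return canonical - perturbed
```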

Because evaluation is conducted only within MLGym, the authors state they cannot separate format-specific benefits from substantive capability gains. They suggest extending evaluation to benchmarks with different execution harnesses to address this limitation.
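The suggested extension can be framed as a difference-in-differences estimate: compare the tuned model's gain over its base model on MLGym against its gain on a benchmark with a different execution harness, and read the gap between the two gains as the format-specific component. A minimal sketch, assuming a generic score(model, benchmark) callable and hypothetical benchmark names:

```python
def format_alignment_gap(score, base_model: str, tuned_model: str) -> float:
    """Difference-in-differences estimate of format-specific gain.

    `score(model, benchmark) -> float` is an assumed evaluation callable.
    "mlgym" shares the training scaffold; "other_harness" stands for any
    benchmark with a different execution format (both names hypothetical).
    """
    in_format_gain = score(tuned_model, "mlgym") - score(base_model, "mlgym")
    out_format_gain = (score(tuned_model, "other_harness")
                       - score(base_model, "other_harness"))
    # A gap near zero supports genuine capability improvement; a large
    # positive gap points to benchmark-format alignment.
    return in_format_gain - out_format_gain
```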

References

"However, the structural scaffold (SWE-agent interaction format, turn-based reasoning-action loops) is shared by design, and we cannot fully disentangle format familiarity from substantive skill improvement with MLGym evaluation alone."

AI Scientist via Synthetic Task Scaling (2603.17216, Cai et al., 17 Mar 2026), in Discussion – Benchmark-format alignment vs. general capability