Autocurriculum with self-play and evolving coverage
Determine how verifier-guided autocurriculum behaves when the reference model is updated via self-play across rounds. In particular, ascertain whether the coverage-dependent burn-in compute cost of AutoTune.RL, which scales proportionally to d·κ, can be reduced further when κ improves over iterations. Here κ is the sequence-level coverage coefficient of the reference model, defined by Pr(y* | x) ≥ 1/κ for the optimal chain-of-thought y*.
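One way to make the question concrete is sketched below. The round index t, the per-round reference model and coefficient κ_t, and the symbol C_burn are illustrative notation assumed here, not definitions taken from the paper. Since coverage is defined by Pr(y* | x) ≥ 1/κ, improved coverage corresponds to a smaller κ.

```latex
% Coverage of the reference model after round t of self-play.
% The round index t and the symbol \kappa_t are assumed notation.
\Pr_{\pi_{\mathrm{ref}}^{(t)}}\left( y^\star \mid x \right) \;\ge\; \frac{1}{\kappa_t},
\qquad \kappa_0 = \kappa .

% Burn-in compute cost as stated for a fixed reference model
% (C_burn is an assumed name for that cost).
C_{\mathrm{burn}} \;\propto\; d\,\kappa .

% Open question: if \kappa_t shrinks across rounds (coverage improves),
% can the burn-in cost be made to depend on the improved \kappa_t
% rather than on the initial \kappa_0?
```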
References
Understanding how autocurriculum interacts with improving coverage across rounds, and whether the burn-in cost (d κ) can be further reduced through self-play, is an important open question.
— Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
(2603.18325 - Rajaraman et al., 18 Mar 2026) in Discussion, Limitations and open directions — Self-play and iterated self-improvement