Autocurriculum with self-play and evolving coverage

Determine how verifier-guided autocurriculum behaves when the reference model is updated via self-play across rounds. In particular, ascertain whether the coverage-dependent burn-in compute cost of AutoTune.RL, which scales proportionally to d times kappa, can be further reduced when kappa, the sequence-level coverage coefficient of the reference model (defined by Pr(y* | x) ≥ 1/kappa for the optimal chain-of-thought y*), improves over iterations.

Background

The paper’s RLVR results assume a fixed reference model that satisfies sequence-level coverage with parameter kappa, meaning the reference assigns probability at least 1/kappa to the optimal chain-of-thought on each prompt. Under this assumption, the AutoTune.RL procedure achieves accuracy 1 − ε with total rollout cost scaling as O(d·kappa + d/ε), where the O(d·kappa) term is a coverage-dependent burn-in cost and the O(d/ε) term is nearly coverage-independent.
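The cost split above can be made concrete with a minimal sketch. This is purely illustrative: the paper states only the asymptotic form O(d·kappa + d/ε), so the unit constants and the clean additive decomposition below are assumptions, and `rollout_cost` is a hypothetical helper, not an implementation of AutoTune.RL.

```python
# Illustrative sketch of the rollout-cost bound O(d * kappa + d / eps).
# Constants are assumed to be 1; the paper gives only the asymptotic form.

def rollout_cost(d: int, kappa: float, eps: float) -> float:
    """Total rollout cost: coverage-dependent burn-in plus accuracy term."""
    burn_in = d * kappa      # coverage-dependent burn-in, O(d * kappa)
    accuracy = d / eps       # nearly coverage-independent term, O(d / eps)
    return burn_in + accuracy

# With d = 100, kappa = 1e4, eps = 0.1, the burn-in term (1e6) dwarfs
# the accuracy term (1e3), which is why reducing kappa matters.
print(rollout_cost(100, kappa=1e4, eps=0.1))
```

The point of the sketch is that for poorly covered references (large kappa) the burn-in term dominates, so any self-play scheme that shrinks kappa attacks the dominant cost.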

In practice, methods like ReST and expert iteration repeatedly use the current model as the next round’s reference, potentially improving coverage over time. The authors explicitly identify understanding how autocurriculum interacts with such coverage improvement, and whether the burn-in cost can be reduced through self-play, as an open question.
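A toy model of this self-play loop can make the open question concrete. Everything here is a hypothetical assumption: the geometric shrink factor for kappa, the unit constants, and the `self_play_burn_in` helper are illustrative, not results from the paper.

```python
# Hypothetical self-play loop: each round promotes the current model to be
# the next round's reference, which (by assumption) shrinks kappa by a
# constant factor. All numbers are illustrative.

def self_play_burn_in(d: int, kappa0: float, shrink: float,
                      rounds: int) -> list[float]:
    """Per-round burn-in cost d * kappa_t under geometric coverage improvement."""
    costs, kappa = [], kappa0
    for _ in range(rounds):
        costs.append(d * kappa)  # burn-in paid with the current reference
        kappa *= shrink          # assumed coverage gain from self-play
    return costs

costs = self_play_burn_in(d=100, kappa0=1e4, shrink=0.5, rounds=4)
# Per-round burn-in decreases geometrically; whether the *cumulative*
# cost across rounds can be driven below the one-shot d * kappa0
# burn-in is exactly the open question posed above.
```

Note that in this toy model the cumulative cost of repeated burn-ins can exceed the one-shot cost, so a positive answer presumably requires amortizing rollouts across rounds rather than restarting the burn-in each time.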

References

Understanding how autocurriculum interacts with improving coverage across rounds, and whether the burn-in cost (d κ) can be further reduced through self-play, is an important open question.

Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum (2603.18325 - Rajaraman et al., 18 Mar 2026) in Discussion, Limitations and open directions — Self-play and iterated self-improvement