Autocurriculum under imperfect verification

Develop a theoretical framework for verifier-guided autocurriculum when the outcome verifier is imperfect, and establish guarantees for settings with noisy or learned reward models in place of the perfect outcome verifier assumed in the current analysis.

Background

All results in the paper assume access to a perfect outcome verifier that reliably determines whether the model’s final answer is correct for a given prompt, a natural assumption for verifiable domains such as mathematics and code. This assumption underpins both the SFT and RLVR analyses and algorithms (AutoTune and AutoTune.RL).

The authors explicitly state that extending their theory to scenarios with imperfect verification—such as noisy or learned reward models—is an open problem, highlighting the need to handle more realistic feedback conditions.
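To make the gap concrete, here is a minimal sketch of one common way to model imperfect verification: a symmetric label-flip corruption of the binary outcome reward. The function names (perfect_verifier, noisy_verifier), the exact-match check, and the flip probability eps are illustrative assumptions, not constructs from the paper.

```python
import random
from typing import Optional


def perfect_verifier(answer: str, ground_truth: str) -> float:
    """Idealized outcome verifier as assumed in the paper's analysis:
    a deterministic binary reward (exact string match, for illustration)."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def noisy_verifier(answer: str, ground_truth: str, eps: float = 0.1,
                   rng: Optional[random.Random] = None) -> float:
    """One simple corruption model (an assumption, not from the paper):
    the true binary reward is flipped with probability eps, standing in
    for a noisy or imperfectly learned reward model."""
    rng = rng or random.Random()
    r = perfect_verifier(answer, ground_truth)
    return 1.0 - r if rng.random() < eps else r


# The same (prompt, answer) pair can now receive inconsistent feedback,
# which distorts the pass-rate estimates an autocurriculum uses to rank
# prompt difficulty.
rng = random.Random(0)
rewards = [noisy_verifier("42", "42", eps=0.2, rng=rng) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # roughly 1 - eps = 0.8 in expectation
```

Under this flip-noise assumption, a prompt with true pass rate p has an observed pass rate concentrating around (1 - eps) * p + eps * (1 - p) = p + eps * (1 - 2p), so estimated difficulty is biased toward 1/2. A theory of verifier-guided autocurriculum under imperfect verification would need to account for, or correct, this kind of distortion in the curriculum's difficulty signal.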

References

Our framework assumes access to a perfect outcome verifier, which is natural for domains with verifiable rewards (math, code), but extending the theory to noisy or learned reward models is an important open problem.

Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum (arXiv:2603.18325, Rajaraman et al., 18 Mar 2026), in Discussion, Limitations and open directions: Imperfect verification