Scaling Self-Play with Self-Guidance

This presentation explores a breakthrough in training language models through self-play for formal theorem proving. The research demonstrates how adding a 'Guide' component prevents the collapse of self-generated training data, enabling continuous improvement over extended training runs. By carefully managing the interaction between problem generation, solution attempts, and quality assessment, the authors achieve remarkable compute efficiency—matching the performance of a 671 billion parameter model using only 7 billion parameters through sustained self-play.
Script
Training language models to prove mathematical theorems hits a wall: the model learns to generate synthetic practice problems that look hard but teach nothing. The authors discovered that sustained improvement requires three roles working in concert—a Solver attempting proofs, a Conjecturer generating new problems, and critically, a Guide that prevents the system from gaming itself.
Without the Guide, the Conjecturer exploits the reward system by producing increasingly convoluted problems—long, full of disjunctions, structurally bizarre—that satisfy the difficulty metric but offer zero learning value. With the Guide assessing relatedness and elegance, the system stays on course, and solve rates climb continuously for over 6 billion training tokens.
Here's the mechanism: for each unsolved theorem, the Conjecturer generates a simpler related problem. The Solver attempts both original and synthetic problems, with solutions automatically verified in Lean 4. The Guide scores each synthetic problem on usefulness and formulation quality, and this score multiplies into the Conjecturer's reward—creating a direct incentive for generating problems that actually advance learning.
The results are striking. Self-Guided Self-Play achieves a 7 percent higher asymptotic solve rate than the strongest reinforcement learning baseline. More remarkably, a 7 billion parameter model trained with SGS matches the performance of a 671 billion parameter model—nearly 100 times larger—demonstrating extraordinary compute efficiency through sustained self-play.
The choice of reinforcement learning objective proves critical. Grouped objectives like CISPO cause entropy collapse in the Solver—outputs become either trivial or impossible, starving the Conjecturer of useful reward signal and halting all progress. Vanilla REINFORCE preserves output diversity, maintaining the gradient flow that keeps both Solver and Conjecturer improving across hundreds of epochs.
Self-Guided Self-Play reveals that scaling self-improvement in language models demands more than data volume—it requires explicit safeguards against reward hacking and mode collapse. The Guide's adversarial oversight turns synthetic data generation from a liability into a curriculum that actually teaches. To explore how emergent AI capabilities like these are reshaping research, visit EmergentMind.com and create your own videos from the latest papers.