SPIRAL: Teaching Language Models to Search, Explore, and Synthesize
This presentation explores SPIRAL, a reinforcement learning framework that trains language models to coordinate three inference primitives (sequential reasoning, parallel exploration, and aggregation) end-to-end. By aligning training with test-time scaffolds through set-based credit assignment, SPIRAL achieves up to 11-fold improvements in parallel scaling efficiency and 13.5% gains in recursive self-aggregation compared to standard RL methods, fundamentally changing how models utilize compute for complex reasoning tasks.
Script
Most reinforcement learning for language models trains them to optimize a single chain of thought, but at test time we actually use scaffolds that coordinate sequential reasoning, parallel exploration, and synthesis. This mismatch cripples how models use compute when it matters most.
SPIRAL bridges this gap by training models end to end over all three primitives. The key insight is set reinforcement learning: instead of rewarding individual traces, SPIRAL assigns collective rewards to sets of generations, teaching the model to produce diverse candidates that work together during aggregation.
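To make the idea of a collective reward concrete, here is a minimal sketch of one way to credit traces at the set level. It is an illustration under stated assumptions, not SPIRAL's actual objective: the reward (1 if any trace in the set is correct) and the leave-one-out credit rule are both hypothetical choices, and the names `set_reward` and `leave_one_out_credit` are invented for this example.

```python
from typing import List


def set_reward(correct: List[bool]) -> float:
    """Hypothetical collective reward: 1.0 if any trace in the set is correct."""
    return 1.0 if any(correct) else 0.0


def leave_one_out_credit(correct: List[bool]) -> List[float]:
    """Credit each trace by how much the set reward drops when it is removed.

    This is an assumed credit rule for illustration only.
    """
    full = set_reward(correct)
    credits = []
    for i in range(len(correct)):
        rest = correct[:i] + correct[i + 1:]  # the set without trace i
        credits.append(full - set_reward(rest))
    return credits


# A lone correct trace carries the whole set, so it receives all the credit.
print(leave_one_out_credit([True, False, False]))
# Two correct traces are redundant with each other, so neither is pivotal.
print(leave_one_out_credit([True, True, False]))
```

Under this toy rule, a trace is rewarded only when the set would fail without it, which is one way a set-level signal can favor candidates that add something the others lack.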
The results are striking. When scaling parallel compute, SPIRAL achieves 11-fold improvements in coverage efficiency compared to standard reinforcement learning. The model learns to generate traces that are individually diverse but collectively useful, maintaining high entropy where baselines collapse to repetitive outputs.
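Coverage under parallel sampling is commonly reported as pass@k, the probability that at least one of k sampled traces is correct. The script does not say how its coverage efficiency figure is computed, so the sketch below only shows the standard unbiased pass@k estimator, assuming n sampled traces of which c are correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few incorrect samples to fill a draw of size k,
        # so every draw contains at least one correct trace.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 2 correct traces out of 4, a pair misses only when both picks are wrong.
print(pass_at_k(n=4, c=2, k=2))
```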
SPIRAL's advantage compounds under recursive self-aggregation, where the model repeatedly synthesizes sets of traces into refined outputs. Here, pass at 1 accuracy improves by 13.5 percent, demonstrating that SPIRAL-trained models produce traces that verify, refine, and combine more effectively than those from conventional training.
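The recursive loop described above can be sketched roughly as follows. This is an assumed control flow, not the paper's implementation: `aggregate` is a hypothetical callable standing in for a model call that synthesizes a group of traces into one refined trace, and the random grouping, `group_size`, and `rounds` parameters are illustrative choices.

```python
import random
from typing import Callable, List, Sequence


def recursive_self_aggregation(
    population: List[str],
    aggregate: Callable[[Sequence[str]], str],
    group_size: int,
    rounds: int,
    seed: int = 0,
) -> List[str]:
    """Repeatedly replace the population with aggregates of sampled groups."""
    if not 1 <= group_size <= len(population):
        raise ValueError("group_size must be between 1 and len(population)")
    rng = random.Random(seed)
    for _ in range(rounds):
        # Each new trace is synthesized from a random group of current traces.
        population = [
            aggregate(rng.sample(population, group_size))
            for _ in range(len(population))
        ]
    return population


# Toy stand-in aggregator: keep the longest trace in the group.
traces = ["a", "bb", "ccc", "d"]
print(recursive_self_aggregation(traces, lambda g: max(g, key=len), 2, 3))
```

The population size stays fixed across rounds, so any improvement has to come from the quality of the aggregation step rather than from adding more samples.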
Learned aggregation decisively outperforms hand-crafted rules like majority voting in high compute regimes. While rule-based methods plateau, SPIRAL's model-based synthesis continues scaling, revealing that explicit credit assignment over candidate populations unlocks synergies that rigid heuristics cannot capture.
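For reference, the majority-voting baseline is simple enough to write in a few lines. The sketch below is a generic version of the rule, not code from the paper, and the sample answers are made up for illustration.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the candidate traces."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


# If "42" were the correct answer here, the vote would still pick "41":
# the rule can only return the most common answer, never a minority one.
print(majority_vote(["41", "41", "42"]))
```

A rule like this looks only at final answers and ignores the reasoning inside each trace, which is the kind of rigidity a learned aggregator is meant to overcome.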
SPIRAL redefines how we train models for complex reasoning by aligning reinforcement learning with the scaffolds we actually deploy. By teaching models to coordinate search, exploration, and synthesis, this work opens a principled path toward systems that truly scale with compute. Explore the full paper and create your own video at EmergentMind.com.