- The paper demonstrates that competitive self-play in multi-agent zero-sum games autonomously develops generalizable LLM reasoning skills without human-curated data.
- It introduces a novel Role-conditioned Advantage Estimation method to stabilize policy gradients during multi-turn reinforcement learning in high-variance game environments.
- Empirical results show 8–18% benchmark improvements, highlighting the framework’s potential for scalable, domain-agnostic reasoning enhancement.
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
The SPIRAL framework presents a significant advance in the development of reasoning capabilities in LLMs through self-play in multi-turn, zero-sum games. The core contribution is the demonstration that competitive, multi-agent environments can serve as effective, domain-agnostic curricula for incentivizing the emergence of generalizable reasoning skills, without reliance on human-curated datasets or reward engineering.
Framework and Methodology
SPIRAL is built on a fully online, distributed actor-learner architecture that enables scalable self-play training for LLMs. The system supports multi-turn, multi-agent reinforcement learning (MARL) across a suite of two-player zero-sum games. The key technical innovation is the introduction of Role-conditioned Advantage Estimation (RAE), a variance reduction technique that maintains separate baselines for each player role and game, stabilizing policy gradient updates in the inherently high-variance, non-stationary setting of self-play.
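The idea behind RAE can be sketched in a few lines. The following is a minimal, illustrative sketch only, not the paper's exact implementation: the class and method names (`RoleConditionedAdvantage`, `update`, `advantage`) and the use of an exponential moving average for the baseline are assumptions made for clarity.

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """Per-(game, role) baselines for variance reduction in self-play policy gradients.

    Illustrative sketch: RAE keeps a separate baseline for each player role in
    each game; here the baseline is approximated by an exponential moving
    average of observed episode returns (an assumption, not the paper's spec).
    """

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baselines = defaultdict(float)   # (game, role) -> running baseline
        self.initialized = defaultdict(bool)

    def update(self, game: str, role: str, episode_return: float) -> None:
        key = (game, role)
        if not self.initialized[key]:
            self.baselines[key] = episode_return
            self.initialized[key] = True
        else:
            self.baselines[key] = (
                self.decay * self.baselines[key]
                + (1.0 - self.decay) * episode_return
            )

    def advantage(self, game: str, role: str, episode_return: float) -> float:
        # Advantage = return minus the role- and game-specific baseline, so a
        # win as Player 0 is never compared against Player 1's statistics.
        return episode_return - self.baselines[(game, role)]


# Usage: after each self-play episode, update the baseline for the role that
# was played and compute the advantage used to weight the policy-gradient loss.
rae = RoleConditionedAdvantage()
rae.update("kuhn_poker", "player_0", episode_return=1.0)
adv = rae.advantage("kuhn_poker", "player_0", episode_return=1.0)
```

Separating baselines by role matters because the two roles in a zero-sum game have systematically different return distributions; a single shared baseline would bias the advantage for one side.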
The training process is formulated as a turn-level Markov Decision Process (MDP), where each action corresponds to a complete multi-token response, rather than a single token. Both players share a single policy network, with role conditioning implemented via system prompts. This shared-parameter approach ensures that as the model improves in one role, it faces a correspondingly stronger opponent, generating an automatic and continually evolving curriculum.
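A simplified sketch of this shared-policy, role-conditioned self-play loop is shown below. The `policy_generate` function, the `game` interface (`reset`, `step`, `final_returns`, etc.), and the prompt wording are hypothetical stand-ins for the framework's actual components, included only to make the turn-level formulation concrete.

```python
def play_episode(policy_generate, game, rae):
    """Collect one self-play episode as turn-level training examples.

    A single shared policy plays both roles, conditioned on a role-specific
    system prompt; each action is a complete multi-token response (turn-level
    MDP, not token-level). The game/API names here are illustrative assumptions.
    """
    observation = game.reset()
    turns = {role: [] for role in game.roles}     # per-role (prompt, response) pairs

    while not game.done():
        role = game.current_role()
        system_prompt = f"You are {role} playing {game.name}. Play to win."
        prompt = f"{system_prompt}\n{observation}"
        response = policy_generate(prompt)        # one full multi-token action
        observation = game.step(role, response)
        turns[role].append((prompt, response))

    # Zero-sum terminal rewards, e.g. {"player_0": +1.0, "player_1": -1.0}.
    training_examples = []
    for role, episode_return in game.final_returns().items():
        rae.update(game.name, role, episode_return)
        adv = rae.advantage(game.name, role, episode_return)
        # Every turn taken in this role is weighted by the same
        # role-conditioned advantage in a REINFORCE-style policy-gradient loss.
        for prompt, response in turns[role]:
            training_examples.append((prompt, response, adv))
    return training_examples
```

Because both roles are served by the same parameters, every policy update strengthens the opponent at the same time, which is exactly the mechanism that produces the automatic curriculum.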
Empirical Results
The empirical evaluation is comprehensive, spanning both in-domain and out-of-domain generalization. Notably, SPIRAL-trained models, starting from Qwen3-4B-Base and trained solely on Kuhn Poker, achieve improvements in the 8–18% range across mathematical and general reasoning benchmarks.
These gains are achieved without exposure to any mathematical content or domain-specific data during training, and surpass models fine-tuned on 25,000 expert game trajectories. The transfer is robust, with improvements observed across MATH500, AIME, OlympiadBench, AMC, Minerva Math, GPQA, and MMLU-Pro.
Further, multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) yields synergistic benefits, with the multi-game model outperforming single-game specialists on both training and out-of-distribution games. Application to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still produces a 2.0% average improvement, indicating the generality of the approach.
Analysis of Reasoning Transfer
A detailed analysis using LLM-as-judge methodology reveals that the reasoning skills acquired through self-play transfer to academic problem-solving via three cognitive patterns:
- Case-by-Case Analysis: Systematic enumeration of scenarios, highly transferable across domains.
- Expected Value Calculation: Probabilistic reasoning, with selective transfer to math problems involving uncertainty.
- Pattern Recognition: Identification of regularities, with amplification observed in mathematical domains.
The evolution of these patterns is tracked across training checkpoints, showing that competitive self-play not only increases their frequency in game contexts but also induces their emergence in mathematical reasoning tasks.
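One way such tracking could be implemented is with a judge model that labels each reasoning trace for the three patterns and aggregates frequencies per checkpoint. The sketch below is a hypothetical reconstruction: the prompt text, the pattern labels, and the `query_judge_model` helper are assumptions, not the paper's exact protocol.

```python
import json

# Pattern labels assumed from the three categories described above.
PATTERNS = ["case_by_case_analysis", "expected_value_calculation", "pattern_recognition"]

JUDGE_PROMPT = """You are grading a model's reasoning trace.
For each of the following patterns, answer true or false depending on whether
the trace exhibits it: {patterns}.
Respond with a JSON object mapping each pattern name to a boolean.

Reasoning trace:
{trace}
"""

def classify_trace(trace: str, query_judge_model) -> dict[str, bool]:
    """Ask a judge LLM which cognitive patterns appear in one reasoning trace."""
    prompt = JUDGE_PROMPT.format(patterns=", ".join(PATTERNS), trace=trace)
    raw = query_judge_model(prompt)        # assumed to return the judge's text reply
    labels = json.loads(raw)
    return {p: bool(labels.get(p, False)) for p in PATTERNS}

def pattern_frequencies(traces: list[str], query_judge_model) -> dict[str, float]:
    """Fraction of traces exhibiting each pattern, tracked per training checkpoint."""
    counts = {p: 0 for p in PATTERNS}
    for trace in traces:
        for p, present in classify_trace(trace, query_judge_model).items():
            counts[p] += int(present)
    total = max(len(traces), 1)
    return {p: counts[p] / total for p in PATTERNS}
```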
Ablation and Curriculum Analysis
Ablation studies underscore the necessity of RAE for stable training. Without RAE, models exhibit "thinking collapse," truncating reasoning traces and converging to degenerate policies that abandon structured thought. This leads to catastrophic drops in reasoning benchmark performance and unstable gradient dynamics.
Comparisons with fixed-opponent training (random, Mistral, Gemini) highlight the superiority of self-play's automatic curriculum. Fixed opponents either induce collapse (random opponents) or encourage overfitting to static strategies (model-based opponents), whereas self-play maintains adaptive difficulty, preventing exploitation and promoting continual improvement.
Implications and Future Directions
The findings have several important implications:
- Autonomous Reasoning Development: SPIRAL demonstrates that LLMs can autonomously develop general reasoning skills through environmental challenge, without human supervision or domain-specific data.
- Game Environments as Reasoning Gymnasiums: Different games cultivate distinct cognitive abilities (spatial, probabilistic, strategic), and their combination yields more robust, transferable reasoning.
- Scalability and Generalization: The approach is scalable to larger models and diverse games, with evidence of transfer to both unseen games and academic benchmarks.
However, the approach is not without limitations. The reliance on designed game environments, substantial computational requirements (8 H100 GPUs for 25 hours per experiment), and the focus on academic rather than real-world reasoning tasks are noted constraints. Performance plateaus after extended training, and the transfer to domains requiring common sense or ethical judgment remains untested.
Future research directions include:
- Extending to cooperative and partially observable games
- Designing environments targeting specific reasoning weaknesses
- Scaling to more complex, open-ended environments
- Investigating the mechanisms underlying the emergence of transferable reasoning
Conclusion
SPIRAL provides compelling evidence that self-play in zero-sum games can serve as a powerful, scalable mechanism for developing general reasoning in LLMs. By leveraging competitive dynamics and multi-agent interaction, the framework eliminates the need for human supervision and curated data, instead relying on the intrinsic structure of games to drive cognitive skill acquisition. This paradigm has the potential to shift the development of reasoning in AI from supervised learning to autonomous, environment-driven curricula, with broad implications for the future of artificial general intelligence.