Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Published 19 Aug 2025 in cs.CL | (2508.14029v1)

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training LLMs, particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.


Authors (7)

Summary

The paper introduces a novel self-play strategy using variational problem synthesis to maintain policy entropy during RLVR training.
It generates diverse training datasets by synthesizing new problems from correct model solutions to enhance reasoning capability.
Empirical results show substantial Pass@k gains over standard RLVR, including absolute Pass@32 improvements of 18.3% on AIME24 and 22.8% on AIME25.


Introduction

The paper presents "Self-Play with Variational Problem Synthesis" (SvS), an approach designed to improve Reinforcement Learning with Verifiable Rewards (RLVR) when training LLMs for complex reasoning tasks. Standard RLVR improves Pass@1 but often reduces policy entropy, which lowers generation diversity and limits Pass@k improvements. SvS aims to sustain diversity during training and improve Pass@k by enriching the training data with variational problems generated from the model's own correct solutions.

Methodology

The SvS strategy involves a cyclical training process comprising three main stages: original problem solving, variational problem synthesis, and synthetic problem solving.

Original Problem Solving: The policy attempts to solve problems from the original dataset. Problems that prove challenging for the policy are identified, and the policy's correct solutions to them serve as seeds for generating variational problems.
Variational Problem Synthesis: Correct solutions to challenging problems are used to synthesize new problems that vary structurally while keeping the same reference answers as the originals. Because the answers are unchanged, the synthesized problems add data diversity without requiring additional answer labeling or verification.
Synthetic Problem Solving: The policy solves these synthesized problems, promoting exploration and maintaining policy entropy.
Figure 1: The data workflow of our SvS in a training iteration, comprising original problem solving, variational problem synthesis, synthetic problem solving, and policy update data filtering.
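The three stages above can be sketched as a single training iteration. This is a minimal illustration, not the authors' implementation: `solve` and `synthesize` are hypothetical callables standing in for the policy, `Sample` is an illustrative container, and `max_seed_accuracy` is an assumed cutoff for what counts as "challenging". The sketch omits the policy update and the data filtering step shown in Figure 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    kind: str       # "original" or "synthetic"
    prompt: str
    response: str
    reward: float   # 1.0 if the final answer matches the reference, else 0.0


def svs_iteration(
    problems: List[Tuple[str, str]],           # (problem text, reference answer)
    solve: Callable[[str], Tuple[str, str]],   # hypothetical: problem -> (solution, final answer)
    synthesize: Callable[[str, str], str],     # hypothetical: (problem, solution) -> new problem
    n_samples: int = 4,
    max_seed_accuracy: float = 0.5,            # assumed cutoff for "challenging"
) -> List[Sample]:
    batch: List[Sample] = []
    for text, answer in problems:
        # Stage 1: original problem solving, scored against the reference answer.
        attempts = [solve(text) for _ in range(n_samples)]
        for sol, pred in attempts:
            batch.append(Sample("original", text, sol, float(pred == answer)))

        correct = [sol for sol, pred in attempts if pred == answer]
        accuracy = len(correct) / n_samples
        if not correct or accuracy > max_seed_accuracy:
            continue  # never solved, or too easy: not used as a seed in this sketch

        # Stage 2: synthesize a variant from one correct solution.
        variant = synthesize(text, correct[0])

        # Stage 3: solve the variant; it reuses the original reference answer,
        # so no new label has to be produced or verified.
        for sol, pred in (solve(variant) for _ in range(n_samples)):
            batch.append(Sample("synthetic", variant, sol, float(pred == answer)))
    return batch
```

The key design point this illustrates is that the synthetic problem is graded against the original reference answer, which is what lets the loop run online without an external labeler.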

Additionally, reward shaping is employed to ensure these problems remain challenging and accurately assess the policy’s reasoning capabilities.
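The summary does not spell out the shaping rule, so the following is only one plausible form, stated as an assumption: a synthesized problem earns a reward when the policy's accuracy on it falls inside a moderate band, so that trivial and unsolvable variants are both discouraged. The function name and the `low` and `high` thresholds are hypothetical placeholders, not values from the paper.

```python
def synthesis_reward(accuracy: float, low: float = 0.125, high: float = 0.625) -> float:
    """Reward for a synthesized problem given the policy's accuracy on it.

    Assumed rule: reward 1.0 only when the problem is neither too easy
    (accuracy above `high`) nor apparently unsolvable (accuracy below `low`).
    The thresholds are illustrative placeholders.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return 1.0 if low <= accuracy <= high else 0.0
```

For example, under these placeholder thresholds a variant solved in 4 of 8 attempts would be rewarded, while one solved in 8 of 8 or 0 of 8 attempts would not.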

Results

Experiments on LLMs ranging from 3B to 32B parameters across 12 reasoning benchmarks show that SvS consistently outperforms standard RLVR, with absolute Pass@32 gains of 18.3% and 22.8% on the AIME24 and AIME25 benchmarks, respectively.

Figure 2: Superior efficiency and effectiveness of SvS strategy on competition-level AIME benchmarks.

The SvS framework sustains entropy throughout training (Figure 3) and enhances Pass@k performance even as $k$ increases to 1024 (Figure 4), indicating an expansion of the model’s reasoning boundaries.

Figure 3: Policy entropy trajectories during training, displaying stability with SvS strategy across models.

Figure 4: Scaled-up Pass@k performance, showcasing significant gains on benchmarks with larger maximum response tokens.
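For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled responses is correct. The summary does not say how it is estimated; a common choice, assumed here, is the unbiased estimator computed from n samples of which c are correct: 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct (requires k <= n)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every size-k subset contains a correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of correct samples, c / n, which is why Pass@1 tracks average accuracy while larger k rewards having at least a few correct responses among diverse attempts.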

Analysis

SvS keeps policy entropy stable during RLVR training, which sustains exploration and mitigates the entropy collapse seen with standard RLVR. This sustained entropy coincides with the Pass@k gains reported above, consistent with the policy continuing to explore a wider range of solutions.
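As a point of reference for what "policy entropy" measures, the sketch below computes the Shannon entropy of next-token distributions and averages it over generation steps. This is a generic definition, not the paper's logging code; the function names are illustrative, and real training code would work on logits from the model rather than plain probability lists.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over the generation steps of a response."""
    if not step_distributions:
        raise ValueError("need at least one step")
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)
```

A uniform distribution over the vocabulary gives the maximum value and a one-hot distribution gives zero, so a trajectory that falls toward zero indicates the policy is becoming nearly deterministic.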

Conclusion

The self-play paradigm of SvS augments training problems online, without external guidance, and effectively enhances RLVR training of LLMs. The strategy consistently improves problem-solving metrics across model scales and benchmarks while increasing problem diversity and keeping the policy training on challenging problems. Future work could integrate SvS with other RLVR algorithms to test its generalizability and its potential in other reasoning-intensive domains.
