Self-Play with Variational Synthesis (SvS)

Updated 27 November 2025
  • Self-play with Variational Synthesis (SvS) is a method that synthesizes answer-equivalent problem variants to counter entropy collapse in RLVR.
  • It leverages self-generated challenges through policy self-play and controlled difficulty criteria to enhance diversity in large language model training.
  • Experimental results show significant improvements in Pass@1 and Pass@32 metrics across competitive mathematical benchmarks compared to standard RLVR.

Self-play with Variational Synthesis (SvS) is a training methodology developed to address entropy collapse and generation diversity loss in Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs on complex reasoning tasks. By actively synthesizing new training problems through policy self-play and careful answer-preserving perturbations, SvS sustains exploration during training and achieves consistently superior performance, particularly in high-$k$ Pass@$k$ regimes across competition-level mathematics benchmarks (Liang et al., 19 Aug 2025).

1. Motivation and Theoretical Preliminaries

RLVR is a paradigm for post-training adaptation of LLMs, in which exploration is incentivized via verifiable correctness signals as rewards. While standard RLVR significantly boosts Pass@1, it frequently induces a rapid decay in the model’s policy entropy. This entropy collapse reduces diversity in generated solutions, ultimately capping the effective Pass@$k$ performance, which reflects the upper bound of model reasoning capability. SvS targets this collapse by algorithmically augmenting and updating the pool of training problems, alternating between original data and a structured set of self-generated, answer-equivalent variants that challenge and diversify the policy.
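
For concreteness, Pass@$k$ is usually measured with the standard unbiased combinatorial estimator over $n \geq k$ sampled solutions per problem; the sketch below follows that common formulation and is not code from the SvS paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate given n sampled solutions, c of which are correct."""
    if n - c < k:          # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 samples per problem, 4 verified correct, evaluated at k = 8
print(pass_at_k(n=32, c=4, k=8))   # ~0.70
```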

Let $\pi_\theta$ denote the LLM policy with parameters $\theta$, $\pi_{\mathrm{ref}}$ a reference model for regularization, and $\mathcal{B}$ an experience buffer of tuples $(p, y, R)$ where $p$ is a prompt, $y$ a sampled response, and $R$ a scalar reward. Policy updates in SvS minimize a regularized, advantage-weighted surrogate loss, described in Section 3.
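
As a small illustration of what the buffer holds (field names are hypothetical, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Experience:
    prompt: str    # p: an original or synthesized problem
    response: str  # y: a sampled solution (or a synthesis output)
    reward: float  # R: verifiable correctness signal, e.g. in {0, 1}

buffer: list[Experience] = []  # B, flushed after every policy update
```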

2. Variational Problem Synthesis Mechanism

A central operation in SvS is variational problem synthesis, which leverages the model’s correct responses to transform under-performing source problems into answer-equivalent, moderately difficult variants. Given an original problem-answer pair $(x, a)$ and a successful solution $y_i$ (i.e., $\mathrm{Extract}(y_i) = a$), a synthesis function $\mathcal{S}$, realized by prompting the LLM itself, produces candidate variants:

$$\{\hat{x}^j\}_{j=1}^{G_v} = \mathcal{S}(y_i, a)$$

Here, $G_v$ is the number of candidates synthesized per correct solution. Acceptance of a candidate $\hat{x}$ is governed by two criteria:

  • Answer Preservation: $\mathrm{Extract}(\mathrm{Solve}(\hat{x})) = a$, ensuring answer equivalence to the progenitor problem;
  • Moderate Difficulty: $\widehat{\mathrm{acc}}_\ell \leq \mathrm{Acc}(\hat{x}, a) \leq \widehat{\mathrm{acc}}_h$, limiting acceptance to variants that are neither trivial nor unsolvable for the current policy, where $\mathrm{Acc}(\hat{x}, a)$ is computed empirically by sampling $G$ solutions.

Only those $\hat{x}$ falling within these difficulty bounds are incorporated into the training buffer, thereby focusing the augmentation on meaningful and pedagogically valuable diversity (Liang et al., 19 Aug 2025).
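
A schematic sketch of this filtering step, where `synthesize_variants`, `solve`, and `extract_answer` are hypothetical stand-ins for the LLM-driven operators $\mathcal{S}$, $\mathrm{Solve}$, and $\mathrm{Extract}$:

```python
def filter_variants(policy, y_correct, answer, G=8, G_v=8,
                    acc_lo=0.125, acc_hi=0.625):
    """Keep only answer-preserving variants of moderate empirical difficulty."""
    accepted = []
    for variant in synthesize_variants(policy, y_correct, answer, n=G_v):
        # Empirical accuracy of the current policy on the variant,
        # measured against the *original* answer a.
        solutions = [solve(policy, variant) for _ in range(G)]
        acc = sum(extract_answer(s) == answer for s in solutions) / G
        # acc > 0 means answer preservation is observed empirically;
        # the bounds exclude variants that are trivial or unsolvable.
        if acc_lo <= acc <= acc_hi:
            accepted.append(variant)
    return accepted
```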

3. Core Training Algorithm and Loss Construction

The online SvS training loop maintains a mixed buffer of original-problem, synthetic-problem-solving, and synthesis samples. Correct solutions produced by the policy $\pi_\theta$ on moderately challenging instances seed the generative loop for new problems. The detailed algorithm executes as follows (a condensed code sketch follows the list):

  1. Original Problem Solving: For each batch, sample $G$ solutions per problem, recording correctness indicators $r_i$.
  2. Challenging Problem Selection: For problems with observed accuracy $0 < \mathrm{Acc}(x) < 1$, retain the experience; if $\mathrm{acc}_\ell < \mathrm{Acc}(x) < \mathrm{acc}_h$, proceed with synthesis.
  3. Variational Synthesis: For all correct solutions, synthesize $G_v$ new problems as described above.
  4. Synthetic Problem Solving: Evaluate each generated $\hat{x}$ by sampling $G$ further solutions; reward both synthetic solves and synthesis events when the acceptance criteria are met.
  5. Policy Update: Apply one gradient step using the buffer and then flush.
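
A condensed sketch of one iteration (helper names such as `sample_solutions` and `filter_variants` are hypothetical placeholders for the operations described above; the difficulty bounds follow the configuration in Section 5):

```python
def svs_training_step(policy, batch, G=8, acc_lo=0.125, acc_hi=0.5):
    """One SvS iteration: solve originals, synthesize variants on moderately
    hard problems, solve the accepted variants, then update and flush."""
    buffer = []
    for problem, answer in batch:
        solutions = sample_solutions(policy, problem, n=G)
        rewards = [float(extract_answer(y) == answer) for y in solutions]
        acc = sum(rewards) / G
        if 0.0 < acc < 1.0:                   # informative original experience
            buffer += [(problem, y, r) for y, r in zip(solutions, rewards)]
        if acc_lo < acc < acc_hi:             # moderately challenging: synthesize
            for y, r in zip(solutions, rewards):
                if r == 1.0:
                    for variant in filter_variants(policy, y, answer):
                        # reward both the synthesis event and the solves
                        # of the accepted variant (both enter the buffer)
                        buffer += score_synthesis_and_solves(policy, variant, answer, n=G)
    update_policy(policy, buffer)             # one gradient step, then the buffer is flushed
```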

The surrogate loss for policy update is:

$$\mathcal{J}_{\mathrm{SvS}}(\theta) = \mathbb{E}_{(p, \{y_i\}) \sim \mathcal{B}}\,\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Bigl[\min\bigl(k_{i,t}\,A_{i,t},\ \mathrm{clip}(k_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,A_{i,t}\bigr) - \beta\, D_{\mathrm{KL}}\bigl[\pi_\theta(\cdot \mid p, y_{i,<t}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid p, y_{i,<t})\bigr]\Bigr]$$

with normalized token-level advantages $A_{i,t}$ and importance ratios $k_{i,t}(\theta)$ defined as:

$$A_{i,t} = \frac{r_i - \bar{r}}{\mathrm{std}(\{r_i\})}, \qquad k_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid p, y_{i,<t})}{\pi_{\mathrm{old}}(y_{i,t} \mid p, y_{i,<t})}$$

The total loss is:

$$\mathcal{L}(\theta) = -\mathcal{J}_{\mathrm{SvS}}(\theta)$$

An optional entropy bonus:

$$\mathcal{L}_{\mathrm{ent}}(\theta) = -\lambda_H\, \mathbb{E}_{p,\, y \sim \pi_\theta}\bigl[-\log \pi_\theta(y \mid p)\bigr]$$

may also be included to further preserve exploration (Liang et al., 19 Aug 2025).
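
A minimal PyTorch-style sketch of the objective above (the tensor shapes, the single-sample KL estimate, and the variable names are assumptions for illustration, not the authors' implementation):

```python
import torch

def group_advantages(rewards):
    """A_{i,t}: group-normalized rewards, broadcast to every token of response i."""
    r = torch.as_tensor(rewards, dtype=torch.float32)               # [G]
    return ((r - r.mean()) / r.std().clamp(min=1e-6)).unsqueeze(1)  # [G, 1]

def svs_loss(logp_new, logp_old, logp_ref, advantages, mask,
             eps=0.2, beta=0.01, lambda_h=0.0):
    """Token-level clipped surrogate with KL regularization toward pi_ref.
    logp_* and mask have shape [G, T]; mask zeros out padding tokens."""
    ratio = torch.exp(logp_new - logp_old)                       # k_{i,t}
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    kl = logp_new - logp_ref      # simple single-sample KL estimate (assumption)
    per_token = surrogate - beta * kl
    seq_len = mask.sum(dim=1).clamp(min=1)
    objective = ((per_token * mask).sum(dim=1) / seq_len).mean()  # J_SvS
    entropy = ((-logp_new * mask).sum(dim=1) / seq_len).mean()    # optional bonus term
    return -(objective + lambda_h * entropy)                      # L = -J (minus entropy bonus)
```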

4. Monitoring Diversity and Entropy

To track diversity collapse, SvS monitors average token-level entropy:

$$H(\pi_\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)}\left[ -\frac{1}{|y|} \sum_{t=1}^{|y|} \sum_{v \in V} \pi_\theta(v \mid x, y_{<t}) \log \pi_\theta(v \mid x, y_{<t}) \right]$$

Sample-based trajectories are used to estimate entropy empirically. The online variational augmentation characteristic of SvS maintains this entropy nearly constant over training, in contrast to vanilla RLVR, which typically exhibits a monotonic decay, indicating exploration collapse. This stabilization is critical for robust Pass@$k$ performance and solution diversity (Liang et al., 19 Aug 2025).
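
A sketch of how this average can be estimated from the logits of sampled trajectories (assuming a standard softmax output head; this is illustrative rather than the paper's monitoring code):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> float:
    """Length-normalized token entropy averaged over sampled trajectories.
    logits: [B, T, V] pre-softmax scores; mask: [B, T] marking valid tokens."""
    logp = F.log_softmax(logits, dim=-1)
    token_entropy = -(logp.exp() * logp).sum(dim=-1)                 # [B, T]
    per_seq = (token_entropy * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean().item()
```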

5. Experimental Configuration and Quantitative Assessment

Key hyperparameters for SvS include batch size 256, group size $G = 8$ solves per problem, $G_v = 8$ variational candidates per solution, learning rate $1 \times 10^{-6}$, and sampling temperature 1.0. Problems with base accuracy in $[12.5\%, 50\%]$ are targeted for synthesis, with generated variants accepted if their empirical solve rate falls in $[12.5\%, 62.5\%]$. Training is conducted for up to 600 steps on competitive benchmarks (Liang et al., 19 Aug 2025).
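
For reference, the reported configuration can be collected in one place (key names are illustrative):

```python
svs_config = {
    "batch_size": 256,
    "group_size_G": 8,                  # solutions sampled per problem
    "variants_per_solution_Gv": 8,      # synthesis candidates per correct solution
    "learning_rate": 1e-6,
    "sampling_temperature": 1.0,
    "synthesis_source_acc_range": (0.125, 0.50),   # problems eligible for synthesis
    "variant_accept_acc_range": (0.125, 0.625),    # accepted variant solve rates
    "max_training_steps": 600,
}
```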

Performance improvements are consistently observed across a wide range of reasoning benchmarks and model sizes (3B to 32B):

| Benchmark | RLVR Pass@32 | SvS Pass@32 | Δ (SvS − RLVR) |
|---|---|---|---|
| AIME24 | 52.5 | 70.8 | +18.3 |
| AIME25 | 42.4 | 65.2 | +22.8 |
| BAIME | 35.9 | 45.9 | +10.0 |
| Math24o | 71.2 | 76.5 | +5.3 |
| Avg | 44.6 | 53.1 | +8.5 |

Corresponding Pass@1 improvements, averaged over 32 samples:

| Benchmark | RLVR Pass@1 | SvS Pass@1 | Δ (SvS − RLVR) |
|---|---|---|---|
| AIME24 | 28.8 | 39.3 | +10.5 |
| AIME25 | 30.0 | 40.5 | +10.5 |
| BAIME | 14.0 | 19.2 | +5.2 |
| Math24o | 39.6 | 44.1 | +4.5 |
| Avg | 22.5 | 27.9 | +5.4 |

Across 12 benchmarks and multiple model sizes, SvS averts the entropy collapse seen in standard RLVR, yielding approximately three-point average gains in Pass@1 and up to 22.8 points in Pass@32 on the hardest benchmarks (Liang et al., 19 Aug 2025).

6. Generalization, Robustness, and Implications

SvS demonstrates generalizability and robustness across benchmarks and model scales. Its self-improving, self-play architecture provides a systematic mechanism for training large LLM policies to retain exploration, thereby maximizing the effective reasoning capacity as measured by Pass@$k$. A plausible implication is that answer-preserving variational synthesis may serve as a foundational tool for adaptive curriculum construction in RLVR settings, particularly in domains where overfitting to narrow solution spaces or entropy collapse is a recurrent problem.

7. Directions for Investigation and Limitations

While SvS addresses the core problem of entropy decay in RLVR and achieves sustained improvements in high-$k$ metrics, further research could explore extensions to domains outside mathematical reasoning, alternative synthesis acceptance criteria, and potential trade-offs in the computational overhead of online variant generation. The method’s reliance on answer extractors and empirical accuracy bounds suggests potential limitations in settings with noisy reward signals or ambiguous problem-answer pairs (Liang et al., 19 Aug 2025).

References (1)
