Variational Problem Synthesis for LLMs
- Variational Problem Synthesis (SvS) is a framework that generates diverse, answer-aligned problems from correct solutions to sustain exploration in LLM training.
- The methodology integrates RL with a conditional variation generator to enhance performance metrics like Pass@k while preserving reference answers.
- Empirical results show significant benchmark improvements, sustained policy entropy, and extended model reasoning capabilities during training.
Variational Problem Synthesis (SvS) is a self-improvement framework for training LLMs via Reinforcement Learning with Verifiable Rewards (RLVR), designed to sustain policy entropy and improve generative diversity, especially for complex reasoning tasks. SvS augments RLVR training by synthesizing new "variational problems" using the policy’s own correct solutions, ensuring the reference answer remains unchanged while introducing surface-level and structural diversity. The methodology demonstrates significant improvements in Pass@k performance metrics and sustains model exploration throughout extended training cycles (Liang et al., 19 Aug 2025).
1. Formalization and Mathematical Framework
Let $\mathcal{D} = \{(x, a)\}$ denote the dataset of original problem–answer pairs, where $x$ is a problem and $a$ is its reference answer. The LLM policy $\pi_\theta$ (parameterized by $\theta$) generates a candidate solution $y$ to $x$, awarded a correctness reward $R_{\mathrm{c}}(y, a) = \mathbb{I}(\mathrm{Extract}(y) = a) \in \{0,1\}$, where $\mathrm{Extract}(\cdot)$ retrieves a canonical answer from the CoT solution (Liang et al., 19 Aug 2025).
SvS introduces a conditional variation generator $\pi_\theta(\tilde{x} \mid x, y)$, implemented as a policy call that produces candidate variational problems $\tilde{x}$ from a correct solution $y$. The synthetic problem must yield the same reference answer $a$ and is evaluated with a binary reward
$$R_{\mathrm{s}}(\tilde{x}) = \mathbb{I}\bigl(\mathrm{acc}(\tilde{x}) \in (\alpha, \beta)\bigr), \quad \text{where} \quad \mathrm{acc}(\tilde{x}) = \frac{1}{G}\sum_{i=1}^{G} \mathbb{I}\bigl(\mathrm{Extract}(\tilde{y}_i) = a\bigr),$$
with $\tilde{y}_1, \dots, \tilde{y}_G$ sampled solutions and $G$ the number of sample solutions per problem. Only synthetic problems falling into a moderate-accuracy window $(\alpha, \beta)$ are selected.
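The two rewards above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the window bounds `low` and `high` are placeholders, since the numeric window is not reproduced here.

```python
from typing import List

def correctness_reward(extracted: str, reference: str) -> int:
    """R_c(y, a): 1 if the extracted answer matches the reference answer, else 0."""
    return int(extracted == reference)

def synthesis_reward(sample_answers: List[str], reference: str,
                     low: float = 0.0, high: float = 1.0) -> int:
    """R_s for a synthetic problem: 1 if group accuracy over the G sampled
    solutions lies strictly inside the moderate window (low, high), else 0."""
    g = len(sample_answers)
    acc = sum(ans == reference for ans in sample_answers) / g
    return int(low < acc < high)
```

With the default bounds, a synthetic problem is rewarded only if it is neither unsolvable nor trivially solved by all samples.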
The combined RLVR objective is augmented accordingly:
$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \log \pi_\theta(y_i \mid x)\right] - \beta\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr),$$
where $\hat{A}_i = \bigl(R_i - \mathrm{mean}(\{R_j\})\bigr)/\mathrm{std}(\{R_j\})$ is the group-normalized advantage and the KL term penalizes deviation from the reference model (Liang et al., 19 Aug 2025).
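The group-normalized advantage can be computed as below; the function name is ours, and the small `eps` stabilizer is a common convention rather than a detail taken from the source.

```python
import math
from typing import List

def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: center each reward by the group mean and scale
    by the group standard deviation (eps guards against zero variance)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because rewards are centered within each group, advantages always sum to (approximately) zero: correct solutions in a mixed group are pushed up, incorrect ones pushed down.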
2. Workflow and Algorithmic Steps
The SvS framework interleaves original problem solving with variational synthesis. The operational procedure proceeds as follows:
- Original Problem Solving: For each pair $(x, a)$ in a sampled batch, generate $G$ candidate solutions, compute their correctness, and store transitions for further processing.
- Challenging Problem Identification: Problems with intermediate accuracy (solved by some but not all of the $G$ samples) are selected as candidates for synthesis.
- Variational Problem Synthesis: For each correct solution $y$ of a challenging problem, generate synthetic problems $\tilde{x}$ using the variation generator. Each $\tilde{x}$ is then scored for moderate difficulty based on synthetic-answer accuracy.
- Synthetic Problem Solving: Synthetic problems with non-trivial but non-perfect accuracy have their solution trajectories added to the replay buffer.
- Policy Update: Using Group Relative Policy Optimization (GRPO), the policy is updated from transitions accumulated in the buffer (Liang et al., 19 Aug 2025).
Key operational constraints are: synthesis is performed only using correct CoT solutions; reference answers are strictly preserved; only synthetic problems in the prescribed accuracy window are retained for effective learning signal.
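The steps above can be sketched as a single data-collection pass. This is a schematic, not the paper's implementation: the helper signatures, the window bounds, and the use of final answers as stand-ins for full CoT solutions are all simplifying assumptions.

```python
from typing import Callable, List, Tuple

def svs_step(
    batch: List[Tuple[str, str]],              # (problem, reference answer) pairs
    solve: Callable[[str, int], List[str]],    # samples G answers for a problem
    synthesize: Callable[[str], str],          # variation generator: correct solution -> new problem
    g: int = 4,
    lo: float = 0.25, hi: float = 0.75,        # illustrative "challenging" window (values assumed)
) -> List[Tuple[str, str, str]]:
    """One SvS data-collection step: solve originals, identify challenging
    problems, synthesize answer-preserving variants, and keep only variants of
    moderate difficulty. Returns (problem, answer, sample) replay transitions."""
    buffer: List[Tuple[str, str, str]] = []
    for x, a in batch:
        answers = solve(x, g)
        acc = sum(ans == a for ans in answers) / g
        buffer.extend((x, a, ans) for ans in answers)
        if lo <= acc <= hi:                    # challenging: intermediate accuracy
            for ans in answers:
                if ans != a:                   # synthesize only from correct solutions
                    continue
                x_new = synthesize(ans)        # reference answer a is preserved
                new_answers = solve(x_new, g)
                new_acc = sum(n == a for n in new_answers) / g
                if 0.0 < new_acc < 1.0:        # keep non-trivial, non-perfect variants
                    buffer.extend((x_new, a, n) for n in new_answers)
    return buffer
```

A GRPO update over the returned buffer would then complete one iteration of the loop.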
3. Entropy Dynamics and Exploration
A principal motivation for SvS is to mitigate the monotonic entropy collapse observed in standard RLVR training. Policy entropy at the token level is given by
$$\mathcal{H}(\pi_\theta) = -\,\mathbb{E}_{x,\, y_{<t}}\left[\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \log \pi_\theta(v \mid x, y_{<t})\right],$$
averaged over decoding positions $t$ and the vocabulary $\mathcal{V}$.
Under vanilla RLVR, average entropy decreases steadily, reducing solution diversity and harming Pass@k at large $k$. SvS, by dynamically augmenting the training problem distribution, sustains exploration by exposing the policy to a broader array of reasoning trajectories at each iteration.
Empirical results (Figure 1 in (Liang et al., 19 Aug 2025)) demonstrate that SvS stabilizes entropy at a positive plateau, rather than decaying to near zero. While a formal lower bound is not derived, the continual injection of novel, answer-aligned problems empirically keeps the support of the training distribution broad and maintains generative diversity.
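For concreteness, the token-level entropy and its average over a trajectory (the quantity reported as collapsing under vanilla RLVR) can be computed with a minimal pure-Python sketch:

```python
import math
from typing import List

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy H = -sum_v p(v) log p(v) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_distributions: List[List[float]]) -> float:
    """Average token-level entropy over the decoding steps of one trajectory."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)
```

A policy that becomes deterministic at every step drives this quantity to zero, which is exactly the collapse SvS is designed to counteract.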
4. Experimental Results and Quantitative Performance
SvS achieves pronounced and sustained improvements on established mathematical reasoning benchmarks. Empirical findings include:
| Benchmark | Pass@1 (RLVR / SvS) | Pass@32 (RLVR / SvS) | Absolute Gain (Pass@32) |
|---|---|---|---|
| AIME24 | 22.2% / 30.3% | 47.4% / 63.6% | +16.2% |
| AIME25 | 15.8% / 21.7% | 36.4% / 55.1% | +18.7% |
On the DAPO-17k dataset, Pass@32 gains reach +18.3% (AIME24) and +22.8% (AIME25). These improvements are robust across benchmarks and model sizes (3B–32B), and are sustained even as training is extended (Figures 1–3, 8 in (Liang et al., 19 Aug 2025)). Notably, whereas RLVR tends to plateau after approximately 450 steps, SvS continues to yield improvements, and evaluating Pass@k at large $k$ shows strengthened reasoning capability at the exploration frontier.
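Pass@k figures such as those above are conventionally computed with the standard unbiased estimator from $n$ samples of which $c$ are correct; the source does not spell out its estimator, so the formula below is the common convention rather than a confirmed detail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct:
    1 - C(n - c, k) / C(n, k). If fewer than k samples are wrong,
    every size-k subset contains a correct one, so the estimate is 1."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with one correct answer out of two samples, `pass_at_k(2, 1, 1)` is 0.5, matching the intuition that a single draw succeeds half the time.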
5. Implementation Characteristics
Problem and solution encodings employ token sequences compatible with the LLM’s tokenizer and utilize chat-style Chain-of-Thought (CoT) reasoning with explicit answer extraction. Representative training pipeline details include:
- Batch size: 256 problems per iteration
- Group size: $G$ sampled solutions per problem and a fixed number of variational syntheses per correct solution
- Selection windows: an intermediate-accuracy band for identifying challenging problems; a moderate-accuracy band for assigning positive synthesis reward
- Optimization: GRPO algorithm with Clip-Higher, token-level loss, sampling temperature $1.0$, and KL regularization toward the reference model
- Training steps: 300 for MATH-12k, 600 for DAPO-17k (Liang et al., 19 Aug 2025)
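The listed details can be gathered into a single configuration sketch; the key names are illustrative assumptions, and only values actually stated above are filled in.

```python
# Illustrative SvS training configuration; key names are assumed,
# values are those reported in the text above.
svs_config = {
    "batch_size": 256,                      # problems per iteration
    "optimizer": "GRPO",
    "clip_higher": True,                    # asymmetric PPO-style clipping
    "loss_granularity": "token",
    "temperature": 1.0,                     # sampling temperature
    "train_steps": {"MATH-12k": 300, "DAPO-17k": 600},
}
```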
The entire SvS loop is executed online, making synthesis and solution-grading tightly integrated with RLVR optimization.
6. Limitations, Open Directions, and Extensions
SvS’s effectiveness depends on the quality of underlying CoT solutions; if these are shallow or trivial, the resulting variational problems may lack diversity. Careful tuning of the reward-shaping windows is necessary to prevent information leakage or the creation of too-easy problems. On datasets with constrained answer formats (e.g., integer-only), SvS may overfit, slightly reducing generalization to open-ended settings.
Potential research directions include integrating SvS with alternative RL optimizers (PPO, DPO, Reinforce++), applying the approach to domains beyond mathematics (such as code synthesis and scientific QA), constructing adaptive accuracy-window curricula, and developing theoretical lower bounds on the sustained policy entropy under continual augmentation.
In summary, Variational Problem Synthesis establishes a lightweight, on-policy augmentation framework for RLVR. By leveraging the model’s own verified reasoning trajectories to synthesize novel, answer-aligned training data, SvS demonstrably sustains generative diversity, mitigates entropy collapse, and enables long-term improvements in Pass@k and exploration-oriented reasoning benchmarks (Liang et al., 19 Aug 2025).