Variational Problem Synthesis for LLMs
- Variational Problem Synthesis (SvS) is a framework that generates diverse, answer-aligned problems from correct solutions to sustain exploration in LLM training.
- The methodology integrates RL with a conditional variation generator to enhance performance metrics like Pass@k while preserving reference answers.
- Empirical results show significant benchmark improvements, sustained policy entropy, and extended model reasoning capabilities during training.
Variational Problem Synthesis (SvS) is a self-improvement framework for training LLMs via Reinforcement Learning with Verifiable Rewards (RLVR), designed to sustain policy entropy and improve generative diversity, especially for complex reasoning tasks. SvS augments RLVR training by synthesizing new "variational problems" using the policy’s own correct solutions, ensuring the reference answer remains unchanged while introducing surface-level and structural diversity. The methodology demonstrates significant improvements in Pass@k performance metrics and sustains model exploration throughout extended training cycles (Liang et al., 19 Aug 2025).
1. Formalization and Mathematical Framework
Let $\mathcal{D} = \{(x, a)\}$ denote the dataset of original problem–answer pairs, where $x$ is a problem and $a$ is its reference answer. The LLM policy $\pi_\theta$ (parameterized by $\theta$) generates a candidate solution $y$ to $x$, awarded a correctness reward $R_{\mathrm{c}}(y, a) = \mathbb{I}(\mathrm{Extract}(y) = a) \in \{0,1\}$, where $\mathrm{Extract}(\cdot)$ retrieves a canonical answer from the CoT solution (Liang et al., 19 Aug 2025).
SvS introduces a conditional variation generator $\pi_\theta(\tilde{x} \mid x, y)$, implemented as a policy call that produces candidate variational problems $\tilde{x}$ from a correct solution $y$. The synthetic problem must yield the same reference answer $a$ and is evaluated with a binary reward
$$R_{\mathrm{s}}(\tilde{x}) = \mathbb{I}\bigl(\mathrm{acc}(\tilde{x}) \in (\alpha, \beta)\bigr), \quad \text{where} \quad \mathrm{acc}(\tilde{x}) = \frac{1}{G}\sum_{i=1}^{G} \mathbb{I}\bigl(\mathrm{Extract}(\tilde{y}_i) = a\bigr),$$
with $\tilde{y}_1, \dots, \tilde{y}_G$ sampled solutions and $G$ the number of sample solutions per problem. Only synthetic problems falling into a moderate-accuracy window $(\alpha, \beta)$ are selected.
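The two rewards above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the window bounds `low` and `high` are placeholders, since the numeric window is not reproduced here.

```python
from typing import List

def correctness_reward(extracted: str, reference: str) -> int:
    """R_c(y, a): 1 if the extracted answer matches the reference answer, else 0."""
    return int(extracted == reference)

def synthesis_reward(sample_answers: List[str], reference: str,
                     low: float = 0.0, high: float = 1.0) -> int:
    """R_s for a synthetic problem: 1 if group accuracy over the G sampled
    solutions lies strictly inside the moderate window (low, high), else 0."""
    g = len(sample_answers)
    acc = sum(ans == reference for ans in sample_answers) / g
    return int(low < acc < high)
```

With the default bounds, a synthetic problem is rewarded only if it is neither unsolvable nor trivially solved by all samples.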
The combined RLVR objective is augmented accordingly:
$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \log \pi_\theta(y_i \mid x)\right] - \beta\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr),$$
where $\hat{A}_i = \bigl(R_i - \mathrm{mean}(\{R_j\})\bigr)/\mathrm{std}(\{R_j\})$ is the group-normalized advantage and the KL term penalizes deviation from the reference model (Liang et al., 19 Aug 2025).
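The group-normalized advantage can be computed as below; the function name is ours, and the small `eps` stabilizer is a common convention rather than a detail taken from the source.

```python
import math
from typing import List

def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: center each reward by the group mean and scale
    by the group standard deviation (eps guards against zero variance)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because rewards are centered within each group, advantages always sum to (approximately) zero: correct solutions in a mixed group are pushed up, incorrect ones pushed down.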
2. Workflow and Algorithmic Steps
The SvS framework interleaves original problem solving with variational synthesis. The operational procedure proceeds as follows:
- Original Problem Solving: For each pair $(x, a)$ in a sampled batch, generate $G$ candidate solutions, compute their correctness, and store transitions for further processing.
- Challenging Problem Identification: Problems with intermediate accuracy (solved by some but not all of the $G$ samples) are selected as candidates for synthesis.
- Variational Problem Synthesis: For each correct solution $y$ of a challenging problem, generate synthetic problems $\tilde{x}$ using the variation generator. Each $\tilde{x}$ is then scored for moderate difficulty based on synthetic-answer accuracy.
- Synthetic Problem Solving: Synthetic problems with non-trivial but non-perfect accuracy have their solution trajectories added to the replay buffer.
- Policy Update: Using Group Relative Policy Optimization (GRPO), the policy is updated from transitions accumulated in the buffer (Liang et al., 19 Aug 2025).
Key operational constraints are: synthesis is performed only using correct CoT solutions; reference answers are strictly preserved; only synthetic problems in the prescribed accuracy window are retained for effective learning signal.
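The steps above can be sketched as a single data-collection pass. This is a schematic, not the paper's implementation: the helper signatures, the window bounds, and the use of final answers as stand-ins for full CoT solutions are all simplifying assumptions.

```python
from typing import Callable, List, Tuple

def svs_step(
    batch: List[Tuple[str, str]],              # (problem, reference answer) pairs
    solve: Callable[[str, int], List[str]],    # samples G answers for a problem
    synthesize: Callable[[str], str],          # variation generator: correct solution -> new problem
    g: int = 4,
    lo: float = 0.25, hi: float = 0.75,        # illustrative "challenging" window (values assumed)
) -> List[Tuple[str, str, str]]:
    """One SvS data-collection step: solve originals, identify challenging
    problems, synthesize answer-preserving variants, and keep only variants of
    moderate difficulty. Returns (problem, answer, sample) replay transitions."""
    buffer: List[Tuple[str, str, str]] = []
    for x, a in batch:
        answers = solve(x, g)
        acc = sum(ans == a for ans in answers) / g
        buffer.extend((x, a, ans) for ans in answers)
        if lo <= acc <= hi:                    # challenging: intermediate accuracy
            for ans in answers:
                if ans != a:                   # synthesize only from correct solutions
                    continue
                x_new = synthesize(ans)        # reference answer a is preserved
                new_answers = solve(x_new, g)
                new_acc = sum(n == a for n in new_answers) / g
                if 0.0 < new_acc < 1.0:        # keep non-trivial, non-perfect variants
                    buffer.extend((x_new, a, n) for n in new_answers)
    return buffer
```

A GRPO update over the returned buffer would then complete one iteration of the loop.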
3. Entropy Dynamics and Exploration
A principal motivation for SvS is to mitigate the monotonic entropy collapse observed in standard RLVR training. Policy entropy at the token level is given by
$$\mathcal{H}(\pi_\theta) = -\,\mathbb{E}_{x,\, y_{<t}}\left[\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \log \pi_\theta(v \mid x, y_{<t})\right],$$
averaged over decoding positions $t$ and the vocabulary $\mathcal{V}$.
Under vanilla RLVR, average entropy decreases steadily, reducing solution diversity and harming Pass@k at large $k$. SvS, by dynamically augmenting the training problem distribution, sustains exploration by exposing the policy to a broader array of reasoning trajectories at each iteration.
Empirical results (Figure 1 in (Liang et al., 19 Aug 2025)) demonstrate that SvS stabilizes entropy at a positive plateau, rather than decaying to near zero. While a formal lower bound is not derived, the continual injection of novel, answer-aligned problems empirically keeps the support of the training distribution broad and maintains generative diversity.
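For concreteness, the token-level entropy and its average over a trajectory (the quantity reported as collapsing under vanilla RLVR) can be computed with a minimal pure-Python sketch:

```python
import math
from typing import List

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy H = -sum_v p(v) log p(v) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_distributions: List[List[float]]) -> float:
    """Average token-level entropy over the decoding steps of one trajectory."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)
```

A policy that becomes deterministic at every step drives this quantity to zero, which is exactly the collapse SvS is designed to counteract.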
4. Experimental Results and Quantitative Performance
SvS achieves pronounced and sustained improvements on established mathematical reasoning benchmarks. Empirical findings include:
| Benchmark | Pass@1 (RLVR / SvS) | Pass@32 (RLVR / SvS) | Absolute Gain (Pass@32) |
|---|---|---|---|
| AIME24 | 22.2% / 30.3% | 47.4% / 63.6% | +16.2% |
| AIME25 | 15.8% / 21.7% | 36.4% / 55.1% | +18.7% |
On the DAPO-17k dataset, Pass@32 gains reach +18.3% (AIME24) and +22.8% (AIME25). These improvements are robust across benchmarks and model sizes (3B–32B), and are sustained even as training is extended (Figures 1–3, 8 in (Liang et al., 19 Aug 2025)). Notably, whereas RLVR tends to plateau after approximately 450 steps, SvS continues to yield improvements, and evaluating Pass@k at large $k$ shows strengthened reasoning capability at the exploration frontier.
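Pass@k figures such as those above are conventionally computed with the standard unbiased estimator from $n$ samples of which $c$ are correct; the source does not spell out its estimator, so the formula below is the common convention rather than a confirmed detail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct:
    1 - C(n - c, k) / C(n, k). If fewer than k samples are wrong,
    every size-k subset contains a correct one, so the estimate is 1."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with one correct answer out of two samples, `pass_at_k(2, 1, 1)` is 0.5, matching the intuition that a single draw succeeds half the time.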
5. Implementation Characteristics
Problem and solution encodings employ token sequences compatible with the LLM’s tokenizer and utilize chat-style Chain-of-Thought (CoT) reasoning with explicit answer extraction. Representative training pipeline details include:
- Batch size: 256 problems per iteration
- Group size: $G$ sampled solutions per problem and a fixed number of variational syntheses per correct solution
- Selection windows: an intermediate-accuracy band for identifying challenging problems; a moderate-accuracy band for assigning positive synthesis reward
- Optimization: GRPO algorithm with Clip-Higher, token-level loss, sampling temperature $1.0$, and KL regularization toward the reference model
- Training steps: 300 for MATH-12k, 600 for DAPO-17k (Liang et al., 19 Aug 2025)
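The listed details can be gathered into a single configuration sketch; the key names are illustrative assumptions, and only values actually stated above are filled in.

```python
# Illustrative SvS training configuration; key names are assumed,
# values are those reported in the text above.
svs_config = {
    "batch_size": 256,                      # problems per iteration
    "optimizer": "GRPO",
    "clip_higher": True,                    # asymmetric PPO-style clipping
    "loss_granularity": "token",
    "temperature": 1.0,                     # sampling temperature
    "train_steps": {"MATH-12k": 300, "DAPO-17k": 600},
}
```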
The entire SvS loop is executed online, making synthesis and solution-grading tightly integrated with RLVR optimization.
6. Limitations, Open Directions, and Extensions
SvS’s effectiveness depends on the quality of underlying CoT solutions; if these are shallow or trivial, the resulting variational problems may lack diversity. Careful tuning of the reward-shaping windows is necessary to prevent information leakage or the creation of too-easy problems. On datasets with constrained answer formats (e.g., integer-only), SvS may overfit, slightly reducing generalization to open-ended settings.
Potential research directions include integrating SvS with alternative RL optimizers (PPO, DPO, Reinforce++), applying the approach to domains beyond mathematics (such as code synthesis and scientific QA), constructing adaptive accuracy-window curricula, and developing theoretical lower bounds on the sustained policy entropy under continual augmentation.
In summary, Variational Problem Synthesis establishes a lightweight, on-policy augmentation framework for RLVR. By leveraging the model’s own verified reasoning trajectories to synthesize novel, answer-aligned training data, SvS demonstrably sustains generative diversity, mitigates entropy collapse, and enables long-term improvements in Pass@k and exploration-oriented reasoning benchmarks (Liang et al., 19 Aug 2025).