SynthRL: Scalable Data Synthesis for RLVR
- SynthRL is a scalable pipeline that generates verified, challenging training examples by augmenting easy seed questions while preserving answer consistency.
- It uses a three-step process—difficulty-based seed selection, targeted question augmentation via VLMs, and stringent Monte Carlo-based verification—to ensure higher complexity.
- Empirical results show that SynthRL expands the seed dataset by 42% and improves model performance on visual reasoning benchmarks, demonstrating practical impact for RLVR training.
SynthRL refers to a scalable and verifiable pipeline for automatically synthesizing challenging data to improve reinforcement learning with verifiable reward (RLVR) in vision-language models (VLMs), particularly for visual mathematical reasoning (2506.02096). SynthRL addresses the need for high-quality, difficult, and label-consistent training data in reasoning-oriented RL settings—where annotation costs are high and foundation models benefit most from harder, verifiably correct examples. It enables training curricula that are both enriched with more complex samples and rigorously free of label noise.
1. Design and Core Stages of SynthRL
SynthRL is defined by three sequential processes for synthesizing training data with guaranteed answer correctness and increased difficulty:
- Difficulty-based Seed Selection:
- Selects “easy” questions from a labeled dataset—those which a target VLM policy answers correctly and reliably.
- Evaluation is conducted via Monte Carlo rollout: each question q is answered n times with stochastic decoding; the number of correct answers c(q) out of n measures how easily the model answers it.
- Targeted Question Augmentation (Synthesis):
- Each selected easy seed (image I, question q, answer a) is transformed into a more challenging question q′, conditioned to retain the same answer a.
- Augmentation is carried out by a powerful VLM (e.g., Gemini-2.5-Flash) acting as a synthesizer, given only the question and image without access to the ground-truth answer.
- Guaranteed Verification:
- Each candidate q′ is subject to strict model-based verification:
- Correctness: The target VLM (used as a verifier) must produce the original answer a for q′ in at least k of n rollouts, confirming answerability.
- Difficulty: q′ must be noticeably more difficult for the model to answer, i.e., its pass count (number of correct rollouts) must be reduced by at least a required margin relative to the original question's.
- Only quadruples (I, q, q′, a) meeting both criteria are admitted to the expanded training set.
This methodology guarantees that the synthesized samples are both reliably answerable and increase the demand for compositional or deeper reasoning.
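The seed-selection step can be sketched as a Monte Carlo pass-count estimate. This is a minimal illustration, not the paper's implementation: `toy_policy`, the rollout count `n = 8`, and the all-correct threshold are placeholder assumptions.

```python
import random

def pass_count(policy_answer, question, gold_answer, n=8, seed=0):
    """Estimate how easily a policy answers a question: run n stochastic
    rollouts and count how many outputs match the gold answer."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        # Stochastic decoding is modeled by a shared RNG passed to the policy.
        if policy_answer(question, rng) == gold_answer:
            correct += 1
    return correct

def is_easy(c, n=8, threshold=None):
    """A seed counts as 'easy' when the policy is (nearly) always correct;
    by default we require all n rollouts to be correct."""
    threshold = n if threshold is None else threshold
    return c >= threshold

# Toy stand-in for a VLM policy: answers "4" with 90% probability.
def toy_policy(question, rng):
    return "4" if rng.random() < 0.9 else "5"

c = pass_count(toy_policy, "What is 2 + 2?", "4", n=8)
```

Seeds whose pass count clears the threshold proceed to the augmentation stage; the rest are discarded.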
2. Technical Implementation Details
- Seed Selection Criterion:
Questions whose pass count c(q) meets a high threshold (e.g., correct on all n rollouts) are selected as easy seeds; the rollout count n and the pass threshold are pipeline hyperparameters.
- Augmentation Prompt Example:
```
Given an image and the following question, transform it into a
significantly more challenging version that requires deeper reasoning
but maintains the same answer.

Original Question: {question}

Your Response Format:
New Question: {Your transformed question}
```
- Verification Criteria: a candidate q′ is admitted only if the verifier reproduces the answer a in at least k of n rollouts (correctness) and its pass count drops by a required margin relative to the seed question (difficulty).
- Ablation and Quality Control:
- The pipeline performs multiple synthesis attempts for each seed.
- A supplementary LLM-based “judge” can flag redundant or low-quality augmentations, further filtering the candidate set.
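The two verification criteria reduce to a simple filter over rollout pass counts. The thresholds below (`k_min` correct rollouts, a pass-count drop of at least `delta`) are illustrative hyperparameters, not the paper's values:

```python
def verify_candidate(c_seed, c_new, k_min=2, delta=2):
    """Accept a synthesized question only if it is still answerable
    (correctness) and measurably harder than its seed (difficulty).

    c_seed: pass count of the original easy question over n rollouts
    c_new:  pass count of the synthesized question over the same n rollouts
    """
    correctness = c_new >= k_min             # verifier still reaches the answer
    difficulty = (c_seed - c_new) >= delta   # pass count dropped enough
    return correctness and difficulty

# A seed solved 8/8 that drops to 3/8 is accepted; one that drops to 0/8
# fails the correctness check; one that stays at 7/8 fails the difficulty check.
```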
3. Empirical Validation on Visual Math Reasoning
SynthRL was evaluated using the MMK12 dataset (8,072 open-ended visual math Q/A pairs):
- Data Expansion: 3,380 new, strictly verified, and more difficult questions were synthesized and added, yielding a 42% expansion.
- Difficulty Analysis: Synthesized questions exhibit a significantly lower mean pass rate and higher average reasoning step count (mean steps up by 33%), validating increased challenge and complexity.
- Training Protocol: Models were trained using Group Relative Policy Optimization (GRPO), a specific RLVR method.
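GRPO estimates advantages by normalizing each rollout's verifiable reward against its own sampling group, dispensing with a learned value model. A minimal sketch of the group-relative advantage computation (the normalization epsilon is an assumed detail; this is not the full training loop):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group Relative Policy Optimization advantage: standardize each
    rollout's reward against the mean/std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary verifiable rewards for one group of 4 rollouts of the same prompt:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary rewards the advantages are symmetric: correct rollouts get positive credit, incorrect ones equal negative credit.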
Performance on Out-of-Domain Benchmarks
SynthRL’s impact was measured on five rigorous visual reasoning benchmarks (MathVerse, MathVision, MathVista, WeMath, DynaMath):
| Data | MathVerse | MathVision | MathVista | WeMath | DynaMath | Average |
|---|---|---|---|---|---|---|
| MMK12 (seed) | 51.6 | 30.0 | 73.9 | 70.6 | 58.8 | 57.0 |
| +SynthRL | 53.5 | 29.6 | 74.2 | 72.6 | 60.1 | 58.0 |
- Models trained with the augmented dataset outperformed the seed-only baseline on all but one benchmark, with greatest improvements observed in the most challenging evaluation subsets.
- Gains increased with the scale of the seed set, suggesting strong benefits from larger and more diverse data pools.
Impact on Hard Samples
- SynthRL preferentially raised accuracy on “medium” and “hard” benchmark examples as determined by Elo/Bradley-Terry ranking, with improvements of +1.7% and +1.6% at scale, respectively.
- Accuracy on “easy” samples remained stable or, in some cases, slightly decreased due to the greater model focus on compositional reasoning.
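The Elo/Bradley–Terry difficulty ranking used to bucket benchmark items can be approximated by fitting per-item scores from pairwise outcomes via gradient ascent on the Bradley–Terry log-likelihood. This is a self-contained sketch with synthetic comparison data; the learning rate and epoch count are illustrative:

```python
import math

def fit_bradley_terry(pairs, n_items, lr=0.1, epochs=200):
    """Fit Bradley-Terry scores s_i from pairwise outcomes.
    pairs: list of (i, j) meaning item i 'beat' item j
    (e.g., question i was solved more often than question j)."""
    s = [0.0] * n_items
    for _ in range(epochs):
        for i, j in pairs:
            p = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))  # P(i beats j)
            g = 1.0 - p                                  # log-likelihood gradient
            s[i] += lr * g
            s[j] -= lr * g
    return s

# Item 0 consistently beats 1, and 1 beats 2 -> fitted scores are ordered.
scores = fit_bradley_terry([(0, 1), (1, 2), (0, 2)], n_items=3)
```

Items can then be bucketed into easy/medium/hard by score quantiles.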
4. Architectural Schematic
A schematic summarizing the SynthRL system:
| Stage | Input | Operation | Guarantee | Output |
|---|---|---|---|---|
| 1. Seed Selection | Labeled dataset, model policy | Monte Carlo rollouts; select “easy” samples | Reliable answerability | “Easy” (image, question, answer) tuples |
| 2. Synthesis | Selected (image, question, answer) triplets | VLM generates harder question q′ with same answer | Higher challenge | Candidate (image, q′, answer) |
| 3. Verification | Candidates, model policy | Rollout-based correctness and difficulty tests | Answerable, more difficult | Verified, harder (image, q′, answer) tuples |
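The three stages compose into a single filtering loop. The sketch below is a toy end-to-end version using stub models; the all-correct seed criterion, thresholds `k_min` and `delta`, and the stub policy/synthesizer are all assumptions for illustration:

```python
import random

def synthesize_dataset(seeds, answer_fn, augment_fn, n=8, k_min=2, delta=2):
    """Toy SynthRL-style loop: keep seeds the policy finds easy, ask a
    synthesizer for a harder variant, and admit it only if it verifies.
    answer_fn(question, rng) -> model answer (stub for the VLM policy)
    augment_fn(question)     -> harder question, same answer (stub synthesizer)
    """
    rng = random.Random(0)

    def passes(q, a):
        return sum(answer_fn(q, rng) == a for _ in range(n))

    out = []
    for q, a in seeds:
        c_seed = passes(q, a)
        if c_seed < n:            # stage 1: keep only "easy" seeds
            continue
        q_new = augment_fn(q)     # stage 2: targeted augmentation
        c_new = passes(q_new, a)  # stage 3: rollout-based verification
        if c_new >= k_min and (c_seed - c_new) >= delta:
            out.append((q, q_new, a))
    return out

# Stub policy: always solves "easy" questions, often fails "hard" ones.
def stub_answer(q, rng):
    if q.startswith("hard:"):
        return "A" if rng.random() < 0.45 else "B"
    return "A"

data = synthesize_dataset([("easy: q1", "A")], stub_answer,
                          lambda q: "hard: " + q)
```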
5. Significance and Broader Impacts
SynthRL advances RLVR for VLMs by supporting:
- Strictly verifiable, difficulty-controlled curriculum scaling, enabling models to develop more robust, compositional reasoning ability.
- Automated generation and verification of challenging data, minimizing human annotation costs and reducing label noise—a persistent problem in generative data augmentation.
- Practical curricula for reasoning over diverse inputs (images, equations, open-ended text), supporting transfer to broad, out-of-domain reasoning tasks.
This approach is shown to be robust across dataset sizes and is most effective when used to augment, not replace, core training data. A plausible implication is that as VLMs grow more powerful, such pipelines will be critical for pushing beyond current limitations in mathematical and multimodal reasoning.
6. Limitations and Directions for Future Research
- Current data synthesis is constrained by the performance and diversity of the synthesizer model; further gains may come from integrating multiple synthesizer architectures or models.
- Verification thresholds (minimum correct pass count, required increase in difficulty) may require retuning for different domains or levels of target model competence.
- Extending the approach to more complex modalities or domain-specific reasoning tasks, as well as to even larger datasets and more challenging benchmarks, is an outlined path for future research.
SynthRL establishes a general template for scalable, verifiably correct, and difficulty-progressive synthesis in reinforcement learning, with demonstrated utility in state-of-the-art visual mathematical reasoning models.