SynthRL: Scalable Data Synthesis for RLVR
- SynthRL is a scalable pipeline that generates verified, challenging training examples by augmenting easy seed questions while preserving answer consistency.
- It uses a three-step process—difficulty-based seed selection, targeted question augmentation via VLMs, and stringent Monte Carlo-based verification—to ensure higher complexity.
- Empirical results show that SynthRL expands the seed dataset by 42% and improves model performance on visual reasoning benchmarks, demonstrating practical impact for RLVR training.
SynthRL refers to a scalable and verifiable pipeline for automatically synthesizing challenging data to improve reinforcement learning with verifiable reward (RLVR) in vision-language models (VLMs), particularly for visual mathematical reasoning (2506.02096). SynthRL addresses the need for high-quality, difficult, and label-consistent training data in reasoning-oriented RL settings—where annotation costs are high and foundation models benefit most from harder, verifiably correct examples. It enables training curricula that are both enriched with more complex samples and rigorously free of label noise.
1. Design and Core Stages of SynthRL
SynthRL is defined by three sequential processes for synthesizing training data with guaranteed answer correctness and increased difficulty:
- Difficulty-based Seed Selection:
- Selects “easy” questions from a labeled dataset—those which a target VLM policy answers correctly and reliably.
- Evaluation is conducted via Monte Carlo rollout: each question q is answered n times with stochastic decoding; the number of correct answers c(q) out of n measures how easily the model answers it.
- Targeted Question Augmentation (Synthesis):
- Each selected easy seed (image I, question q, answer a) is transformed into a more challenging question q′, conditioned to retain the same answer a.
- Augmentation is carried out by a powerful VLM (e.g., Gemini-2.5-Flash) acting as a synthesizer, given only the question and image without access to the ground-truth answer.
- Guaranteed Verification:
- Each candidate q′ is subject to strict model-based verification:
- Correctness: The target VLM (used as a verifier) must produce the original answer a for q′ in at least k of n rollouts, confirming answerability.
- Difficulty: q′ must be noticeably more difficult for the model to answer, i.e., its pass count (number of correct rollouts) must be reduced by at least a required margin relative to the original question's.
- Only quadruples (I, q, q′, a) meeting both criteria are admitted to the expanded training set.
This methodology guarantees that the synthesized samples are both reliably answerable and increase the demand for compositional or deeper reasoning.
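The seed-selection step can be sketched as a Monte Carlo pass-count estimate. This is a minimal illustration, not the paper's implementation: `toy_policy`, the rollout count `n = 8`, and the all-correct threshold are placeholder assumptions.

```python
import random

def pass_count(policy_answer, question, gold_answer, n=8, seed=0):
    """Estimate how easily a policy answers a question: run n stochastic
    rollouts and count how many outputs match the gold answer."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        # Stochastic decoding is modeled by a shared RNG passed to the policy.
        if policy_answer(question, rng) == gold_answer:
            correct += 1
    return correct

def is_easy(c, n=8, threshold=None):
    """A seed counts as 'easy' when the policy is (nearly) always correct;
    by default we require all n rollouts to be correct."""
    threshold = n if threshold is None else threshold
    return c >= threshold

# Toy stand-in for a VLM policy: answers "4" with 90% probability.
def toy_policy(question, rng):
    return "4" if rng.random() < 0.9 else "5"

c = pass_count(toy_policy, "What is 2 + 2?", "4", n=8)
```

Seeds whose pass count clears the threshold proceed to the augmentation stage; the rest are discarded.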
2. Technical Implementation Details
- Seed Selection Criterion:
Questions whose pass count c(q) meets a high threshold (e.g., correct on all n rollouts) are selected as easy seeds; the rollout count n and the pass threshold are pipeline hyperparameters.
- Augmentation Prompt Example:
```
Given an image and the following question, transform it into a
significantly more challenging version that requires deeper reasoning
but maintains the same answer.

Original Question: {question}

Your Response Format:
New Question: {Your transformed question}
```
- Verification Criteria: a candidate q′ is admitted only if the verifier reproduces the answer a in at least k of n rollouts (correctness) and its pass count drops by a required margin relative to the seed question (difficulty).
- Ablation and Quality Control:
- The pipeline performs multiple synthesis attempts for each seed.
- A supplementary LLM-based “judge” can flag redundant or low-quality augmentations, further filtering the candidate set.
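The two verification criteria reduce to a simple filter over rollout pass counts. The thresholds below (`k_min` correct rollouts, a pass-count drop of at least `delta`) are illustrative hyperparameters, not the paper's values:

```python
def verify_candidate(c_seed, c_new, k_min=2, delta=2):
    """Accept a synthesized question only if it is still answerable
    (correctness) and measurably harder than its seed (difficulty).

    c_seed: pass count of the original easy question over n rollouts
    c_new:  pass count of the synthesized question over the same n rollouts
    """
    correctness = c_new >= k_min             # verifier still reaches the answer
    difficulty = (c_seed - c_new) >= delta   # pass count dropped enough
    return correctness and difficulty

# A seed solved 8/8 that drops to 3/8 is accepted; one that drops to 0/8
# fails the correctness check; one that stays at 7/8 fails the difficulty check.
```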
3. Empirical Validation on Visual Math Reasoning
SynthRL was evaluated using the MMK12 dataset (8,072 open-ended visual math Q/A pairs):
- Data Expansion: 3,380 new, strictly verified, and more difficult questions were synthesized and added, yielding a 42% expansion.
- Difficulty Analysis: Synthesized questions exhibit a significantly lower mean pass rate and higher average reasoning step count (mean steps up by 33%), validating increased challenge and complexity.
- Training Protocol: Models were trained using Group Relative Policy Optimization (GRPO), a specific RLVR method.
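GRPO estimates advantages by normalizing each rollout's verifiable reward against its own sampling group, dispensing with a learned value model. A minimal sketch of the group-relative advantage computation (the normalization epsilon is an assumed detail; this is not the full training loop):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group Relative Policy Optimization advantage: standardize each
    rollout's reward against the mean/std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary verifiable rewards for one group of 4 rollouts of the same prompt:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary rewards the advantages are symmetric: correct rollouts get positive credit, incorrect ones equal negative credit.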
Performance on Out-of-Domain Benchmarks
SynthRL’s impact was measured on five rigorous visual reasoning benchmarks (MathVerse, MathVision, MathVista, WeMath, DynaMath):
| Data | MathVerse | MathVision | MathVista | WeMath | DynaMath | Average |
|---|---|---|---|---|---|---|
| MMK12 (seed) | 51.6 | 30.0 | 73.9 | 70.6 | 58.8 | 57.0 |
| +SynthRL | 53.5 | 29.6 | 74.2 | 72.6 | 60.1 | 58.0 |
- Models trained with the augmented dataset outperformed the seed-only baseline on all but one benchmark, with greatest improvements observed in the most challenging evaluation subsets.
- Gains increased with the scale of the seed set, suggesting strong benefits from larger and more diverse data pools.
Impact on Hard Samples
- SynthRL preferentially raised accuracy on “medium” and “hard” benchmark examples as determined by Elo/Bradley-Terry ranking, with improvements of +1.7% and +1.6% at scale, respectively.
- Accuracy on “easy” samples remained stable or, in some cases, slightly decreased due to the greater model focus on compositional reasoning.
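The Elo/Bradley–Terry difficulty ranking used to bucket benchmark items can be approximated by fitting per-item scores from pairwise outcomes via gradient ascent on the Bradley–Terry log-likelihood. This is a self-contained sketch with synthetic comparison data; the learning rate and epoch count are illustrative:

```python
import math

def fit_bradley_terry(pairs, n_items, lr=0.1, epochs=200):
    """Fit Bradley-Terry scores s_i from pairwise outcomes.
    pairs: list of (i, j) meaning item i 'beat' item j
    (e.g., question i was solved more often than question j)."""
    s = [0.0] * n_items
    for _ in range(epochs):
        for i, j in pairs:
            p = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))  # P(i beats j)
            g = 1.0 - p                                  # log-likelihood gradient
            s[i] += lr * g
            s[j] -= lr * g
    return s

# Item 0 consistently beats 1, and 1 beats 2 -> fitted scores are ordered.
scores = fit_bradley_terry([(0, 1), (1, 2), (0, 2)], n_items=3)
```

Items can then be bucketed into easy/medium/hard by score quantiles.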
4. Architectural Schematic
A schematic summarizing the SynthRL system:
| Stage | Input | Operation | Guarantee | Output |
|---|---|---|---|---|
| 1. Seed Selection | Labeled dataset, model policy | Monte Carlo rollouts; select “easy” samples | Reliable answerability | “Easy” (image, question, answer) tuples |
| 2. Synthesis | Selected (image, question, answer) triplets | VLM generates harder question q′ with same answer | Higher challenge | Candidate (image, q′, answer) |
| 3. Verification | Candidates, model policy | Rollout-based correctness and difficulty tests | Answerable, more difficult | Verified, harder (image, q′, answer) tuples |
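The three stages compose into a single filtering loop. The sketch below is a toy end-to-end version using stub models; the all-correct seed criterion, thresholds `k_min` and `delta`, and the stub policy/synthesizer are all assumptions for illustration:

```python
import random

def synthesize_dataset(seeds, answer_fn, augment_fn, n=8, k_min=2, delta=2):
    """Toy SynthRL-style loop: keep seeds the policy finds easy, ask a
    synthesizer for a harder variant, and admit it only if it verifies.
    answer_fn(question, rng) -> model answer (stub for the VLM policy)
    augment_fn(question)     -> harder question, same answer (stub synthesizer)
    """
    rng = random.Random(0)

    def passes(q, a):
        return sum(answer_fn(q, rng) == a for _ in range(n))

    out = []
    for q, a in seeds:
        c_seed = passes(q, a)
        if c_seed < n:            # stage 1: keep only "easy" seeds
            continue
        q_new = augment_fn(q)     # stage 2: targeted augmentation
        c_new = passes(q_new, a)  # stage 3: rollout-based verification
        if c_new >= k_min and (c_seed - c_new) >= delta:
            out.append((q, q_new, a))
    return out

# Stub policy: always solves "easy" questions, often fails "hard" ones.
def stub_answer(q, rng):
    if q.startswith("hard:"):
        return "A" if rng.random() < 0.45 else "B"
    return "A"

data = synthesize_dataset([("easy: q1", "A")], stub_answer,
                          lambda q: "hard: " + q)
```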
5. Significance and Broader Impacts
SynthRL advances RLVR for VLMs by supporting:
- Strictly verifiable, difficulty-controlled curriculum scaling, enabling models to develop more robust, compositional reasoning ability.
- Automated generation and verification of challenging data, minimizing human annotation costs and reducing label noise—a persistent problem in generative data augmentation.
- Practical curricula for reasoning over diverse inputs (images, equations, open-ended text), supporting transfer to broad, out-of-domain reasoning tasks.
This approach is shown to be robust across dataset sizes and is most effective when used to augment, not replace, core training data. A plausible implication is that as VLMs grow more powerful, such pipelines will be critical for pushing beyond current limitations in mathematical and multimodal reasoning.
6. Limitations and Directions for Future Research
- Current data synthesis is constrained by the performance and diversity of the synthesizer model; further gains may come from integrating multiple synthesizer architectures or models.
- Verification thresholds (minimum correct pass count, required increase in difficulty) may require retuning for different domains or levels of target model competence.
- Extending the approach to more complex modalities or domain-specific reasoning tasks, as well as to even larger datasets and more challenging benchmarks, is an outlined path for future research.
SynthRL establishes a general template for scalable, verifiably correct, and difficulty-progressive synthesis in reinforcement learning, with demonstrated utility in state-of-the-art visual mathematical reasoning models.