SynthRL: Scalable Data Synthesis for RLVR

Updated 30 June 2025
  • SynthRL is a scalable pipeline that generates verified, challenging training examples by augmenting easy seed questions while preserving answer consistency.
  • It uses a three-step process (difficulty-based seed selection, targeted question augmentation via VLMs, and stringent Monte Carlo-based verification) to ensure that synthesized questions remain answerable while becoming measurably harder.
  • Empirically, SynthRL expanded the MMK12 training set by 42% and boosted model performance on visual reasoning benchmarks, demonstrating practical impact on RLVR tasks.

SynthRL refers to a scalable and verifiable pipeline for automatically synthesizing challenging training data to improve reinforcement learning with verifiable rewards (RLVR) in vision-language models (VLMs), particularly for visual mathematical reasoning (2506.02096). SynthRL addresses the need for high-quality, difficult, and label-consistent training data in reasoning-oriented RL settings, where annotation costs are high and foundation models benefit most from harder, verifiably correct examples. It enables training curricula that are both enriched with more complex samples and rigorously free of label noise.

1. Design and Core Stages of SynthRL

SynthRL is defined by three sequential processes for synthesizing training data with guaranteed answer correctness and increased difficulty:

  1. Difficulty-based Seed Selection:
    • Selects “easy” questions from a labeled dataset: those that a target VLM policy answers correctly and reliably.
    • Evaluation is conducted via Monte Carlo rollout: each question is answered $N$ times with stochastic decoding, and the number of correct answers, $C_\mathrm{pass}$, measures how easily the model answers it.
  2. Targeted Question Augmentation (Synthesis):
    • Each selected easy seed $(I, Q, A)$ (image, question, answer) is transformed into a more challenging question $Q_\mathrm{cand}$, conditioned to retain the same answer $A$.
    • Augmentation is carried out by a powerful VLM (e.g., Gemini-2.5-Flash) acting as a synthesizer, given only the question and image, without access to the ground-truth answer.
  3. Guaranteed Verification:
    • Each candidate $(I, Q_\mathrm{cand}, A)$ is subject to strict model-based verification:
      • Correctness: The target VLM (used as a verifier) must answer $Q_\mathrm{cand}$ with $A$ at least $T_\mathrm{min}$ times across $N$ rollouts, confirming answerability.
      • Difficulty: $Q_\mathrm{cand}$ must be noticeably harder for the model, i.e., its pass count $C_\mathrm{pass}$ drops by at least $\Delta_\mathrm{hard}$ relative to the original question.
    • Only triples meeting both criteria are admitted to the expanded training set.

This methodology guarantees that synthesized samples are both reliably answerable and demand deeper, more compositional reasoning. A minimal end-to-end sketch of the loop appears below; its individual stages are detailed in Section 2.
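The following is a sketch of the three-stage loop, not the authors' implementation: `pass_count`, `synthesize_question`, and `verify_candidate` are hypothetical helpers (each sketched in Section 2), and the threshold defaults are illustrative.

```python
def synthrl_pipeline(dataset, policy, synthesizer, n_rollouts=16, t_select=12):
    """Expand a labeled dataset with verified, harder synthetic questions.
    `dataset` is an iterable of (image, question, answer) triples and
    `policy` is the target VLM; both interfaces are assumptions."""
    synthetic = []
    for image, question, answer in dataset:
        # Stage 1: difficulty-based seed selection -- keep only seeds the
        # target policy already answers reliably (C_pass >= T).
        c_ori = pass_count(policy, image, question, answer, n_rollouts)
        if c_ori < t_select:
            continue
        # Stage 2: targeted augmentation -- a stronger VLM rewrites the
        # question to be harder while preserving the answer.
        q_cand = synthesize_question(synthesizer, image, question)
        if q_cand is None:
            continue
        # Stage 3: guaranteed verification -- admit the candidate only if
        # it is still answerable but measurably harder than the seed.
        if verify_candidate(policy, image, q_cand, answer, c_ori, n_rollouts):
            synthetic.append((image, q_cand, answer))
    return list(dataset) + synthetic
```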

2. Technical Implementation Details

  • Seed Selection Formula:

$$C_\mathrm{pass}(I, Q, A; \pi) = \sum_{j=1}^{N} \mathbb{I}\left(A^{(j)}_\mathrm{pred} = A\right)$$

Questions with $C_\mathrm{pass} \geq T$ are selected; typical values are $N = 16$ and $T = 12$.
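As a concrete reading of this formula, the helper below estimates $C_\mathrm{pass}$ by exact-match voting over stochastic rollouts; `policy.answer` is an assumed interface for a single sampled decode.

```python
def pass_count(policy, image, question, answer, n_rollouts=16):
    """Monte Carlo estimate of C_pass: the number of N stochastic rollouts
    whose decoded answer exactly matches the reference answer A."""
    correct = 0
    for _ in range(n_rollouts):
        # One sampled (non-greedy) decode of the target VLM; the `answer`
        # method and its signature are assumptions for this sketch.
        prediction = policy.answer(image, question, temperature=1.0)
        correct += int(prediction == answer)   # indicator 1(A_pred == A)
    return correct

# A seed (I, Q, A) is kept when pass_count(...) >= T, e.g. 12 of 16 rollouts.
```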

  • Augmentation Prompt Example:

```
Given an image and the following question, transform it into a significantly more challenging version that requires deeper reasoning but maintains the same answer.
Original Question: {question}
Your Response Format:
New Question: {Your transformed question}
```
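A sketch of how the synthesizer might be invoked with this prompt. Here `vlm_client.generate` is a hypothetical wrapper around the provider API, not an actual Gemini call; note that the ground-truth answer is withheld from the synthesizer.

```python
AUGMENT_PROMPT = (
    "Given an image and the following question, transform it into a "
    "significantly more challenging version that requires deeper reasoning "
    "but maintains the same answer.\n"
    "Original Question: {question}\n"
    "Your Response Format:\n"
    "New Question: {{Your transformed question}}"   # doubled braces render literally
)

def synthesize_question(vlm_client, image, question):
    """Ask the synthesizer VLM (e.g., Gemini-2.5-Flash) for a harder,
    answer-preserving question. Only the image and seed question are sent;
    the ground-truth answer is deliberately withheld."""
    prompt = AUGMENT_PROMPT.format(question=question)
    response = vlm_client.generate(image=image, prompt=prompt)  # assumed API
    # Parse the single expected "New Question: ..." line from the response.
    for line in response.splitlines():
        if line.startswith("New Question:"):
            return line.removeprefix("New Question:").strip()
    return None  # malformed response; the pipeline may retry this seed
```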

  • Verification Criteria:

$$\text{Correctness:}\quad c_\mathrm{cand} \geq T_\mathrm{min}, \qquad \text{Difficulty:}\quad c_\mathrm{cand} \leq c_\mathrm{ori} - \Delta_\mathrm{hard}$$

Here $c_\mathrm{cand}$ and $c_\mathrm{ori}$ denote the pass counts of the candidate and original questions under the same $N$ rollouts.
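These two gates translate directly into a small predicate. The sketch below reuses the `pass_count` helper above; the default values for $T_\mathrm{min}$ and $\Delta_\mathrm{hard}$ are placeholders, as this summary does not report the paper's exact settings.

```python
def verify_candidate(policy, image, q_cand, answer, c_ori,
                     n_rollouts=16, t_min=4, delta_hard=4):
    """Admit a synthesized question only if both gates hold.
    The t_min and delta_hard defaults are illustrative placeholders."""
    c_cand = pass_count(policy, image, q_cand, answer, n_rollouts)
    answerable = c_cand >= t_min               # correctness: still solvable
    harder = c_cand <= c_ori - delta_hard      # difficulty: pass rate drops
    return answerable and harder
```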

  • Ablation and Quality Control:
    • The pipeline performs multiple synthesis attempts for each seed.
    • A supplementary LLM-based “judge” can flag redundant or low-quality augmentations, further filtering the candidate set.

3. Empirical Validation on Visual Math Reasoning

SynthRL was evaluated using the MMK12 dataset (8,072 open-ended visual math Q/A pairs):

  • Data Expansion: 3,380 new, strictly verified, and more difficult questions were synthesized and added, yielding a 42% expansion.
  • Difficulty Analysis: Synthesized questions exhibit a significantly lower mean pass rate and higher average reasoning step count (mean steps up by 33%), validating increased challenge and complexity.
  • Training Protocol: Models were trained with Group Relative Policy Optimization (GRPO), a critic-free policy-gradient method commonly used for RLVR; a minimal sketch of its core advantage computation follows this list.
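GRPO's defining step is normalizing each verifiable reward against the other rollouts sampled for the same prompt, which removes the need for a learned value critic. A minimal sketch of that advantage computation:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: G responses to one prompt are scored by
    the verifiable reward, then standardized within the group:
        A_i = (r_i - mean(r)) / std(r)
    """
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()
    if std == 0.0:                  # all rollouts equally (in)correct
        return np.zeros_like(r)     # the group carries no learning signal
    return (r - r.mean()) / std

# Example: 8 rollouts on one verified question, reward 1 for a correct answer.
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))  # positive for correct rollouts
```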

Performance on Out-of-Domain Benchmarks

SynthRL’s impact was measured on five rigorous visual reasoning benchmarks (MathVerse, MathVision, MathVista, WeMath, DynaMath):

| Data | MathVerse | MathVision | MathVista | WeMath | DynaMath | Average |
|------|-----------|------------|-----------|--------|----------|---------|
| MMK12 (seed) | 51.6 | 30.0 | 73.9 | 70.6 | 58.8 | 57.0 |
| +SynthRL | 53.5 | 29.6 | 74.2 | 72.6 | 60.1 | 58.0 |
  • Models trained with the augmented dataset outperformed the seed-only baseline on all but one benchmark, with greatest improvements observed in the most challenging evaluation subsets.
  • Gains increased with the scale of the seed set, suggesting strong benefits from larger and more diverse data pools.

Impact on Hard Samples

  • SynthRL preferentially raised accuracy on “medium” and “hard” benchmark examples, as determined by Elo/Bradley-Terry difficulty ranking (an illustrative sketch follows this list), with improvements of +1.7% and +1.6% at scale, respectively.
  • Accuracy on “easy” samples remained stable or, in some cases, slightly decreased due to the greater model focus on compositional reasoning.
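For intuition, difficulty ranking of this kind can be run as a rating game in which a question “wins” whenever a model answers it incorrectly; sorting the resulting ratings yields easy/medium/hard buckets. The Elo-style update below is an illustrative reconstruction under that assumption, not the paper's exact procedure.

```python
def elo_update(r_question, r_model, model_correct, k=32.0):
    """One Elo-style update between a question's difficulty rating and a
    model's ability rating. The question is the model's 'opponent': it
    scores a win when the model answers incorrectly. This pairing scheme
    and the constants are assumptions for illustration only."""
    expected_model = 1.0 / (1.0 + 10 ** ((r_question - r_model) / 400.0))
    score_model = 1.0 if model_correct else 0.0
    r_model += k * (score_model - expected_model)
    r_question += k * ((1.0 - score_model) - (1.0 - expected_model))
    return r_question, r_model
```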

4. Architectural Schematic

A schematic summarizing the SynthRL system:

| Stage | Input | Operation | Guarantee | Output |
|-------|-------|-----------|-----------|--------|
| 1. Seed Selection | $(I, Q, A)$, model policy $\pi$ | Monte Carlo rollouts, select “easy” samples | Reliable answerability | “Easy” Q/A/image tuples |
| 2. Synthesis | Selected triples | VLM generates harder question with same answer | Higher challenge | Candidate $(I, Q_\mathrm{cand}, A)$ |
| 3. Verification | Candidates, model policy $\pi$ | Rollout-based correctness and difficulty tests | Answerable, more difficult | Verified, harder Q/A/image tuples |

5. Significance and Broader Impacts

SynthRL advances RLVR for VLMs by supporting:

  • Strictly verifiable, difficulty-controlled curriculum scaling, enabling models to develop more robust, compositional reasoning ability.
  • Automated generation and verification of challenging data, minimizing human annotation costs and reducing label noise, a persistent problem in generative data augmentation.
  • Practical curricula for reasoning over diverse inputs (images, equations, open-ended text), supporting transfer to broad, out-of-domain reasoning tasks.

This approach is shown to be robust across dataset sizes and is most effective when used to augment, not replace, core training data. A plausible implication is that as VLMs grow more powerful, such pipelines will be critical for pushing beyond current limitations in mathematical and multimodal reasoning.

6. Limitations and Directions for Future Research

  • Current data synthesis is constrained by the performance and diversity of the synthesizer model; further gains may come from integrating multiple synthesizer architectures or models.
  • Verification thresholds (minimum correct pass count, required increase in difficulty) may require retuning for different domains or levels of target model competence.
  • Extending the approach to more complex modalities and domain-specific reasoning tasks, as well as to larger datasets and more challenging benchmarks, is identified as a direction for future research.

SynthRL establishes a general template for scalable, verifiably correct, and difficulty-progressive synthesis in reinforcement learning, with demonstrated utility in state-of-the-art visual mathematical reasoning models.

References

  1. SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis (arXiv:2506.02096).