
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation (2504.13055v3)

Published 17 Apr 2025 in cs.CV

Abstract: Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective data augmentation method that mixes trajectories from both clean and moderately distorted images during RL training. By injecting targeted diversity in visual perception and the resulting reasoning patterns, NoisyRollout promotes better policy exploration through vision-oriented inductive biases, ultimately leading to more robust reasoning behaviors. We further adopt a noise annealing schedule that gradually reduces distortion strength over training, leveraging noisy signals early on while ensuring training stability in later stages. Crucially, our method is easy-to-adopt--requiring no additional training cost and no modifications to the RL objective. Extensive experiments on $2$ distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across $5$ out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes ($7$B and $32$B) and data scales (from $1$K to $6$K), highlighting its generalizability and scalability.

Summary

  • The paper introduces NoisyRollout, a hybrid rollout strategy that combines clean and noisy trajectories to improve visual reasoning in VLMs.
  • It leverages a noise annealing schedule and reinforcement learning fine-tuning on 2.1K samples to boost generalization across multiple out-of-domain benchmarks.
  • Results demonstrate enhanced sample efficiency and cost-effectiveness by achieving state-of-the-art performance without additional computational overhead.

This paper introduces NoisyRollout, a reinforcement learning (RL) fine-tuning technique designed to enhance the reasoning and perception capabilities of vision-language models (VLMs) (2504.13055). It tackles two main challenges: ineffective policy exploration in RL for VLMs and the models' struggles with imperfect visual perception, which hinders complex reasoning.

The core idea of NoisyRollout is to introduce targeted diversity during the RL training's rollout phase by using trajectories generated from both clean and moderately distorted images. This is implemented as a Hybrid Rollout Strategy on top of the Group Relative Policy Optimization (GRPO) algorithm.
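
Concretely, the rewards of all rollouts for a sample, where the group consists of $n_1$ rollouts from the clean image and $n_2$ from the distorted one (as detailed below), share a single baseline and are normalized together, following the standard group-relative scheme of GRPO. A sketch of this computation, with $R_i$ the reward of the $i$-th rollout and $\delta$ a small stabilizing constant (an illustrative addition, set to $10^{-8}$ in the pseudocode later in this summary):

$\hat{A}_i = \frac{R_i - \bar{R}}{\mathrm{std}\bigl(R_1, \ldots, R_{n_1+n_2}\bigr) + \delta}, \qquad \bar{R} = \frac{1}{n_1 + n_2}\sum_{j=1}^{n_1+n_2} R_j$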

Implementation Details:

  1. Hybrid Rollout:
    • For each training sample (Image $I$, Query $\mathbf{q}$), generate a distorted version $\tilde{I} = T_{\alpha_t}(I)$ using a noise function $T$ (e.g., Gaussian noise) with strength $\alpha_t$.
    • The old policy $\pi_{\theta_{\mathrm{old}}}$ generates $n_1$ trajectories (rollouts) from the clean input $(I, \mathbf{q})$ and $n_2$ trajectories from the noisy input $(\tilde{I}, \mathbf{q})$.
    • All $n_1 + n_2$ trajectories are used together to compute the reward baseline $\bar{R}$ and the normalized advantages $\hat{A}_i$ within the GRPO framework.
    • Crucially, in the policy update the current policy $\pi_{\theta}$ is always conditioned on the clean image $I$, even for trajectories that were sampled from the noisy image. The objective function remains the standard GRPO loss, calculated with advantages derived from the mixed clean/noisy rollouts.

    $L(\theta) = \mathbb{E}\Biggl[ \frac{1}{n_1 + n_2}\sum_{i=1}^{n_1 + n_2} \min\!\Biggl(\frac{\pi_{\theta}(\mathbf{o}_i \mid I,\mathbf{q})}{\pi_{\theta_{\mathrm{old}}}(\mathbf{o}_i \mid I,\mathbf{q})}\hat{A}_i,\ \mathrm{clip}\!\Bigl(\frac{\pi_{\theta}(\mathbf{o}_i \mid I,\mathbf{q})}{\pi_{\theta_{\mathrm{old}}}(\mathbf{o}_i \mid I,\mathbf{q})},\,1-\epsilon,\,1+\epsilon\Bigr)\hat{A}_i \Biggr) \Biggr]$

    (Note: the policy $\pi_{\theta}$ is always evaluated conditioned on the clean image $I$, even when computing the importance ratio for advantages $\hat{A}_i$ derived from noisy rollouts.)

  2. Noise Annealing Schedule:

    • To maintain training stability, especially in later stages, the noise strength $\alpha_t$ is gradually reduced over training steps $t$.
    • The paper uses a sigmoid schedule: $\alpha_t = \alpha_0 \cdot \left(1 - \frac{1}{1 + e^{-\lambda(t / T_{\text{max}} - \gamma)}}\right)$ (a small code sketch of this schedule appears after this list).
    • This allows the model to benefit from diverse, noisy signals early on and transitions towards more on-policy updates later, mitigating distributional mismatch.
  3. Setup:
    • Base Model: Qwen2.5-VL-7B-Instruct
    • RL Framework: EasyR1
    • Training Data: 2.1K samples from Geometry3K or K12 datasets.
    • Noise: Gaussian noise with initial strength $\alpha_0 = 500$ (Geo3K) or $\alpha_0 = 450$ (K12).
    • Rollouts: $n_1 = 6$, $n_2 = 6$.
    • Annealing Params: $\gamma = 2/3$, $\lambda = 30$.
    • Training: 8 A100-40G GPUs, batch size 128, 15 episodes, 60 optimization steps. Vision encoder frozen, KL divergence constraint omitted.
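
As referenced in the annealing item above, here is a minimal sketch of the sigmoid schedule; the function and argument names (noise_strength, t_max, lam) are illustrative rather than taken from the paper's code, and the example values assume the 60-step, $\alpha_0 = 500$ setting listed above.

import math

def noise_strength(t: int, t_max: int, alpha_0: float = 500.0,
                   lam: float = 30.0, gamma: float = 2 / 3) -> float:
    """Sigmoid noise-annealing schedule:
    alpha_t = alpha_0 * (1 - 1 / (1 + exp(-lam * (t / t_max - gamma)))).
    Stays near alpha_0 early in training, halves at t = gamma * t_max,
    and decays towards zero by t = t_max."""
    return alpha_0 * (1.0 - 1.0 / (1.0 + math.exp(-lam * (t / t_max - gamma))))

# With alpha_0 = 500, lam = 30, gamma = 2/3 over 60 optimization steps:
# noise_strength(0, 60)  -> ~500.0  (near-full noise strength at the start)
# noise_strength(40, 60) -> 250.0   (half strength at the annealing midpoint)
# noise_strength(60, 60) -> ~0.02   (essentially clean rollouts by the end)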

Algorithm Overview (a schematic sketch of one training step; sample_policy, calculate_log_probs, T_noise, and R_func are assumed helpers rather than code from the paper's release):

import torch

def noisy_rollout_step(pi_theta, pi_theta_old, batch, n1, n2, alpha_t,
                       T_noise, R_func, epsilon, optimizer):
    I_batch, q_batch = batch                  # images and queries, each of length batch_size

    # 1. Generate noisy images with the current noise strength alpha_t
    I_tilde_batch = T_noise(I_batch, alpha_t)

    # 2. Generate rollouts with the old policy
    with torch.no_grad():
        # n1 clean rollouts per sample, conditioned on the clean images
        clean_rollouts_o = sample_policy(pi_theta_old, I_batch, q_batch, n=n1)
        # n2 noisy rollouts per sample, conditioned on the distorted images
        noisy_rollouts_o = sample_policy(pi_theta_old, I_tilde_batch, q_batch, n=n2)

    # Combine rollouts per sample: all_rollouts_o[b] holds the n1 + n2 responses for sample b
    all_rollouts_o = [c + n for c, n in zip(clean_rollouts_o, noisy_rollouts_o)]

    # 3. Compute rewards and group-normalized advantages (rewards are scored against the clean image I)
    rewards_per_sample = torch.tensor(
        [[R_func(I, q, o) for o in rollouts]
         for I, q, rollouts in zip(I_batch, q_batch, all_rollouts_o)],
        dtype=torch.float32,
    )                                                            # Shape: (batch_size, n1 + n2)
    baseline_R = rewards_per_sample.mean(dim=1, keepdim=True)
    std_R = rewards_per_sample.std(dim=1, keepdim=True) + 1e-8   # avoid division by zero
    advantages = ((rewards_per_sample - baseline_R) / std_R).flatten()  # Shape: (batch_size * (n1 + n2),)

    # 4. Compute GRPO loss. IMPORTANT: both policies are scored conditioned ONLY on the
    #    clean images I, even for trajectories that were sampled from the noisy images.
    log_probs_theta_old = calculate_log_probs(pi_theta_old, I_batch, q_batch, all_rollouts_o)
    log_probs_theta = calculate_log_probs(pi_theta, I_batch, q_batch, all_rollouts_o)
    # (per-trajectory sequence log-probs, shape (batch_size * (n1 + n2),))

    ratios = torch.exp(log_probs_theta - log_probs_theta_old.detach())
    surr1 = ratios * advantages.detach()
    surr2 = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages.detach()
    loss = -torch.min(surr1, surr2).mean()

    # 5. Update the current policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The old policy is refreshed from the updated policy outside this step,
    # e.g. pi_theta_old.load_state_dict(pi_theta.state_dict())
    return loss.item()
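
For context, a hypothetical outer loop tying the annealing schedule to the update step might look as follows. The dataloader, the two policy handles, T_noise, R_func, and the optimizer are assumed to exist; the clipping range of 0.2 is an assumed value; and syncing the old policy after every step is a simplification, not a detail taken from the paper's EasyR1-based implementation.

T_MAX = 60          # total optimization steps reported above
N1, N2 = 6, 6       # clean / noisy rollouts per sample
EPSILON = 0.2       # PPO-style clipping range (assumed value)

for step, batch in enumerate(dataloader):
    if step >= T_MAX:
        break
    alpha_t = noise_strength(step, T_MAX)                # annealed noise strength for this step
    loss = noisy_rollout_step(pi_theta, pi_theta_old, batch,
                              n1=N1, n2=N2, alpha_t=alpha_t,
                              T_noise=T_noise, R_func=R_func,
                              epsilon=EPSILON, optimizer=optimizer)
    pi_theta_old.load_state_dict(pi_theta.state_dict())  # refresh the old policy (simplification)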

Key Results and Practical Implications:

  • Improved Generalization: Achieves SOTA results among open-source RL-tuned models on 5 out-of-domain reasoning and perception benchmarks (MathVerse, MathVision, MathVista, WeMath, HallusionBench) using only 2.1K training samples.
  • Enhanced Perception: Notably improves performance on HallusionBench, mitigating the perception degradation often seen when reasoning templates are used. The hybrid rollout implicitly provides contrastive signals that help refine visual perception.
  • Sample Efficiency: Generalizes well from significantly less RL data (2.1K) compared to other methods requiring large SFT datasets (e.g., 155K SFT + 10K RL for R1-OneVision-7B) or more RL data (15K for MM-Eureka).
  • Cost-Effective: Requires no extra training cost or modifications to the RL objective, making it a "free lunch" addition to standard GRPO training pipelines.
  • Robustness: Shows consistent improvements when trained on different datasets (Geometry3K, K12) and is compatible with Dr.GRPO (an unbiased GRPO variant).
  • Targeted Exploration: Ablations show NoisyRollout provides more effective rollout diversity for visual reasoning compared to simply increasing rollout temperature, which introduces more general, less targeted variability.
  • Stability: The noise annealing schedule is crucial for preventing divergence and achieving good performance. Ablations without it showed sharp performance drops and lower final scores.

Limitations/Unsuccessful Attempts:

  • Optimizing policy updates based on noisy images (instead of clean images) did not yield improvements.
  • Other image augmentations like cropping or rotation often led to information loss and training instability. Gaussian noise was found to be a better regularizer.
  • Adding explicit reward penalties for noisy rollouts caused the model to learn to distinguish noisy inputs rather than improve reasoning, leading to divergence.

In summary, NoisyRollout presents a practical and effective method for enhancing VLM reasoning and perception through RL. By simply mixing rollouts from clean and noisy images during training (while only updating the policy based on clean images) and using a noise annealing schedule, it significantly boosts generalization and sample efficiency without added computational overhead during the policy update step.
