
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation (2504.13055v3)

Published 17 Apr 2025 in cs.CV

Abstract: Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective data augmentation method that mixes trajectories from both clean and moderately distorted images during RL training. By injecting targeted diversity in visual perception and the resulting reasoning patterns, NoisyRollout promotes better policy exploration through vision-oriented inductive biases, ultimately leading to more robust reasoning behaviors. We further adopt a noise annealing schedule that gradually reduces distortion strength over training, leveraging noisy signals early on while ensuring training stability in later stages. Crucially, our method is easy-to-adopt--requiring no additional training cost and no modifications to the RL objective. Extensive experiments on $2$ distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across $5$ out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes ($7$B and $32$B) and data scales (from $1$K to $6$K), highlighting its generalizability and scalability.

Summary

  • The paper introduces NoisyRollout, a hybrid rollout strategy that combines clean and noisy trajectories to improve visual reasoning in VLMs.
  • It leverages a noise annealing schedule and reinforcement learning fine-tuning on 2.1K samples to boost generalization across multiple out-of-domain benchmarks.
  • Results demonstrate enhanced sample efficiency and cost-effectiveness by achieving state-of-the-art performance without additional computational overhead.

This paper introduces NoisyRollout, a reinforcement learning (RL) fine-tuning technique designed to enhance the reasoning and perception capabilities of vision-language models (VLMs) (2504.13055). It tackles two main challenges: ineffective policy exploration in RL for VLMs and the models' struggles with imperfect visual perception, which hinders complex reasoning.

The core idea of NoisyRollout is to introduce targeted diversity during the RL training's rollout phase by using trajectories generated from both clean and moderately distorted images. This is implemented as a Hybrid Rollout Strategy on top of the Group Relative Policy Optimization (GRPO) algorithm.
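
Concretely, the rewards of all rollouts for a sample, where the group consists of $n_1$ rollouts from the clean image and $n_2$ from the distorted one (as detailed below), share a single baseline and are normalized together, following the standard group-relative scheme of GRPO. A sketch of this computation, with $R_i$ the reward of the $i$-th rollout and $\delta$ a small stabilizing constant (an illustrative addition, set to $10^{-8}$ in the pseudocode later in this summary):

$\hat{A}_i = \frac{R_i - \bar{R}}{\mathrm{std}\bigl(R_1, \ldots, R_{n_1+n_2}\bigr) + \delta}, \qquad \bar{R} = \frac{1}{n_1 + n_2}\sum_{j=1}^{n_1+n_2} R_j$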

Implementation Details:

  1. Hybrid Rollout:
    • For each training sample (Image $I$, Query $\mathbf{q}$), generate a distorted version $\tilde{I} = T_{\alpha_t}(I)$ using a noise function $T$ (e.g., Gaussian noise) with strength $\alpha_t$.
    • The old policy $\pi_{\theta_{\mathrm{old}}}$ generates $n_1$ trajectories (rollouts) from the clean input $(I, \mathbf{q})$ and $n_2$ trajectories from the noisy input $(\tilde{I}, \mathbf{q})$.
    • All $n_1 + n_2$ trajectories are used together to compute the reward baseline $\bar{R}$ and the normalized advantages $\hat{A}_i$ within the GRPO framework.
    • Crucially, in the policy update the current policy $\pi_{\theta}$ is always conditioned on the clean image $I$, even for trajectories that were sampled from the noisy image. The objective function remains the standard GRPO loss, calculated with advantages derived from the mixed clean/noisy rollouts.

    $L(\theta) = \mathbb{E}\Biggl[ \frac{1}{n_1 + n_2}\sum_{i=1}^{n_1 + n_2} \min\!\Biggl(\frac{\pi_{\theta}(\mathbf{o}_i \mid I,\mathbf{q})}{\pi_{\theta_{\mathrm{old}}}(\mathbf{o}_i \mid I,\mathbf{q})}\hat{A}_i,\ \mathrm{clip}\!\Bigl(\frac{\pi_{\theta}(\mathbf{o}_i \mid I,\mathbf{q})}{\pi_{\theta_{\mathrm{old}}}(\mathbf{o}_i \mid I,\mathbf{q})},\,1-\epsilon,\,1+\epsilon\Bigr)\hat{A}_i \Biggr) \Biggr]$

    (Note: the policy $\pi_{\theta}$ is always evaluated conditioned on the clean image $I$, even when computing the importance ratio for advantages $\hat{A}_i$ derived from noisy rollouts.)

  2. Noise Annealing Schedule:

    • To maintain training stability, especially in later stages, the noise strength $\alpha_t$ is gradually reduced over training steps $t$.
    • The paper uses a sigmoid schedule: $\alpha_t = \alpha_0 \cdot \left(1 - \frac{1}{1 + e^{-\lambda(t / T_{\text{max}} - \gamma)}}\right)$ (a small code sketch of this schedule appears after this list).
    • This allows the model to benefit from diverse, noisy signals early on and transitions towards more on-policy updates later, mitigating distributional mismatch.
  3. Setup:
    • Base Model: Qwen2.5-VL-7B-Instruct
    • RL Framework: EasyR1
    • Training Data: 2.1K samples from Geometry3K or K12 datasets.
    • Noise: Gaussian noise with initial strength $\alpha_0 = 500$ (Geo3K) or $\alpha_0 = 450$ (K12).
    • Rollouts: $n_1 = 6$, $n_2 = 6$.
    • Annealing Params: $\gamma = 2/3$, $\lambda = 30$.
    • Training: 8 A100-40G GPUs, batch size 128, 15 episodes, 60 optimization steps. Vision encoder frozen, KL divergence constraint omitted.
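
As referenced in the annealing item above, here is a minimal sketch of the sigmoid schedule; the function and argument names (noise_strength, t_max, lam) are illustrative rather than taken from the paper's code, and the example values assume the 60-step, $\alpha_0 = 500$ setting listed above.

import math

def noise_strength(t: int, t_max: int, alpha_0: float = 500.0,
                   lam: float = 30.0, gamma: float = 2 / 3) -> float:
    """Sigmoid noise-annealing schedule:
    alpha_t = alpha_0 * (1 - 1 / (1 + exp(-lam * (t / t_max - gamma)))).
    Stays near alpha_0 early in training, halves at t = gamma * t_max,
    and decays towards zero by t = t_max."""
    return alpha_0 * (1.0 - 1.0 / (1.0 + math.exp(-lam * (t / t_max - gamma))))

# With alpha_0 = 500, lam = 30, gamma = 2/3 over 60 optimization steps:
# noise_strength(0, 60)  -> ~500.0  (near-full noise strength at the start)
# noise_strength(40, 60) -> 250.0   (half strength at the annealing midpoint)
# noise_strength(60, 60) -> ~0.02   (essentially clean rollouts by the end)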

Algorithm Overview (a schematic sketch of one training step; sample_policy, calculate_log_probs, T_noise, and R_func are assumed helpers rather than code from the paper's release):

import torch

def noisy_rollout_step(pi_theta, pi_theta_old, batch, n1, n2, alpha_t,
                       T_noise, R_func, epsilon, optimizer):
    I_batch, q_batch = batch                  # images and queries, each of length batch_size

    # 1. Generate noisy images with the current noise strength alpha_t
    I_tilde_batch = T_noise(I_batch, alpha_t)

    # 2. Generate rollouts with the old policy
    with torch.no_grad():
        # n1 clean rollouts per sample, conditioned on the clean images
        clean_rollouts_o = sample_policy(pi_theta_old, I_batch, q_batch, n=n1)
        # n2 noisy rollouts per sample, conditioned on the distorted images
        noisy_rollouts_o = sample_policy(pi_theta_old, I_tilde_batch, q_batch, n=n2)

    # Combine rollouts per sample: all_rollouts_o[b] holds the n1 + n2 responses for sample b
    all_rollouts_o = [c + n for c, n in zip(clean_rollouts_o, noisy_rollouts_o)]

    # 3. Compute rewards and group-normalized advantages (rewards are scored against the clean image I)
    rewards_per_sample = torch.tensor(
        [[R_func(I, q, o) for o in rollouts]
         for I, q, rollouts in zip(I_batch, q_batch, all_rollouts_o)],
        dtype=torch.float32,
    )                                                            # Shape: (batch_size, n1 + n2)
    baseline_R = rewards_per_sample.mean(dim=1, keepdim=True)
    std_R = rewards_per_sample.std(dim=1, keepdim=True) + 1e-8   # avoid division by zero
    advantages = ((rewards_per_sample - baseline_R) / std_R).flatten()  # Shape: (batch_size * (n1 + n2),)

    # 4. Compute GRPO loss. IMPORTANT: both policies are scored conditioned ONLY on the
    #    clean images I, even for trajectories that were sampled from the noisy images.
    log_probs_theta_old = calculate_log_probs(pi_theta_old, I_batch, q_batch, all_rollouts_o)
    log_probs_theta = calculate_log_probs(pi_theta, I_batch, q_batch, all_rollouts_o)
    # (per-trajectory sequence log-probs, shape (batch_size * (n1 + n2),))

    ratios = torch.exp(log_probs_theta - log_probs_theta_old.detach())
    surr1 = ratios * advantages.detach()
    surr2 = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages.detach()
    loss = -torch.min(surr1, surr2).mean()

    # 5. Update the current policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The old policy is refreshed from the updated policy outside this step,
    # e.g. pi_theta_old.load_state_dict(pi_theta.state_dict())
    return loss.item()
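
For context, a hypothetical outer loop tying the annealing schedule to the update step might look as follows. The dataloader, the two policy handles, T_noise, R_func, and the optimizer are assumed to exist; the clipping range of 0.2 is an assumed value; and syncing the old policy after every step is a simplification, not a detail taken from the paper's EasyR1-based implementation.

T_MAX = 60          # total optimization steps reported above
N1, N2 = 6, 6       # clean / noisy rollouts per sample
EPSILON = 0.2       # PPO-style clipping range (assumed value)

for step, batch in enumerate(dataloader):
    if step >= T_MAX:
        break
    alpha_t = noise_strength(step, T_MAX)                # annealed noise strength for this step
    loss = noisy_rollout_step(pi_theta, pi_theta_old, batch,
                              n1=N1, n2=N2, alpha_t=alpha_t,
                              T_noise=T_noise, R_func=R_func,
                              epsilon=EPSILON, optimizer=optimizer)
    pi_theta_old.load_state_dict(pi_theta.state_dict())  # refresh the old policy (simplification)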

Key Results and Practical Implications:

  • Improved Generalization: Achieves SOTA results among open-source RL-tuned models on 5 out-of-domain reasoning and perception benchmarks (MathVerse, MathVision, MathVista, WeMath, HallusionBench) using only 2.1K training samples.
  • Enhanced Perception: Notably improves performance on HallusionBench, mitigating the perception degradation often seen when reasoning templates are used. The hybrid rollout implicitly provides contrastive signals that help refine visual perception.
  • Sample Efficiency: Generalizes well from significantly less RL data (2.1K) compared to other methods requiring large SFT datasets (e.g., 155K SFT + 10K RL for R1-OneVision-7B) or more RL data (15K for MM-Eureka).
  • Cost-Effective: Requires no extra training cost or modifications to the RL objective, making it a "free lunch" addition to standard GRPO training pipelines.
  • Robustness: Shows consistent improvements when trained on different datasets (Geometry3K, K12) and is compatible with Dr.GRPO (an unbiased GRPO variant).
  • Targeted Exploration: Ablations show NoisyRollout provides more effective rollout diversity for visual reasoning compared to simply increasing rollout temperature, which introduces more general, less targeted variability.
  • Stability: The noise annealing schedule is crucial for preventing divergence and achieving good performance. Ablations without it showed sharp performance drops and lower final scores.

Limitations/Unsuccessful Attempts:

  • Optimizing policy updates based on noisy images (instead of clean images) did not yield improvements.
  • Other image augmentations like cropping or rotation often led to information loss and training instability. Gaussian noise was found to be a better regularizer.
  • Adding explicit reward penalties for noisy rollouts caused the model to learn to distinguish noisy inputs rather than improve reasoning, leading to divergence.

In summary, NoisyRollout presents a practical and effective method for enhancing VLM reasoning and perception through RL. By simply mixing rollouts from clean and noisy images during training (while only updating the policy based on clean images) and using a noise annealing schedule, it significantly boosts generalization and sample efficiency without added computational overhead during the policy update step.
