VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (2504.07615v2)

Published 10 Apr 2025 in cs.CV and cs.CL

Abstract: Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of LLMs through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1

This paper introduces VLM-R1, a framework designed to apply R1-style reinforcement learning (RL) to Vision-Language Models (VLMs) to enhance their visual reasoning and understanding capabilities. Inspired by the success of DeepSeek R1 (DeepSeek-AI et al., 22 Jan 2025) in improving LLMs with rule-based rewards derived from tasks that have deterministic ground truth, VLM-R1 extends this approach to the visual domain, where many tasks similarly offer precise ground-truth annotations.

The core motivation is that rule-based rewards are stable and interpretable, making them suitable for optimizing VLM performance on tasks like object detection and comprehension where ground-truth bounding boxes or labels are available. The authors observe that current VLMs often underperform specialized vision models on such tasks, despite having significantly more parameters, suggesting a gap that RL could help close.

The VLM-R1 framework, built upon Open-R1 [openr1], provides a dedicated and extensible pipeline for training VLMs with Group Relative Policy Optimization (GRPO) (Shao et al., 5 Feb 2024), the RL algorithm used in DeepSeek R1. Key implementation features of VLM-R1 include:

  • GRPO Compatibility: Full support for the GRPO algorithm with fine-grained hyperparameter control.
  • LoRA-based Training: Parameter-efficient training using LoRA (Hu et al., 2021), making RL fine-tuning accessible under limited computational resources (a minimal setup sketch follows this list).
  • Multi-node Training: Support for distributed training across multiple GPUs or server nodes for scalability.
  • Multi-image Input: Ability to handle multiple images per sample for complex reasoning tasks.
  • Model Flexibility: Compatible with various VLM architectures through a modular VLM Module. It currently supports models such as Qwen-VL (Bai et al., 19 Feb 2025; Wang et al., 18 Sep 2024) and InternVL (Chen et al., 6 Dec 2024; Chen et al., 25 Apr 2024).
  • Custom Dataset Support: Easy integration of user-defined datasets.
  • Mixed Modality Training: Supports training on image-text, pure-text, or hybrid datasets.
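
As a rough illustration of the LoRA-based training option, the sketch below attaches low-rank adapters to a VLM with Hugging Face's peft library. The checkpoint name, rank, and target modules are illustrative assumptions, not VLM-R1's actual defaults.

```python
# Minimal sketch: LoRA adapters for parameter-efficient RL training of a VLM.
# Checkpoint, rank, and target_modules below are illustrative, not VLM-R1's defaults.
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct"
)

lora_config = LoraConfig(
    r=16,                    # low-rank dimension of the adapters
    lora_alpha=32,           # scaling factor applied to the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients
```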

The framework's pipeline consists of two main stages: preparation (defining rewards, data loading using grpo_jsonl.py) and training (grpo_trainer.py), which handles model initialization, parameter configuration (LoRA, freezing, or full training), sampling responses, scoring with reward functions, and optimizing the policy using GRPO loss. The VLM Module acts as an interface, abstracting model-specific details like class names and chat templates, allowing the trainer to work with different VLMs seamlessly.
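
To make the GRPO step concrete, the following sketch shows the group-relative advantage computation at the core of the algorithm as described by Shao et al.; it is a simplified illustration, not code taken from grpo_trainer.py. For each prompt, a group of responses is sampled, each is scored by the rule-based rewards, and every response's advantage is its reward standardized against the group's mean and standard deviation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    rewards: shape (G,), rule-based rewards for G responses sampled
             from the current policy for the same prompt.
    Returns: shape (G,), rewards standardized within the group, so each
             response is compared against its own group rather than a
             learned value function.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled responses to one query, each scored by the reward functions.
rewards = torch.tensor([0.91, 0.10, 0.55, 0.78])
print(grpo_advantages(rewards))
```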

The paper focuses on two tasks to evaluate VLM-R1: Referring Expression Comprehension (REC) and Open-Vocabulary Object Detection (OVD). Both tasks require bounding box outputs but differ in complexity. The reward functions for these tasks combine an accuracy reward based on standard computer vision metrics and a format reward to ensure structured output.
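
As a hedged sketch of how such a rule-based reward can be composed, the snippet below pairs a task-specific accuracy term with a format check for the `<think>`/`<answer>` structure described in the reward definitions that follow; the function names and the exact regular expression are assumptions, not VLM-R1's implementation.

```python
import re

# Require a <think>...</think> block followed by an <answer>...</answer> block.
# The exact pattern is illustrative; VLM-R1's real format check may differ.
FORMAT_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>/<answer> structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.search(completion) else 0.0

def total_reward(completion: str, accuracy_reward: float) -> float:
    """Combine the task-specific accuracy term (e.g. IoU- or mAP-based)
    with the structural format term."""
    return accuracy_reward + format_reward(completion)
```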

  • REC Reward: The accuracy reward is the Intersection-over-Union (IoU) between the predicted bounding box and the ground truth, $R^{rec}_{acc}(q, o) = \mathrm{IoU}(b^{*}, f_{rec}(o))$, where $f_{rec}$ extracts the box from the model's output $o$ and $b^{*}$ is the ground-truth box. The format reward checks for adherence to a specified JSON-style output within `<answer>` tags, including a `<think>` tag.
  • OVD Reward: For OVD, which involves multiple bounding boxes and labels, the accuracy reward is based on mean Average Precision (mAP). However, the authors found that a naive mAP reward leads to "reward hacking": the model over-predicts boxes to maximize reward because categories absent from the ground truth are excluded from evaluation. To counter this, they introduce an odLength reward with a penalty factor $s_{ovd} = \min\!\left(1, \frac{L_{gt}}{L_{pred}}\right)$ for redundant predictions, where $L_{gt}$ and $L_{pred}$ are the numbers of ground-truth and predicted combinations, respectively. The OVD format reward checks for markdown-style JSON output within `<answer>` tags, also including a `<think>` tag. (A code sketch of both accuracy rewards appears after the results summary below.)

Experiments using Qwen2.5-VL-3B-Instruct (with 7B and 32B variants for scaling studies) demonstrate the effectiveness of VLM-R1.

  • REC Results: Trained on the RefCOCO/+/g training splits, the RL model showed steady performance gains on the in-domain validation sets and, critically, achieved significant improvements on the out-of-domain, reasoning-intensive LISA-Grounding benchmark compared to the SFT baseline. SFT performance degraded on the out-of-domain set, while RL generalized effectively, highlighting RL's ability to learn transferable reasoning capabilities.
  • OVD Results: Trained on the D$^3$ dataset, the RL model substantially outperformed the SFT model on both COCO$_{filtered}$ (21.1 mAP vs. 18.5 mAP) and the comprehensive OVDEval benchmark (31.01 NMS-AP vs. 26.50 NMS-AP). The RL model was stronger on complex subtasks such as Position, Relationship, and Negation detection. Compared with specialized OVD models like OmDet (Lu et al., 8 Mar 2024) and Grounding-DINO (Ren et al., 16 May 2024), the VLM-R1 model excelled in tasks requiring world knowledge and entity recognition (e.g., Celebrity detection), while the specialized models remained stronger at fine-grained detection of small objects.
  • Reward Hacking Ablation: The paper reveals severe reward hacking in OVD with a naive AP reward, leading to dramatically increased output lengths (over-prediction). The proposed odLength reward effectively mitigates this, stabilizing output length and encouraging an "OD aha moment" in which the model first reasons about object presence before predicting bounding boxes.
  • Training Data Ablation: Training OVD on the semantically richer D$^3$ dataset yielded significantly better results than training on COCO data, even when evaluating on COCO. This suggests that complex, meaning-intensive data is crucial for fostering reasoning and generalization in VLMs via RL.
  • Model Size Ablation: RL consistently improved performance across the 3B, 7B, and 32B models, particularly on reasoning-intensive subtasks. Larger models showed greater post-RL gains on tasks requiring finer-grained visual perception (such as Color detection).

The paper concludes that VLM-R1 successfully brings R1-style RL to visual understanding tasks, demonstrating superior generalization compared to SFT, especially in scenarios requiring reasoning.
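
To ground the reward engineering discussed above, here is a minimal sketch of the two accuracy-reward components: IoU for REC and a length-penalized detection score for OVD. The helper names are illustrative, and `base_map` stands in for a full mAP computation, so this is an assumption-laden sketch rather than VLM-R1's actual code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(box_a: Box, box_b: Box) -> float:
    """REC accuracy reward: IoU between the predicted and ground-truth boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def odlength_reward(num_pred: int, num_gt: int, base_map: float) -> float:
    """OVD accuracy reward with the odLength penalty: the mAP-style score is
    scaled by s_ovd = min(1, L_gt / L_pred), so over-predicting boxes
    (L_pred > L_gt) lowers the reward instead of inflating it."""
    if num_pred == 0:
        return 0.0
    s_ovd = min(1.0, num_gt / num_pred)
    return s_ovd * base_map
```
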
Key takeaways for practical application include the critical importance of careful reward engineering to prevent reward hacking in complex tasks like OVD and the necessity of selecting high-quality, semantically rich training data to elicit robust reasoning behaviors. The framework and models are open-sourced to support further research in vision-language RL.
Authors (12)
  1. Haozhan Shen
  2. Peng Liu
  3. Jingcheng Li
  4. Chunxin Fang
  5. Yibo Ma
  6. Jiajia Liao
  7. Qiaoli Shen
  8. Zilun Zhang
  9. Kangjia Zhao
  10. Qianqian Zhang
  11. Ruochen Xu
  12. Tiancheng Zhao