An Overview of "Vision-R1: Incentivizing Reasoning Capability in Multimodal LLMs"
The paper "Vision-R1: Incentivizing Reasoning Capability in Multimodal LLMs" addresses a notable challenge in advancing Multimodal LLMs (MLLMs) — enhancing complex reasoning capabilities via Reinforcement Learning (RL). The authors attempt to build upon the established success of reasoning in LLMs by utilizing RL strategies catered for multimodal contexts. This essay will delineate the methodological framework, numerical highlights, and implications of this research.
Methodological Insights
The paper introduces Vision-R1, a reasoning-capable MLLM developed through a combination of cold-start initialization and RL training. Attempts to apply RL directly to MLLMs, in the style of text-only successes such as DeepSeek-R1-Zero, struggled because high-quality multimodal reasoning data is scarce. To address this, the authors devised a cold-start pipeline centered on a data-construction process called Modality Bridging. An existing MLLM first generates a Pseudo-CoT that includes a detailed textual description of each image; this description bridges the visual and textual modalities, allowing the text-only DeepSeek-R1 to produce high-quality chains of thought, which are then filtered into a roughly 200K-example multimodal CoT dataset conveying rich, human-like cognitive processes.
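The pipeline can be made concrete with a short sketch. The Python code below is a minimal illustration of the Modality Bridging idea under stated assumptions: the model-calling functions (`mllm_generate`, `r1_generate`) and the answer extractor are hypothetical placeholders standing in for whichever MLLM, DeepSeek-R1 endpoint, and rule-based filter are actually used; it is not the authors' released code.

```python
# Minimal sketch of the Modality Bridging data pipeline (hypothetical helpers).
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sample:
    image_path: str
    question: str
    ground_truth: str

@dataclass
class CoTExample:
    question: str
    caption: str        # textual "bridge" describing the image
    reasoning: str      # chain of thought produced by the text-only reasoner
    answer: str

def build_cot_example(
    sample: Sample,
    mllm_generate: Callable[[str, str], str],   # (image_path, prompt) -> text
    r1_generate: Callable[[str], str],          # prompt -> reasoning + answer
    extract_answer: Callable[[str], str],       # pull the final answer from text
) -> Optional[CoTExample]:
    """Convert one multimodal sample into a text-only CoT training example."""
    # Step 1: ask the MLLM for a Pseudo-CoT, i.e. a detailed description of the
    # image plus a first-pass reasoning outline, so visual content becomes text.
    caption = mllm_generate(
        sample.image_path,
        f"Describe the image in detail, then outline how to answer: {sample.question}",
    )

    # Step 2: feed the textual bridge to the text-only reasoning model
    # (DeepSeek-R1 in the paper) to obtain a high-quality chain of thought.
    r1_output = r1_generate(
        f"Image description: {caption}\nQuestion: {sample.question}\n"
        "Think step by step, then give the final answer."
    )

    # Step 3: rule-based filtering: keep the example only if the final answer
    # matches the ground truth.
    answer = extract_answer(r1_output)
    if answer.strip() != sample.ground_truth.strip():
        return None
    return CoTExample(sample.question, caption, r1_output, answer)
```

The key design point is that the image itself never reaches the text-only reasoner; only its textual description does, which is precisely the "bridge" that lets a strong text reasoning model supply CoT supervision for multimodal data.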
With the resulting multimodal CoT dataset, dubbed Vision-R1-cold, the MLLM undergoes cold-start initialization, ensuring it acquires the basic structure of extended reasoning before optimization begins. The RL stage then applies Progressive Thinking Suppression Training (PTST) within a GRPO framework: reasoning length is tightly capped at first and the cap is progressively relaxed, so the model learns to reason correctly before it learns to reason at length, improving both the complexity and the correctness of its outputs. This staged combination lets Vision-R1 surpass naive RL training and avoid pitfalls such as the Overthinking Optimization Problem the authors observe in their experiments.
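A compact sketch helps clarify how PTST interacts with GRPO. In the code below, the stage schedule, length caps, reward rules, and the crude whitespace token count are illustrative assumptions rather than the paper's exact configuration; the group-relative advantage is shown in its standard GRPO form (reward minus group mean, divided by group standard deviation).

```python
# Minimal sketch of Progressive Thinking Suppression Training (PTST) on top of
# a GRPO-style update. All hyperparameters below are illustrative assumptions.
import numpy as np

# PTST: cap the allowed reasoning length early in training, then progressively
# relax the cap so longer chains of thought are only rewarded once the model
# has learned to reason correctly within shorter budgets.
PTST_STAGES = [
    {"steps": 1000, "max_think_tokens": 4096},
    {"steps": 1000, "max_think_tokens": 8192},
    {"steps": 1000, "max_think_tokens": 16384},
]

def current_length_cap(global_step: int) -> int:
    """Return the reasoning-length cap in effect at this training step."""
    budget = 0
    for stage in PTST_STAGES:
        budget += stage["steps"]
        if global_step < budget:
            return stage["max_think_tokens"]
    return PTST_STAGES[-1]["max_think_tokens"]

def reward(completion: str, gold_answer: str, cap: int) -> float:
    """Hard-format reward: well-formed, within the cap, and correct."""
    n_tokens = len(completion.split())            # crude token-count stand-in
    well_formed = "<think>" in completion and "</think>" in completion
    answer = completion.rsplit("</think>", 1)[-1].strip()
    if not well_formed or n_tokens > cap:
        return 0.0
    return 1.0 if answer == gold_answer.strip() else 0.0

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO: advantage of each sampled completion relative to its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: one group of sampled completions for a single prompt.
group = [
    "<think> area = 3 * 4 </think> 12",
    "<think> perimeter maybe? </think> 14",
]
cap = current_length_cap(global_step=500)
rs = np.array([reward(c, "12", cap) for c in group])
print(grpo_advantages(rs))   # correct completion gets the positive advantage
```

Because completions that exceed the current cap earn zero reward, early training discourages rambling chains of thought, and the cap is only loosened once shorter, correct reasoning has been reinforced.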
Numerical Highlights
The empirical results are substantial. Vision-R1-7B delivers an average improvement of roughly 6% across diverse multimodal math reasoning benchmarks. On the demanding MathVista benchmark, which requires substantial reasoning sophistication, Vision-R1-7B reaches 73.5% accuracy, only 0.4% behind OpenAI's top-performing reasoning model O1. Achieving this with far fewer parameters (7B versus the 70B+ of competing models) illustrates the potency of reasoning-oriented RL combined with a well-constructed cold-start dataset.
Implications and Future Directions
The implications of this research are multifaceted. Practically, Vision-R1 sets a precedent for harnessing RL to enhance reasoning in multimodal contexts, with potential impact on applications that demand complex decision-making and problem solving, such as scientific analysis, technical diagnostics, and educational tools. Theoretically, it invites further investigation into the reasoning processes that emerge inside such systems, potentially informing designs that move toward AGI (Artificial General Intelligence).
Looking forward, this work raises the prospect of refining RL reward mechanisms and exploring additional modalities beyond vision and text. Continuous improvements in dataset curation using techniques like Modality Bridging may unlock new capabilities, complementing rapidly evolving AI architectures. This research exemplifies a step towards more intelligent reasoning systems, offering both insights and tangible advancements within AI’s multimodal landscape.
Overall, "Vision-R1: Incentivizing Reasoning Capability in Multimodal LLMs" not only demonstrates technical finesse but also broadens the horizon for future innovations in the integration of reasoning capabilities across AI systems.