Reinforcing Video Reasoning in Multimodal LLMs: A Detailed Examination of Video-R1
The paper "Video-R1: Reinforcing Video Reasoning in MLLMs" explores how reinforcement learning (RL) can strengthen the reasoning capabilities of multimodal LLMs (MLLMs) on video data. The authors build on the R1 paradigm initiated by DeepSeek-R1, a line of research showing that RL with simple rule-based rewards can elicit advanced reasoning skills in LLMs. This paper extends those insights to MLLMs by introducing Video-R1, which systematically addresses the challenges of video reasoning, a domain that had previously been underexplored.
Two major challenges arise in adapting RL to video reasoning. First, existing RL recipes do not model time: the Group Relative Policy Optimization (GRPO) algorithm has no mechanism that rewards temporal reasoning, so a model can earn high reward while ignoring frame order. Second, high-quality datasets that capture complex video reasoning tasks are scarce. To address temporal modeling, the authors propose Temporal Group Relative Policy Optimization (T-GRPO), a contrastive training scheme that explicitly encourages temporal reasoning by granting an additional reward only when the model answers more accurately with temporally ordered frames than with shuffled ones.
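To make the contrastive mechanism concrete, the following is a minimal sketch of how such a temporal bonus could be computed; the function names, the bonus value `alpha`, and the exact comparison rule are illustrative assumptions rather than the paper's precise formulation or hyperparameters.

```python
import numpy as np

def temporal_bonus(correct_ordered, correct_shuffled, alpha=0.3):
    """Illustrative T-GRPO-style temporal bonus (hypothetical names and values).

    correct_ordered / correct_shuffled: boolean arrays marking which sampled
    responses answered correctly with temporally ordered vs. shuffled frames.
    """
    p_ordered = np.mean(correct_ordered)    # accuracy with the true frame order
    p_shuffled = np.mean(correct_shuffled)  # accuracy with shuffled frames

    # Grant the bonus to correct ordered-frame responses only when the model
    # does strictly better with the true temporal order, so order-agnostic
    # shortcuts cannot earn it.
    bonus = alpha if p_ordered > p_shuffled else 0.0
    return np.where(correct_ordered, bonus, 0.0)

def group_relative_advantages(rewards):
    """Standard GRPO step: normalize rewards within a sampled group of responses."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In the full training loop, a bonus like this would be added to the rule-based accuracy (and, in the R1 style, format) rewards of the ordered-frame responses before GRPO's group-relative normalization converts rewards into advantages.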
A notable methodological step is the use of image-based reasoning data to compensate for the scarcity of video reasoning data. The authors construct two datasets, Video-R1-COT-165k and Video-R1-260k, that together combine image and video reasoning samples: the former provides chain-of-thought annotations for the cold-start supervised fine-tuning (SFT) phase, while the latter drives the subsequent reinforcement learning phase. The image-based data supplies general reasoning patterns that transfer to the dynamic, temporally grounded setting of video.
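As a rough illustration of how such a hybrid corpus might be organized across the two training stages, here is a sketch built on a hypothetical sample schema; the class, field names, and split logic are assumptions, not the released dataset format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ReasoningSample:
    # Hypothetical schema; not the released dataset format.
    question: str
    answer: str
    frames: List[str] = field(default_factory=list)  # one path for an image, several for video frames
    chain_of_thought: str = ""                       # populated only in the CoT-annotated split

def build_training_stages(
    cot_pool: List[ReasoningSample],
    mixed_pool: List[ReasoningSample],
) -> Tuple[List[ReasoningSample], List[ReasoningSample]]:
    """Split samples into the two phases described in the paper:
    cold-start SFT on CoT-annotated data (Video-R1-COT-165k style) and
    RL on the larger image + video mixture (Video-R1-260k style)."""
    sft_stage = [s for s in cot_pool if s.chain_of_thought]
    rl_stage = list(mixed_pool)  # both static images and frame sequences
    return sft_stage, rl_stage
```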
The empirical results across multiple benchmarks underscore the effectiveness of Video-R1. In particular, Video-R1-7B reaches 35.8% accuracy on VSI-Bench, a video spatial reasoning benchmark, surpassing even proprietary models such as GPT-4o. This supports the authors' claim that their RL framework substantially improves MLLMs' ability to reason over video, where prior work had been largely confined to video perception tasks.
The introduction of mechanisms such as T-GRPO marks a meaningful evolution in video reasoning approaches, addressing a shortcoming of existing RL techniques that often let MLLMs fall back on reasoning shortcuts that ignore temporal order. By demonstrating the value of explicit temporal supervision during RL, the contribution paves the way for further research in this domain. Moreover, the observed "aha moments," in which the model exhibits reflective, self-correcting reasoning, hint at the level of comprehension now attainable with such methods.
While the paper's key contributions advance video reasoning in MLLMs, the authors also identify limitations and directions for future work: increasing the number of sampled frames to support longer-horizon temporal reasoning, refining the temporal modeling mechanism to reduce its computational overhead, and exploring adaptive response-length control to keep reasoning concise.
Overall, "Video-R1: Reinforcing Video Reasoning in MLLMs" represents a meaningful stride toward equipping MLLMs with robust reasoning capabilities. By pairing a temporally aware reinforcement learning algorithm with carefully designed training data, the work lays a foundation for sophisticated video reasoning tasks and opens rich avenues for future exploration in the field.