Overview of VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
The paper presents VideoChat-R1, a multimodal LLM (MLLM) that improves spatio-temporal perception through Reinforcement Fine-Tuning (RFT) built on Group Relative Policy Optimization (GRPO). The central goal is to strengthen video understanding, and spatio-temporal reasoning in particular, without compromising the model's general chat capabilities.
Recent developments in reinforcement learning for reasoning tasks have predominantly focused on text and image applications. While methods like GRPO have demonstrated significant success in these domains, their adaptation to video understanding remains relatively underexplored. This paper bridges that gap by applying GRPO to video MLLMs, aiming for task-specific enhancements that are data-efficient and improve performance across a range of spatio-temporal tasks.
Key Findings
The authors report substantial improvements on specialized video tasks such as temporal grounding and object tracking. Compared with its base model, Qwen2.5-VL-7B, VideoChat-R1 shows large gains across benchmarks, most notably on temporal grounding (+31.8) and object tracking (+31.2), where the gains amount to several-fold improvements over the base model. It also improves modestly on general QA benchmarks, including VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9).
Multi-task reinforcement fine-tuning stands out as a key method: a limited number of training samples is used to jointly optimize the model on several spatio-temporal perception objectives. This approach proves particularly effective at producing substantial task-specific improvements while preserving the model's general understanding capabilities.
Through detailed ablation studies, the authors compare reinforcement fine-tuning with traditional supervised fine-tuning. GRPO emerges as the preferred method for video task enhancement due to its reduced overfitting risk and better performance retention on out-of-domain tasks.
Methodology and Implications
VideoChat-R1 applies GRPO by sampling multiple candidate responses per prompt, scoring them with task-specific criteria such as format compliance, IoU, accuracy, and recall, and ranking each response against the others in its group; a KL penalty keeps the policy close to the reference model. The policy is then updated to favor responses that align more closely with the ground truth across the various video tasks.
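To make the group-relative update concrete, below is a minimal sketch, in PyTorch, of how a group of scored candidate responses might be turned into a GRPO policy update. This is not the authors' released code: the function name grpo_loss, the tensor shapes, and the coefficient values are assumptions chosen for illustration.

```python
# Minimal sketch (not the paper's released code) of a GRPO-style update:
# G candidate responses are scored, advantages are computed within the group,
# and a KL penalty against a frozen reference model regularizes the update.
import torch

def grpo_loss(policy_logps, ref_logps, rewards, clip_eps=0.2, kl_coef=0.04):
    """policy_logps, ref_logps: (G, T) per-token log-probs for G sampled responses.
    rewards: (G,) scalar rewards from the task-specific reward functions."""
    # Group-relative advantage: how much better each response is than its peers.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    adv = adv.unsqueeze(-1)                                     # broadcast over tokens

    # Importance ratio between the current policy and the (detached) sampling policy.
    ratio = torch.exp(policy_logps - policy_logps.detach())     # (G, T)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    pg_loss = -torch.min(ratio * adv, clipped * adv).mean()

    # KL estimate that keeps the policy close to the reference model,
    # which is what preserves general chat ability during fine-tuning.
    kl = (torch.exp(ref_logps - policy_logps) - (ref_logps - policy_logps) - 1.0).mean()
    return pg_loss + kl_coef * kl
```

The within-group normalization removes the need for a separate value model, while the KL term against the frozen reference model is what lets the fine-tuned model retain its general chat behavior.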
The methodology introduces several reward functions tailored to specific video understanding tasks. For instance, IoU rewards measure the overlap between predicted temporal intervals (for temporal grounding) or bounding boxes (for object tracking) and the ground truth. The paper also describes format rewards, which keep the output structure consistent throughout training, as well as recall-based rewards for further tasks.
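The simplest of these rewards are easy to illustrate. Below is a hedged sketch of an IoU reward for temporal grounding and a template-based format reward; the exact answer template (<think>...</think><answer>...</answer>), the regex check, and the function names are assumptions for illustration, not the paper's precise definitions.

```python
# Sketch of rule-based rewards in the spirit described above, assuming the model's
# answer has already been parsed into a (start, end) interval in seconds.
import re

def iou_reward(pred, gt):
    """Temporal IoU between predicted and ground-truth (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def format_reward(text):
    """1.0 if the response follows a <think>...</think><answer>...</answer>
    template (an assumed template, for illustration), else 0.0."""
    pattern = r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text) else 0.0

# Example: a prediction of [12.0, 30.0] s against a ground truth of [10.0, 25.0] s.
print(iou_reward((12.0, 30.0), (10.0, 25.0)))   # 0.65
```

In practice a per-task reward would combine such components, for example summing the IoU score with the format score, so a response earns full credit only when it is both correct and well-formed.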
The implications of this research extend beyond immediate task performance. Reinforcement fine-tuning offers a data-efficient training paradigm for video understanding, and the paper points toward large-scale, multi-task training scenarios that could accelerate progress on complex video reasoning tasks.
Future Directions
The work opens avenues for further exploration into reinforcement learning applications across multimodal video understanding. Future research could investigate larger datasets, more diverse tasks, and comprehensive training regimens to further refine models like VideoChat-R1.
Moreover, the paper raises questions about the definition and evaluation of video reasoning tasks, suggesting that more sophisticated challenges could unlock deeper insights into reasoning processes in MLLMs. As these models evolve, the potential for integrating spatio-temporal reasoning with cognitive AI approaches remains an exciting prospect for AI development.