
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning (2504.09641v1)

Published 13 Apr 2025 in cs.CV

Abstract: Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.

Summary

Towards Smaller Video Reasoning Models: An Examination of TinyLLaVA-Video-R1

The paper "TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning," presents a methodological exploration into the domain of video reasoning using smaller-scale models. The motivation behind this research is rooted in the computational cost barriers associated with large-scale multimodal models, thus emphasizing the need for efficient alternatives for academic and practical pursuits. The authors introduce TinyLLaVA-Video-R1, a small-scale model with no more than 4 billion parameters, aimed at extending reasoning capabilities within multimodal domains, particularly video datasets.

Research Focus and Experimental Approach

The paper critiques the reliance of existing work on highly reasoning-intensive datasets, such as mathematics and code, which typically require large models as the base. In contrast to this convention, the research investigates the potential of smaller models for video reasoning. To this end, TinyLLaVA-Video-R1 leverages reinforcement learning to enhance reasoning on general video question-answering datasets. The paper not only demonstrates improved reasoning and thinking ability but also identifies an emergent behavior the authors term "aha moments."

The paper details the methodology, with TinyLLaVA-Video as the foundational model. The framework employs Qwen2.5-3B as its LLM and SigLIP as its vision encoder. Reinforcement learning is performed with Group Relative Policy Optimization (GRPO), using rule-based rewards for format compliance and answer accuracy to incentivize explicit reasoning. A cold start on human-annotated data is used to stabilize training and promote adherence to the required output format. Across the experiments, the model is evaluated on video comprehension and reasoning benchmarks including MVBench, Video-MME, MLVU, and MMVU.
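To make the reward design concrete, the following is a minimal sketch of a GRPO-style rule-based reward and group-relative advantage computation. The tag format, reward weights, and group size here are illustrative assumptions, not values taken from the paper.

```python
import re
import torch

def rule_based_reward(response: str, correct_choice: str) -> float:
    """Toy rubric: reward format compliance (<think>...</think><answer>...</answer>)
    plus answer accuracy, mirroring the format/accuracy rules described above.
    The 0.5 / 1.0 weights are assumptions for illustration."""
    reward = 0.0
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S):
        reward += 0.5  # format compliance bonus (assumed weight)
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if match and match.group(1).strip() == correct_choice:
        reward += 1.0  # answer accuracy bonus (assumed weight)
    return reward

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO scores each sampled response relative to its group
    (no learned value/critic model)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: one video question, a group of 4 sampled responses
responses = [
    "<think>the clip shows ...</think><answer>B</answer>",
    "<answer>B</answer>",
    "<think>the clip shows ...</think><answer>C</answer>",
    "no tags at all",
]
rewards = torch.tensor([rule_based_reward(r, "B") for r in responses])
print(rewards)                          # tensor([1.5000, 1.0000, 0.5000, 0.0000])
print(group_relative_advantages(rewards))
```

In this scheme, responses that both follow the required format and answer correctly receive the highest group-relative advantage, which is what pushes the policy toward producing explicit reasoning traces.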

Empirical Findings

The results are compelling: TinyLLaVA-Video-R1 outperforms its supervised fine-tuning counterpart, TinyLLaVA-Video-SFT, across multiple reasoning benchmarks. Experiments further highlight the necessity of continuous length rewards and cold-start data for stable and effective training. The emergent "aha moments" exemplify introspective behavior, with the model reassessing and verifying its intermediate reasoning steps.
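As an illustration of what a continuous length reward might look like, the snippet below gives a smooth bonus that grows with the length of the reasoning trace and saturates at a cap, instead of a hard threshold. The functional form and constants are assumptions for illustration, not the formula used in the paper.

```python
def length_reward(num_tokens: int, target: int = 512, max_bonus: float = 0.5) -> float:
    # Bonus scales smoothly with response length and saturates at `target` tokens
    # (hypothetical shape and constants).
    return max_bonus * min(num_tokens, target) / target

for n in (64, 256, 512, 1024):
    print(n, round(length_reward(n), 3))
```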

Notable numerical results indicate improved video understanding, with performance surpassing several established baselines. For instance, TinyLLaVA-Video-R1 achieves meaningful accuracy gains on MVBench over models such as InternVideo2, suggesting it can rival models with larger parameter counts in video comprehension and analytical reasoning.

Implications and Future Directions

The implications of this paper are twofold: it indicates practical value for researchers constrained by limited computational resources and suggests theoretical potential in refining smaller models for complex reasoning tasks. As AI boundaries continue to expand, efficient multimodal comprehension models are integral to democratizing AI benefits across varied sectors and research domains.

Potential future developments include introducing higher-quality reasoning datasets to probe model limits further and refining learning algorithms beyond GRPO to substantively improve reasoning proficiency. Through continued innovation and experimental rigor, efforts like TinyLLaVA-Video-R1 forge pathways toward scalable models capable of nuanced video reasoning, thereby broadening access to cutting-edge AI capabilities.

By underscoring the efficacy of smaller, more adaptable frameworks within the AI sphere, TinyLLaVA-Video-R1 affirms the promise that smaller models hold for future innovation in AI and machine learning.
