Video-R1-COT-165k: CoT Video Reasoning Dataset

Updated 11 July 2025
  • Video-R1-COT-165k is a multimodal dataset with 165k question-answer pairs and detailed chain-of-thought rationales for temporally grounded video reasoning.
  • The dataset guides model training through supervised fine-tuning and specialized RL algorithms like T-GRPO, emphasizing stepwise inference over mere pattern recognition.
  • Its integration of image and video data significantly boosts temporal and spatial reasoning, setting new benchmarks in complex video understanding tasks.

Video-R1-COT-165k is a large-scale video and image reasoning dataset constructed specifically to provide chain-of-thought (CoT) supervision and to facilitate the training of multimodal LLMs (MLLMs) in temporally grounded video reasoning. Originating from the Video-R1 framework, the dataset plays a pivotal role in the supervised fine-tuning and subsequent reinforcement-learning refinement of models for complex, stepwise video understanding tasks, and marks a significant advance in leveraging reinforcement learning (RL) for temporal reasoning with multimodal data (2503.21776).

1. Motivation and Background

Video reasoning tasks for MLLMs have historically been constrained by the lack of temporally explicit modeling and high-quality datasets suited to multi-step, causal, and spatio-temporal reasoning. Prior multimodal benchmarks and datasets predominantly emphasized object recognition or action classification, often using static images or relying on shallow multiple-choice formats which allowed models to "shortcut" solutions via pattern recognition rather than true reasoning. The Video-R1 paradigm was inspired by the DeepSeek-R1 approach in language and image domains, aiming to systematically leverage rule-based reinforcement learning to elicit longer, more logically cohesive CoT reasoning in videos.

2. Dataset Construction and Structure

Video-R1-COT-165k consists of approximately 165,000 question–answer pairs, each accompanied by a detailed chain-of-thought rationale. The unique characteristics of the dataset are:

  • Multimodal Scope: It combines both image-based and video-based reasoning examples, enabling the transfer and integration of static and dynamic reasoning patterns.
  • Chain-of-Thought Supervision: Each sample includes stepwise rationales bridging question and answer, explicitly laying out the model's internal inference process, crucial for learning long-range temporal and relational dependencies.
  • Diverse Domains: The dataset is stratified across varied categories, including general video scenarios, chart reasoning, OCR, mathematical reasoning, knowledge comprehension, and spatial reasoning. This breadth ensures comprehensive skill coverage and robustness.
  • Role in Training Paradigm: Video-R1-COT-165k is specifically designed for the supervised fine-tuning (SFT) "cold start" phase: it equips models with initial chain-of-thought capabilities before refinement in subsequent reinforcement learning stages.
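
As a concrete illustration, a single record can be pictured along the following lines. This is a hedged sketch only: the field names and the <think>/<answer> tagging are illustrative assumptions about the R1-style format, not the dataset's published schema.

```python
# Hypothetical sketch of one Video-R1-COT-165k record; field names are
# illustrative assumptions, not the dataset's actual published schema.
sample = {
    "data_type": "video",            # "video" or "image" (multimodal scope)
    "source": "path/to/clip.mp4",    # media the question is grounded in
    "domain": "spatial reasoning",   # e.g. chart, OCR, math, knowledge, spatial
    "question": "After the red car stops, which object moves first?",
    "chain_of_thought": (
        "<think>Frame-by-frame: the red car brakes near the crosswalk; "
        "immediately afterwards the cyclist on the right resumes moving; "
        "the pedestrian only crosses later.</think>"
    ),
    "answer": "the cyclist",
}

# During the SFT cold start, the model is trained to emit the rationale
# before the answer, so the target text concatenates both fields.
target = sample["chain_of_thought"] + "\n<answer>" + sample["answer"] + "</answer>"
```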

3. Training and Methodologies

The primary advantage of Video-R1-COT-165k is its suitability for training under Group Relative Policy Optimization (GRPO) and its temporally enhanced variant, T-GRPO:

  • Supervised Fine-Tuning (SFT): Models are first fine-tuned on Video-R1-COT-165k to instill baseline multi-step reasoning abilities. Each training instance goes beyond predicting the final answer, guiding the network to generate intermediate reasoning steps consistent with expert-annotated CoTs.
  • Reinforcement Learning (RL) with T-GRPO: Following the SFT cold start, models are further optimized via the T-GRPO algorithm, which explicitly rewards use of the temporal dimension. The model is evaluated on both temporally ordered and randomly shuffled video frames:

$$r_t = \begin{cases} \alpha & \text{if } p > \mu \cdot \tilde{p} \\ 0 & \text{otherwise} \end{cases}$$

where $p$ and $\tilde{p}$ denote the proportions of correct answers obtained with temporally ordered and randomly shuffled frames, respectively, $\alpha$ (0.3) sets the reward strength, and $\mu$ (0.8) is a relative performance threshold. The final reward for a correct output becomes $r_i^{(\text{T-GRPO})} = r_i + r_t$, selectively incentivizing temporally sensitive solutions. The advantage function used for policy optimization is

$$A_i = \frac{r_i^{(\text{T-GRPO})} - \text{mean}\big(r^{(\text{T-GRPO})}\big)}{\text{std}\big(r^{(\text{T-GRPO})}\big)}$$

where the mean and standard deviation are computed over the group of sampled responses, and policy updates follow a clipped surrogate objective as in standard GRPO (see the code sketch after this list).

  • Integration of Image and Video Data: The inclusion of high-quality image reasoning data alongside video samples provides a scaffold for general reasoning, which, according to ablation studies, is essential for bootstrapping and preventing overfitting to temporal cues alone.
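
To make the reward scheme concrete, the following is a minimal sketch of the group-level T-GRPO computation described above. It is an illustrative reconstruction from the equations, not the Video-R1 implementation: the binary correctness reward, the helper names, and the omission of the KL regularization term that GRPO-style training typically includes are all simplifying assumptions.

```python
import numpy as np

ALPHA = 0.3  # temporal reward strength (alpha in the equation above)
MU = 0.8     # relative performance threshold (mu)

def temporal_reward(correct_ordered, correct_shuffled):
    """Group-level temporal bonus r_t.

    correct_ordered / correct_shuffled are boolean arrays over the group of
    sampled responses, evaluated with temporally ordered vs. shuffled frames.
    """
    p = np.mean(correct_ordered)         # accuracy with ordered frames
    p_tilde = np.mean(correct_shuffled)  # accuracy with shuffled frames
    return ALPHA if p > MU * p_tilde else 0.0

def t_grpo_advantages(rewards_ordered, correct_ordered, correct_shuffled):
    """Normalized advantages A_i for one group of rollouts."""
    r_t = temporal_reward(correct_ordered, correct_shuffled)
    # The temporal bonus is added only to correct (ordered-frame) outputs.
    r = rewards_ordered + r_t * correct_ordered.astype(float)
    # Mean/std are taken over the group, as in standard GRPO; the epsilon
    # guards against a zero standard deviation when all rewards are equal.
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(ratio, advantages, eps=0.2):
    """Standard PPO/GRPO-style clipped policy objective (to be maximized)."""
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Example: a group of 4 rollouts, rewarded 1.0 when the answer is correct.
correct_ordered = np.array([True, True, False, True])
correct_shuffled = np.array([False, True, False, False])
rewards = correct_ordered.astype(float)
adv = t_grpo_advantages(rewards, correct_ordered, correct_shuffled)
objective = clipped_surrogate(np.ones(4), adv)  # ratio of 1 at the first step
```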

4. Experimental Impact and Benchmarking

Video-R1-COT-165k—when combined with the subsequent larger RL dataset (Video-R1-260k) and trained using T-GRPO—produces substantial gains across state-of-the-art video reasoning benchmarks:

  • On VSI-Bench (video spatial reasoning benchmark), Video-R1-7B attains an accuracy of 35.8%, surpassing proprietary models such as GPT-4o.
  • The approach shows consistent improvements across VideoMMMU, MMVU, MVBench, TempCompass, and VideoMME, robustly outperforming prior models oriented towards perception or static answer selection.
  • Ablation studies indicate that both image data and temporal reward mechanisms are critical for optimal performance; removing either component leads to declines in benchmark metrics.
  • Increasing the input frame count from 16 to 32 yields consistently higher accuracy, underscoring the value of longer temporal context for reasoning.

5. Comparison with Related Datasets

Video-R1-COT-165k addresses limitations observed in earlier datasets:

  • In contrast to multiple-choice video QA sets or those without explicit rationales, Video-R1-COT-165k enforces open-ended, stepwise reasoning outputs, making it more suitable for evaluating and enhancing the true inference abilities of MLLMs.
  • Whereas other datasets such as VideoCoT and Video-CoT emphasize annotation efficiency or spatiotemporal coverage, Video-R1-COT-165k’s unique role lies in facilitating RL-based reasoning training, providing the necessary granularity for both SFT initialization and RL policy shaping (2407.05355, 2506.08817).
  • The dataset is also used as a baseline for comparative evaluations in studies that extend R1-style training (e.g., VideoRFT, Fact-R1) and for probing the transferability of reasoning skills from images to videos (2505.12434, 2505.16836).

6. Technical and Research Implications

The development of Video-R1-COT-165k has significant implications for the MLLM research community:

  • Data Availability: Public release of the dataset and accompanying models lowers the barrier for benchmarking temporal and stepwise reasoning, fostering reproducibility and further innovation.
  • Framework Generalization: The success of Video-R1-COT-165k under T-GRPO establishes prototype methodologies for future RL-enhanced CoT datasets—suggesting applicability to longitudinal, fine-grained, or domain-specialized reasoning tasks.
  • Curricular and Scalability Insights: Empirical results underscore the importance of initial CoT SFT (using datasets like Video-R1-COT-165k) for stable training before RL, as well as the benefit of increasing video length and input diversity.
  • Limitations: The dataset, while broad, is still subject to noise in the reasoning annotations and potential shortcutting. Future work is directed towards scaling input frames, refining temporal reward computation, and exploring dynamic response lengths to optimize reasoning depth and conciseness.

7. Future Directions

Proposed enhancements flowing from the use and evaluation of Video-R1-COT-165k include:

  • Expansion of input frame counts to enable capture of longer-range dependencies.
  • Development of more efficient or context-sensitive temporal modeling to reduce the computational demands of contrastive temporal evaluation.
  • Implementation of adaptive response-length strategies, together with further scaling of RL training steps, to refine complex reasoning trajectories.
  • Investigation into principled domain transfer schemes, leveraging the synergy of image-video reasoning data, and strengthening the process-based reward signal for even closer alignment to human-like, interpretable reasoning.

In conclusion, Video-R1-COT-165k serves as a cornerstone dataset for chain-of-thought-based video reasoning in MLLMs, underpinning the Video-R1 paradigm’s advances via supervised and reinforcement learning. It provides a comprehensive, richly annotated foundation for training and benchmarking temporally-aware, logically robust multimodal models, with broad impact on the evolution of video understanding systems and methodologies in academic and applied research (2503.21776).