
Video-R1-COT-165k Dataset Overview

Updated 12 January 2026
  • The dataset provides a large-scale, multimodal corpus with 165,000 CoT examples that guide MLLMs in developing coherent, multi-step video and image reasoning.
  • It employs an automated annotation pipeline using advanced vision–language models to generate detailed rationales for spatial, temporal, causal, and quantitative inferences.
  • Designed for supervised fine-tuning in the Video-R1 pipeline, the dataset addresses cold-start challenges in reinforcement learning and boosts performance on video reasoning benchmarks.

The Video-R1-CoT-165k dataset is a large-scale, multimodal corpus of chain-of-thought (CoT) annotated question–answer pairs, specifically constructed to advance the reasoning capabilities of multimodal LLMs (MLLMs) in both video and image domains. It was introduced as part of the Video-R1 training framework to provide high-quality supervised fine-tuning data aimed at overcoming the cold-start limitations of reinforcement learning strategies in video reasoning tasks (Feng et al., 27 Mar 2025). The dataset comprises 165,000 examples, each paired with a detailed, step-by-step rationale generated by advanced vision–language models. The corpus responds to the absence of sufficiently complex CoT datasets tailored for temporal, causal, spatial, and quantitative inference over video data.

1. Motivation and Objectives

Two principal motivations underlie Video-R1-CoT-165k. First, rule-based reinforcement learning (RL) methods such as GRPO exhibit poor cold-start behavior—without intermediate supervision, MLLMs tend to output short or trivial reasoning chains. Second, extant video understanding benchmarks are mostly oriented toward action recognition or retrieval, lacking instructional supervision for explicit, multi-step visual inference. By curating a large corpus encompassing both image-based and video-based questions with diverse reasoning demands, Video-R1-CoT-165k serves as a bootstrapping resource to endow MLLMs with spatial, temporal, causal, and mathematical reasoning capacity at scale (Feng et al., 27 Mar 2025).

2. Dataset Construction and Composition

Video-R1-CoT-165k consists of exactly 165,000 samples. Each record is based on either a static image or a short video clip, paired with a question, a correct answer, and a multi-step chain-of-thought rationale. Within the broader Video-R1-260k pool from which CoT-165k was derived, the modality and task breakdown is as follows:

Task Type               | Approximate Count | Proportion
General video reasoning | 116,000           | 44.6%
General image QA        | 15,000            | 5.8%
Chart reasoning         | 21,000            | 8.1%
OCR/text recognition    | 16,000            | 6.2%
Math reasoning          | 37,000            | 14.2%
Commonsense/expert QA   | 37,000            | 14.2%
Spatial logic           | 20,000            | 7.7%

All video entries were trimmed or sampled to a maximum of 16 frames, representing 2–8 second clips at roughly 2 fps. Image-based examples are balanced by task difficulty via stratified sampling.
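
The 16-frame cap can be reproduced with simple uniform-index sampling. The following is a minimal sketch using OpenCV, not the actual preprocessing code from the Video-R1 release; the function name and return format are illustrative.

import cv2
import numpy as np

def sample_frames(video_path: str, max_frames: int = 16) -> list:
    """Uniformly sample up to `max_frames` RGB frames from a video clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the clip, capped at max_frames.
    indices = np.linspace(0, max(total - 1, 0),
                          num=min(max_frames, max(total, 1)), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

For example, a 6-second clip at 30 fps (180 frames) yields 16 frames spaced about 0.4 s apart.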

3. Data Sources and Preprocessing

Raw videos for Video-R1-CoT-165k were sourced from open-domain video collections (notably YouTube-8M-style datasets and academic benchmarks), while image QA items were drawn from chart corpora, OCR datasets, MathQA figures, and commonsense vision resources. Clips were filtered for copyright safety. Video segments were automatically sampled and images stratified to ensure task coverage and diversity. During SFT training, video frames were processed at a resolution of 128×28×28, with 256×28×28 used for inference (Feng et al., 27 Mar 2025).
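
In the Qwen-VL convention, the 128×28×28 and 256×28×28 figures correspond to pixel budgets of 128 and 256 patches of 28×28 pixels, i.e., 100,352 and 200,704 pixels per frame. Below is a minimal sketch of configuring such a budget, assuming the Hugging Face AutoProcessor interface for Qwen2.5-VL; the model ID and keyword argument follow the public transformers API rather than the Video-R1 repository's training scripts.

from transformers import AutoProcessor

# Per-frame pixel budgets: 128 or 256 patches of 28x28 pixels each.
SFT_MAX_PIXELS = 128 * 28 * 28      # 100,352 pixels per frame (training)
INFER_MAX_PIXELS = 256 * 28 * 28    # 200,704 pixels per frame (inference)

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    max_pixels=SFT_MAX_PIXELS,      # swap in INFER_MAX_PIXELS at evaluation time
)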

4. Annotation Pipeline and Chain-of-Thought Generation

The annotation workflow leveraged Qwen2.5-VL-72B to automatically generate detailed chain-of-thought rationales for all candidate samples. Rule-based filters excluded explanations that (a) lacked explicit intermediate steps, (b) contained hallucinated facts, or (c) failed to meet a minimum length threshold ($l_{\min} = 320$ tokens). After filtering, the final corpus maintained an average CoT length of

$$\mu_L = \frac{1}{N}\sum_{i=1}^{N} \mathrm{len}(\mathrm{CoT}_i) \approx 350\ \text{tokens}, \qquad \sigma_L \approx 75$$
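
A minimal sketch of the length-based filter and the statistics above; the whitespace tokenizer and field names are stand-ins for the pipeline's actual tooling, and the reported values (mean ≈ 350, std ≈ 75) come from the text rather than from this code.

import statistics

L_MIN = 320  # minimum rationale length, in tokens

def token_len(text: str) -> int:
    # Whitespace split as a stand-in for the actual tokenizer.
    return len(text.split())

def keep(sample: dict) -> bool:
    """Length-based filter applied to each candidate CoT."""
    return token_len(sample["cot"]) >= L_MIN

def cot_stats(samples: list) -> tuple:
    """Mean and population std of CoT length over retained samples."""
    lengths = [token_len(s["cot"]) for s in samples if keep(s)]
    return statistics.mean(lengths), statistics.pstdev(lengths)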

Each sample is stored as a JSONL record:

{
  "id": "vid_01234",
  "modality": "video",
  "frames": ["frame_0001.jpg", ...],
  "question": "...",
  "choices": ["A", "B", ...],
  "answer": "A",
  "cot": "Step 1: ... Step 2: ... Step n: ... Therefore the correct answer is ..."
}
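
A minimal sketch of reading and inspecting these records; the file name matches the listing in Section 5, while the variable names are illustrative and error handling is omitted.

import json

def load_records(path: str = "Video-R1-COT-165k.jsonl"):
    """Yield one annotated sample per JSONL line."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

for rec in load_records():
    question = rec["question"]
    rationale = rec["cot"]        # step-by-step reasoning ending in the answer
    answer = rec["answer"]
    if rec["modality"] == "video":
        # Frame filenames resolve against the released video_frames/ directory.
        n_frames = len(rec["frames"])
    break  # inspect only the first record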

CoT annotations may include formulas relevant to the reasoning process (e.g., pixel displacement, elapsed time, object counting, spatial relations).
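
For instance, a rationale about object motion might combine frame indices, the sampling rate, and pixel positions; the numbers below are purely illustrative:

$$\Delta t = \frac{f_2 - f_1}{\text{fps}} = \frac{12 - 4}{2\ \text{fps}} = 4\ \text{s}, \qquad v \approx \frac{\lVert \mathbf{p}_2 - \mathbf{p}_1 \rVert}{\Delta t}$$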

5. Dataset Format, Licensing, and Access

Video-R1-CoT-165k is distributed as a single supervised fine-tuning (SFT) corpus, with all 165,000 examples used for vanilla SFT—there are no published train/validation/test splits for the corpus, though researchers may partition as needed. Data are released under the MIT License at https://github.com/tulerfeng/Video-R1. The directory structure includes:

/data/SFT/Video-R1-COT-165k.jsonl    # Main annotation file
/data/SFT/video_frames/              # Raw video frames for each sample

No reward functions are applied during SFT; such objectives are employed in the RL phase with Video-R1-260k.
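
Since no official splits are published, a held-out subset can be drawn reproducibly when needed. A minimal sketch follows; the 5% ratio and the seed are arbitrary illustrations, not recommendations from the release.

import json
import random

random.seed(0)
with open("Video-R1-COT-165k.jsonl") as f:
    records = [json.loads(line) for line in f]

random.shuffle(records)
n_val = int(0.05 * len(records))   # roughly 8,250 held-out examples
val, train = records[:n_val], records[n_val:]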

6. Applications and Benchmarking

Video-R1-CoT-165k is integral to the Video-R1 training pipeline, particularly in the SFT cold-start phase. The dataset is used to endow a base MLLM (Qwen2.5-VL-7B) with coherent, multi-step reasoning capabilities before subsequent reinforcement learning with extended datasets. Downstream evaluation of models fine-tuned on this corpus occurs on public benchmarks including VideoMMMU, VSI-Bench, MVBench, and TempCompass. Notably, Video-R1-7B trained using this dataset achieves competitive accuracy on VSI-Bench spatial reasoning tasks (Feng et al., 27 Mar 2025).
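
For the SFT cold-start stage, each record must be rendered into a prompt/target pair. The sketch below shows one plausible formatting, assuming the R1-style <think>/<answer> output convention; the exact system prompt and chat template used by Video-R1 are not reproduced here.

def to_sft_pair(rec: dict) -> tuple:
    """Build a (prompt, target) pair from one Video-R1-CoT-165k record.

    The tag convention mirrors R1-style training; the actual template in the
    Video-R1 repository may differ.
    """
    choices = "\n".join(rec.get("choices", []))
    prompt = f"{rec['question']}\n{choices}".strip()
    target = f"<think>{rec['cot']}</think>\n<answer>{rec['answer']}</answer>"
    return prompt, target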

7. Limitations and Prospective Extensions

The chain-of-thought annotations are machine-generated, with filtering and quality control only at the algorithmic level—no human-verified inter-annotator agreement metrics are reported. This suggests possible inheritance of model biases (e.g., stereotyped reasoning, preferred domains). The corpus primarily covers everyday scenarios and short clips; longer or specialized narratives (e.g., surveillance, complex sports) are underrepresented. The absence of manual deduplication or process supervision leaves room for ambiguous or trivial CoTs. A plausible implication is that future iterations may benefit from expanded human-in-the-loop review, increased domain coverage, or curriculum learning strategies to address these gaps.

8. Relation to Other Datasets and Research Directions

Distinct from contemporaneous datasets such as Vad-Reasoning (Huang et al., 26 May 2025), SEED-Bench-R1 (Chen et al., 31 Mar 2025), and Video-CoT (Zhang et al., 10 Jun 2025), which each target different facets of video understanding (anomaly detection and reasoning, next-action planning, and spatiotemporal comprehension, respectively), Video-R1-CoT-165k is positioned as the first large-scale CoT corpus intended to teach MLLMs comprehensive video and image reasoning via explicit chain-of-thought supervision. Related works have converged on similar annotation strategies (e.g., multi-phase CoT pipelines, prompt templates, stepwise inference) and evaluation metrics, but differ in domain focus, annotation granularity, and dataset scale. Future research may integrate lessons from these parallel efforts, such as hybrid auto–manual annotation pipelines and enhanced reward modeling for RL.

