Quality control is automated via Qwen2.5-VL-7B-Instruct, which verifies that the synthesized grounding and captioning content enables correct answer derivation, iteratively refining samples as needed.
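A minimal sketch of this verification loop, assuming hypothetical `checker` and `refiner` callables wrapped around Qwen2.5-VL-7B-Instruct; the refinement budget and function names are illustrative assumptions, not values from the paper:

```python
# Hypothetical quality-control loop: keep a synthesized sample only if the
# checker model can recover the ground-truth answer from the generated
# grounding + captioning content, refining a bounded number of times.
MAX_REFINE_STEPS = 3  # assumed budget, not specified in the paper

def quality_control(sample, checker, refiner):
    """sample: dict with 'question', 'answer', 'grounding', 'caption'."""
    for _ in range(MAX_REFINE_STEPS):
        predicted = checker(
            question=sample["question"],
            grounding=sample["grounding"],
            caption=sample["caption"],
        )
        if predicted.strip().lower() == sample["answer"].strip().lower():
            return sample          # grounding/caption support the answer
        sample = refiner(sample)   # regenerate weak grounding/caption content
    return None                    # discard samples that never pass the check
```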
Figure 2: Data synthesis pipeline of Video-Thinker-10K; the resulting data distribution is depicted in Figure 3.
Figure 3: The data distribution of our Video-Thinker-10K dataset.
Model Training: SFT and GRPO
Video-Thinker employs a two-stage training strategy:
- Supervised Fine-Tuning (SFT): The model is first trained to follow the structured reasoning format, learning to generate traces with explicit grounding and captioning tags.
- Group Relative Policy Optimization (GRPO): Reinforcement learning is then applied with an outcome-based reward on the final answer. Multiple candidate traces are generated per sample, and the policy is updated based on relative correctness and format adherence within each group. Advantages are normalized across the candidates in a group, and KL regularization keeps the policy close to the SFT model for stability.
This approach incentivizes the emergence of autonomous temporal navigation and segment-level reasoning, rather than rote format imitation.
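A minimal sketch of the group-relative update described above; the reward weighting and the treatment of the format term are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within the group of candidate
    traces sampled for the same video-question pair."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def trace_reward(is_correct, follows_format, w_format=0.5):
    """Outcome-only reward: final-answer correctness plus a format term for
    emitting well-formed <think>/<time>/<caption> tags. The 0.5 weight is an
    illustrative assumption."""
    return float(is_correct) + w_format * float(follows_format)

# Example: 4 candidate traces generated for one training sample
rewards = [trace_reward(c, f) for c, f in [(1, 1), (0, 1), (0, 0), (1, 0)]]
advantages = group_relative_advantages(rewards)
# The policy gradient then weights each trace's log-probabilities by its
# advantage, with a KL penalty toward the SFT policy providing stability.
```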
Experimental Results
Video-Thinker-7B achieves state-of-the-art (SOTA) performance among 7B-sized MLLMs across both in-domain and out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Notably, it outperforms Video-R1-7B and other competitive baselines by substantial margins, especially in out-of-domain generalization.
- Video-Holmes: 43.22% accuracy (+4.68% over best baseline)
- CG-Bench-Reasoning: 33.25% accuracy (+3.81%)
- VRBench: 80.69% accuracy (+11.44%)
Increasing the number of input frames during inference further improves performance, with Video-Thinker-7B maintaining superiority across all tested frame counts.
Figure 4: An example of Video-Thinker-7B's reasoning output on CG-Bench-Reasoning dataset.
Figure 5: An example of Video-Thinker-7B's reasoning output on Video-Holmes dataset.
Figure 6: An example of Video-Thinker-7B's reasoning output on VRBench dataset.
Analysis of Grounding and Captioning Capabilities
Quantitative evaluation shows that Video-Thinker-7B has markedly stronger temporal grounding and captioning abilities than both the base model (Qwen2.5-VL-7B-Instruct) and Video-R1-7B (a sketch of the grounding metric computation follows the list below):
- Grounding (mIoU): 48.22% (vs. 27.47% baseline)
- Grounding ([email protected]): 79.29% (vs. 39.52%)
- Captioning (METEOR): 15.87% (vs. 14.10%)
- Captioning (ROUGE-L): 20.11% (vs. 14.91%)
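As referenced above, here is a minimal sketch of how temporal grounding metrics of this form (mIoU and recall at an IoU threshold, here [email protected]) are typically computed; the function names are illustrative, and the paper's evaluation script may differ. The captioning metrics (METEOR, ROUGE-L) are standard text-generation scores usually taken from existing libraries.

```python
def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) segment in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts, threshold=0.3):
    """mIoU and recall at the given IoU threshold over paired segment lists."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)
    recall = sum(iou >= threshold for iou in ious) / len(ious)
    return miou, recall

# Example with two predicted segments
miou, r_at_03 = grounding_metrics(
    preds=[(12.0, 20.0), (5.0, 9.0)],
    gts=[(10.0, 18.0), (30.0, 40.0)],
    threshold=0.3,
)  # miou = 0.3, [email protected] = 0.5
```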
Oracle experiments confirm that providing accurate grounding and captioning annotations to baseline models yields significant performance gains, underscoring the importance of these capabilities for video reasoning.
Figure 7: An example of Video-Thinker-7B's reasoning output on CG-Bench dataset.
Figure 8: An example demonstrating Video-R1-7B's inability to follow instructions for generating temporal grounding content within <time></time> tags.
Self-Reflective Reasoning and "Aha Moments"
Video-Thinker-7B demonstrates metacognitive behaviors, periodically revisiting and refining its initial interpretations of grounding and captioning tasks within the reasoning trace. This self-corrective process, analogous to "aha moments," indicates the model's capacity for dynamic internal feedback and error correction, rather than static pattern matching.
Figure 9: An example of Video-Thinker-7B's reasoning output on Video-Holmes dataset.
Figure 10: An example of Video-Thinker-7B's reasoning output on VRBench dataset.
Figure 11: An example of Video-Thinker-7B's reasoning output on CG-Bench dataset.
Figure 12: An example of Video-Thinker-7B's reasoning output on CG-Bench dataset.
Figure 13: An example of Video-Thinker-7B's reasoning output on Video-Holmes dataset.
Implementation Considerations
- Resource Requirements: Training starts from Qwen2.5-VL-7B-Instruct, with the SFT and GRPO stages requiring moderate GPU resources. Videos are subsampled to 16 frames with a per-frame pixel budget of 128×28×28 for efficiency.
- Prompt Engineering: Structured prompts with explicit <time>, <caption>, and <think> tags are essential for both training and evaluation; an illustrative template is sketched after this list.
- Scaling: Performance improves with more input frames; the best GRPO learning rate found was 5e-6, and overfitting is observed beyond 2,500 GRPO steps.
- Limitations: The current framework is restricted to grounding and captioning; extension to additional modalities (e.g., audio) or larger model sizes is a natural next step.
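The following is an illustrative prompt template matching the tag structure described above; the exact wording and the final-answer format are assumptions, and only the <think>/<time>/<caption> tags come from the paper.

```python
# Illustrative prompt template; the surrounding wording is an assumption,
# only the <think>/<time>/<caption> tag structure is taken from the paper.
PROMPT_TEMPLATE = """You are given {num_frames} frames sampled from a video.
Answer the question by reasoning step by step.

Question: {question}
Options: {options}

Format your response as:
<think>
  ... reasoning that includes
  <time>start_second, end_second</time> for the relevant segment and
  <caption>a short description of that segment</caption> ...
</think>
Then state the final answer."""

def build_prompt(question, options, num_frames=16):
    """Fill the template for one multiple-choice video question."""
    return PROMPT_TEMPLATE.format(
        num_frames=num_frames,
        question=question,
        options=" ".join(options),
    )
```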
Implications and Future Directions
Video-Thinker demonstrates that intrinsic grounding and captioning capabilities, when integrated into the reasoning process and trained via reinforcement learning, are critical for robust video understanding in MLLMs. The framework achieves SOTA results with significantly less training data (10K samples vs. 160K in prior work), highlighting the efficiency of structured supervision and outcome-based RL.
Theoretically, this work advances the paradigm of multimodal reasoning by treating temporal localization and segment-level comprehension as first-class operations within CoT traces. Practically, it enables the deployment of MLLMs for complex video analysis tasks without reliance on external tools or handcrafted prompts.
Future research should explore scaling to larger models, incorporating additional modalities, and developing more sophisticated intrinsic capabilities (e.g., event detection, causal inference). The integration of self-reflective reasoning mechanisms may further enhance model robustness and interpretability.
Conclusion
Video-Thinker establishes a new standard for structured video reasoning in MLLMs by intrinsically integrating grounding and captioning within the chain-of-thought process, trained end-to-end via reinforcement learning. The approach yields strong empirical gains in both generalization and temporal manipulation capabilities, with clear implications for the design of next-generation multimodal reasoning systems.