Quality control is automated via Qwen2.5-VL-7B-Instruct, which verifies that the synthesized grounding and captioning content enables correct answer derivation, iteratively refining samples as needed.
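A minimal sketch of this verification loop, assuming hypothetical `checker` and `refiner` callables wrapped around Qwen2.5-VL-7B-Instruct; the refinement budget and function names are illustrative assumptions, not values from the paper:

```python
# Hypothetical quality-control loop: keep a synthesized sample only if the
# checker model can recover the ground-truth answer from the generated
# grounding + captioning content, refining a bounded number of times.
MAX_REFINE_STEPS = 3  # assumed budget, not specified in the paper

def quality_control(sample, checker, refiner):
    """sample: dict with 'question', 'answer', 'grounding', 'caption'."""
    for _ in range(MAX_REFINE_STEPS):
        predicted = checker(
            question=sample["question"],
            grounding=sample["grounding"],
            caption=sample["caption"],
        )
        if predicted.strip().lower() == sample["answer"].strip().lower():
            return sample          # grounding/caption support the answer
        sample = refiner(sample)   # regenerate weak grounding/caption content
    return None                    # discard samples that never pass the check
```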
Figure 2: Data synthesis pipeline of Video-Thinker-10K; the resulting data distribution is depicted in Figure 3.
Figure 3: The data distribution of our Video-Thinker-10K dataset.
Model Training: SFT and GRPO
Video-Thinker employs a two-stage training strategy:
- Supervised Fine-Tuning (SFT): The model is first trained to follow the structured reasoning format, learning to generate traces with explicit grounding and captioning tags.
- Group Relative Policy Optimization (GRPO): Reinforcement learning is then applied with an outcome-based reward on the final answer. Multiple candidate traces are generated per sample, and the policy is updated based on relative correctness and format adherence within each group. Advantages are normalized across the candidates in a group, and KL regularization keeps the policy close to the SFT model for stability.
This approach incentivizes the emergence of autonomous temporal navigation and segment-level reasoning, rather than rote format imitation.
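A minimal sketch of the group-relative update described above; the reward weighting and the treatment of the format term are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within the group of candidate
    traces sampled for the same video-question pair."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def trace_reward(is_correct, follows_format, w_format=0.5):
    """Outcome-only reward: final-answer correctness plus a format term for
    emitting well-formed <think>/<time>/<caption> tags. The 0.5 weight is an
    illustrative assumption."""
    return float(is_correct) + w_format * float(follows_format)

# Example: 4 candidate traces generated for one training sample
rewards = [trace_reward(c, f) for c, f in [(1, 1), (0, 1), (0, 0), (1, 0)]]
advantages = group_relative_advantages(rewards)
# The policy gradient then weights each trace's log-probabilities by its
# advantage, with a KL penalty toward the SFT policy providing stability.
```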
Experimental Results
Video-Thinker-7B achieves state-of-the-art (SOTA) performance among 7B-sized MLLMs across both in-domain and out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Notably, it outperforms Video-R1-7B and other competitive baselines by substantial margins, especially in out-of-domain generalization.
- Video-Holmes: 43.22% accuracy (+4.68% over best baseline)
- CG-Bench-Reasoning: 33.25% accuracy (+3.81%)
- VRBench: 80.69% accuracy (+11.44%)
Increasing the number of input frames during inference further improves performance, with Video-Thinker-7B maintaining superiority across all tested frame counts.
Figure 4: An example of Video-Thinker-7B's reasoning output on CG-Bench-Reasoning dataset.
Figure 5: An example of Video-Thinker-7B's reasoning output on Video-Holmes dataset.
Figure 6: An example of Video-Thinker-7B's reasoning output on VRBench dataset.
Analysis of Grounding and Captioning Capabilities
Quantitative evaluation shows that Video-Thinker-7B has markedly stronger temporal grounding and captioning abilities than both the base model (Qwen2.5-VL-7B-Instruct) and Video-R1-7B (a sketch of the grounding metric computation follows the list below):
- Grounding (mIoU): 48.22% (vs. 27.47% baseline)
- Grounding ([email protected]): 79.29% (vs. 39.52%)
- Captioning (METEOR): 15.87% (vs. 14.10%)
- Captioning (ROUGE-L): 20.11% (vs. 14.91%)
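As referenced above, here is a minimal sketch of how temporal grounding metrics of this form (mIoU and recall at an IoU threshold, here [email protected]) are typically computed; the function names are illustrative, and the paper's evaluation script may differ. The captioning metrics (METEOR, ROUGE-L) are standard text-generation scores usually taken from existing libraries.

```python
def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) segment in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts, threshold=0.3):
    """mIoU and recall at the given IoU threshold over paired segment lists."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)
    recall = sum(iou >= threshold for iou in ious) / len(ious)
    return miou, recall

# Example with two predicted segments
miou, r_at_03 = grounding_metrics(
    preds=[(12.0, 20.0), (5.0, 9.0)],
    gts=[(10.0, 18.0), (30.0, 40.0)],
    threshold=0.3,
)  # miou = 0.3, [email protected] = 0.5
```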
Oracle experiments confirm that providing accurate grounding and captioning annotations to baseline models yields significant performance gains, underscoring the importance of these capabilities for video reasoning.
Figure 7: An example of Video-Thinker-7B's reasoning output on CG-Bench dataset.
Figure 8: An example demonstrating Video-R1-7B's inability to follow instructions for generating temporal grounding content within <time></time> tags.
Self-Reflective Reasoning and "Aha Moments"
Video-Thinker-7B demonstrates metacognitive behaviors, periodically revisiting and refining its initial interpretations of grounding and captioning tasks within the reasoning trace. This self-corrective process, analogous to "aha moments," indicates the model's capacity for dynamic internal feedback and error correction, rather than static pattern matching.
Figure 9: An example of Video-Thinker-7B's reasoning output on Video-Holmes dataset.
Figure 10: An example of Video-Thinker-7B's reasoning output on VRBench dataset.
Figure 11: An example of Video-Thinker-7B's reasoning output on CG-Bench dataset.
Figure 12: An example of Video-Thinker-7B's reasoning output on CG-Bench dataset.
Figure 13: An example of Video-Thinker-7B's reasoning output on Video-Holmes dataset.
Implementation Considerations
- Resource Requirements: Training starts from Qwen2.5-VL-7B-Instruct, with the SFT and GRPO stages requiring moderate GPU resources. Videos are subsampled to 16 frames with a per-frame pixel budget of 128×28×28 for efficiency.
- Prompt Engineering: Structured prompts with explicit <time>, <caption>, and <think> tags are essential for both training and evaluation; an illustrative template is sketched after this list.
- Scaling: Performance improves with more input frames; the best GRPO learning rate found was 5e-6, and overfitting is observed beyond 2,500 GRPO steps.
- Limitations: The current framework is restricted to grounding and captioning; extension to additional modalities (e.g., audio) or larger model sizes is a natural next step.
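The following is an illustrative prompt template matching the tag structure described above; the exact wording and the final-answer format are assumptions, and only the <think>/<time>/<caption> tags come from the paper.

```python
# Illustrative prompt template; the surrounding wording is an assumption,
# only the <think>/<time>/<caption> tag structure is taken from the paper.
PROMPT_TEMPLATE = """You are given {num_frames} frames sampled from a video.
Answer the question by reasoning step by step.

Question: {question}
Options: {options}

Format your response as:
<think>
  ... reasoning that includes
  <time>start_second, end_second</time> for the relevant segment and
  <caption>a short description of that segment</caption> ...
</think>
Then state the final answer."""

def build_prompt(question, options, num_frames=16):
    """Fill the template for one multiple-choice video question."""
    return PROMPT_TEMPLATE.format(
        num_frames=num_frames,
        question=question,
        options=" ".join(options),
    )
```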
Implications and Future Directions
Video-Thinker demonstrates that intrinsic grounding and captioning capabilities, when integrated into the reasoning process and trained via reinforcement learning, are critical for robust video understanding in MLLMs. The framework achieves SOTA results with significantly less training data (10K samples vs. 160K in prior work), highlighting the efficiency of structured supervision and outcome-based RL.
Theoretically, this work advances the paradigm of multimodal reasoning by treating temporal localization and segment-level comprehension as first-class operations within CoT traces. Practically, it enables the deployment of MLLMs for complex video analysis tasks without reliance on external tools or handcrafted prompts.
Future research should explore scaling to larger models, incorporating additional modalities, and developing more sophisticated intrinsic capabilities (e.g., event detection, causal inference). The integration of self-reflective reasoning mechanisms may further enhance model robustness and interpretability.
Conclusion
Video-Thinker establishes a new standard for structured video reasoning in MLLMs by intrinsically integrating grounding and captioning within the chain-of-thought process, trained end-to-end via reinforcement learning. The approach yields strong empirical gains in both generalization and temporal manipulation capabilities, with clear implications for the design of next-generation multimodal reasoning systems.