- The paper introduces a two-stage chain-of-thought framework that explicitly separates temporal reasoning from spatial segmentation to improve control and interpretability.
- The paper employs reinforcement learning with agentic keyframe selection and multi-stage reward optimization to boost segmentation performance over state-of-the-art models.
- The paper demonstrates that modular MLLM architectures using explicit reasoning steps yield enhanced accuracy and interpretability in complex, multi-object video scenes.
Reinforced Chain-of-Thought For Video Reasoning and Segmentation: RCoT-Seg
Introduction
The RCoT-Seg framework ("RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation" (2605.07334)) advances Video Reasoning Segmentation (VRS) by introducing an explicit two-stage chain-of-thought (CoT) pipeline. VRS requires temporally and contextually grounded mask prediction in videos given natural language queries, necessitating robust temporal reasoning and high-fidelity mask localization in complex, multi-object scenes. Recent approaches have coupled sparse frame sampling and monolithic mask prediction, leading to poor robustness under ambiguous or implicit temporal instructions and failing to utilize chain-of-thought reasoning as a verifiable control variable. RCoT-Seg addresses these limitations with both architectural and optimization advancements.
Methodology
Pipeline Factorization: Temporal Reasoning and Spatial Grounding
The core innovation of RCoT-Seg is the explicit decomposition of VRS into Temporal Video Reasoning (TVR) and Keyframe Target Perception (KTP):
- TVR: The model generates a detailed video description and selects a candidate keyframe using the underlying MLLM (Qwen2.5-VL-3B), conditioned on both the video and the textual instruction. Crucially, keyframe selection is not a one-shot process: an Agentic Keyframe Selection (AKS) module evaluates the selected frame and, if necessary, triggers reselection through a structured self-evaluation loop supervised by custom rewards.
- KTP: Given the (re)selected keyframe and video-level semantic scaffold from TVR, KTP predicts fine-grained bounding boxes and cues, which then prompt a frozen SAM2 module for high-resolution mask prediction and frame propagation.
This factorization explicitly separates temporal and spatial inference, making the model's intermediate reasoning steps actionable, interpretable, and correctable.
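The factorization can be pictured as a thin orchestration layer around three components. The following is a minimal sketch, not the authors' implementation: `TVROutput`, `run_two_stage`, and the three callables `tvr`, `ktp`, and `segmenter` are hypothetical names introduced here for illustration, standing in for the MLLM reasoning stage, the MLLM grounding stage, and the frozen SAM2 module respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical box format


@dataclass
class TVROutput:
    """Intermediate state produced by the temporal reasoning stage (hypothetical)."""
    description: str     # video-level description used as a semantic scaffold
    keyframe_index: int  # index of the keyframe handed to the grounding stage


def run_two_stage(
    frames: Sequence[object],
    query: str,
    tvr: Callable[[Sequence[object], str], TVROutput],
    ktp: Callable[[object, str, str], List[Box]],
    segmenter: Callable[[Sequence[object], int, List[Box]], List[object]],
) -> Tuple[TVROutput, List[Box], List[object]]:
    """Run temporal reasoning, then spatial grounding, then mask prediction."""
    # Stage 1 (TVR): describe the video and pick a keyframe.
    tvr_out = tvr(frames, query)
    if not 0 <= tvr_out.keyframe_index < len(frames):
        raise IndexError("keyframe index outside the video")

    # Stage 2 (KTP): predict boxes on the keyframe, conditioned on the description.
    boxes = ktp(frames[tvr_out.keyframe_index], tvr_out.description, query)

    # Box prompts go to the segmenter, which also propagates masks across frames.
    masks = segmenter(frames, tvr_out.keyframe_index, boxes)

    # Returning the intermediate state keeps every step inspectable.
    return tvr_out, boxes, masks
```

Because the intermediate `TVROutput` and boxes are returned alongside the masks, a caller can inspect or override either stage independently, which is the property the factorization is meant to provide.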
Agentic Keyframe Selection and Chain-of-Thought Self-Evaluation
AKS transforms keyframe selection into a verifiable Markovian decision process. It is trained to explicitly decide whether the currently selected frame meets the semantic constraints of the query, and it can autonomously resample (up to a fixed maximum number of rounds) when the evidence is deemed insufficient. This replaces the static sampling of earlier approaches with an adaptive, self-correcting control mechanism, which is particularly effective under occlusions, multi-object ambiguities, or implicit temporal queries.
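The bounded evaluate-and-resample behaviour described above can be sketched as a simple loop. This is an illustrative sketch only: `agentic_keyframe_selection`, `propose`, `is_sufficient`, and the default `max_rounds=3` are assumptions made here, and in the actual system both the proposal and the sufficiency judgement would come from the trained MLLM rather than from plain callables.

```python
from typing import Callable, List, Tuple


def agentic_keyframe_selection(
    num_frames: int,
    propose: Callable[[List[int]], int],
    is_sufficient: Callable[[int], bool],
    max_rounds: int = 3,  # hypothetical cap; the source only says "a fixed maximum"
) -> Tuple[int, List[int]]:
    """Return the chosen frame index and the list of all indices tried."""
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    tried: List[int] = []
    for _ in range(max_rounds):
        # Propose a candidate, given the frames already rejected.
        candidate = propose(tried)
        if not 0 <= candidate < num_frames:
            raise ValueError("proposed frame index outside the video")
        tried.append(candidate)

        # Self-evaluation: stop as soon as the frame satisfies the query.
        if is_sufficient(candidate):
            break

    # If the budget is exhausted, fall back to the last candidate.
    return tried[-1], tried
```

For example, with `propose=lambda tried: 2 * len(tried)` and a check that accepts only frame 4, the loop tries frames 0, 2, and 4 and stops at the third round.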
Multi-Stage Reinforcement Learning: CoT-SFT and GRPO
The training pipeline includes:
- CoT-SFT pretraining: The model is initialized on a curated 28k-sample hybrid Chain-of-Thought dataset to jointly supervise AKS and KTP task outputs in structured formats, leveraging reasoning traces generated with Qwen2.5-VL-7B.
- GRPO-based RL: After SFT, the policy is further optimized with Group Relative Policy Optimization (GRPO), using verifiable, highly specific rewards (including a Hungarian-matching-based reward for KTP that handles multi-object matching robustly).
This dual-stage protocol ensures both stable reasoning emergence (via SFT) and robust reward-driven refinement targeted at intermediate control variables, not just end-task accuracy.
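The two RL ingredients named above can be illustrated with a short sketch. It rests on several assumptions that are not taken from the paper: the reward is taken to be the mean IoU under the best one-to-one box assignment (normalised by the larger set so that missing or extra boxes are penalised), the advantage uses one common GRPO form (reward minus group mean, divided by group standard deviation), and the optimal assignment is found by brute-force search over permutations as a stand-in for the Hungarian algorithm, which reaches the same optimum more efficiently. All function names are hypothetical.

```python
from itertools import permutations
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matching_reward(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Assumed reward: best one-to-one matched IoU sum, divided by the larger set size."""
    if not pred or not gt:
        return 0.0
    small, large = (pred, gt) if len(pred) <= len(gt) else (gt, pred)
    best = 0.0
    # Brute-force stand-in for Hungarian matching; only practical for a few boxes.
    for perm in permutations(range(len(large)), len(small)):
        total = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The matching step makes the reward invariant to the order in which boxes are emitted, while the group normalisation means a response is only reinforced when it scores better than its siblings sampled for the same prompt.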
Empirical Results
RCoT-Seg consistently yields superior segmentation performance on standard benchmarks, surpassing recent SOTA models, both those trained with RL-based refinement and those trained without it:
- On ReVOS, RCoT-Seg-3B outperforms VRS-HQ-13B by +1.2 J&F; on ReasonVOS, it outperforms GLUS-7B by +8.3 J&F. The gain over the GRPO-based Veason-R1 is +1.3 J&F (ReVOS) and +3.0 J&F (ReasonVOS).
- On the more challenging multi-object MeViS set, RCoT-Seg achieves +2.4 J&F over GLUS-7B, and +2.5 J&F over Veason-R1-3B.
- Scaling up the backbone (Qwen2.5-VL-7B) yields further gains of 2–4 J&F points.
- Ablation studies confirm that both Agentic Keyframe Selection and Explicit CoT reasoning contribute to the performance gains; replacing agentic selection with one-pass sampling leads to statistically significant drops in aggregate and per-object mask quality.
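For readers unfamiliar with the metric used in the comparisons above: in the standard video object segmentation convention, J is the region similarity (mask IoU), F is the contour accuracy (a boundary F-measure), and J&F is their mean. The sketch below computes J on small binary masks and averages it with a given F value; the boundary F-measure is deliberately not implemented, and the function names and the convention of scoring two empty masks as 1.0 are assumptions of this sketch, not details from the paper.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection over union between a predicted and a ground-truth mask."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0
```

Under this definition, a reported gain of one J&F point corresponds to a one-point increase in the average of the two scores, expressed as percentages.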
The AKS module also yields robust improvements in keyframe quality (as measured by area ratio and localization precision) with minimal rounds of resampling, indicating high data efficiency.
Efficiency and Architectural Analysis
Timing analysis reveals that, while AKS introduces an ~81% overhead if full video encoding is re-performed for each reevaluation, a hybrid sampling scheme reduces this to 40% with negligible accuracy loss. Segmentation, not keyframe search, dominates total inference time.
Unified multi-task training with a single model for both reasoning (AKS) and grounding (KTP) outperforms split architectures. Explicit Chain-of-Thought guidance outperforms "Answer-Only" training by up to +3 J&F, indicating that structured reasoning is not merely a formatting tool but a critical intermediate control interface for MLLM-based VRS.
Limitations
RCoT-Seg, as implemented, is not equipped to handle zero-target queries (queries for absent objects), owing to the lack of negative samples in its training data. This produces failure cases where the reasoning trace may be correct but the downstream mask is produced erroneously. Additionally, while the AKS loop is robust and efficient, repeated high-resolution encoding can remain a bottleneck for large-scale deployment if not mitigated by sampling.
Theoretical and Practical Implications
RCoT-Seg demonstrates that factorizing VRS into explicit, verifiable CoT-driven intermediate states provides substantial upgrades in both reliability and accuracy over token-centric or direct SFT-based approaches. Critically, it shows that reinforcement learning can be effectively applied to control not just answers, but inference process structure and decision staging in vision-LLMs. This is a significant theoretical lesson for both interpretable chaining and compositional inference in multimodal tasks.
From a practical standpoint, RCoT-Seg presents a blueprint for modular MLLM architectures in video reasoning and high-resolution segmentation under weak or implicit supervision, applicable to surveillance, robotics, and human-computer interaction tasks.
Conclusion
RCoT-Seg introduces a unified, reinforcement-enhanced chain-of-thought model for video reasoning segmentation, realizing robust, interpretable, and high-performance scene understanding through explicit architectural decomposition and agentic intermediate-state supervision. It sets new state-of-the-art results on standard VRS and RVOS benchmarks, providing strong empirical evidence and a methodological paradigm for structured step-wise inference in future AI systems (2605.07334).