VideoSEG-O3: Multi-Turn RL for RVOS
- The paper introduces an iterative chain-of-thought reinforcement learning approach that progressively refines segmentation masks with pixel-level accuracy.
- It models RVOS as a multi-turn Markov Decision Process, enhancing segmentation through a novel reward formulation and SEG-aware logit calibration.
- Experimental results reveal significant performance gains in temporal localization and spatial detail over baseline methods, validating the framework's design.
VideoSEG-O3 is a multi-turn reinforcement learning framework developed for Reasoning Video Object Segmentation (RVOS), a task that integrates temporal, spatial, and linguistic reasoning for granular, pixel-level localization in videos. Unlike previous segmentation architectures that rely on a single fixed pass over input frames, VideoSEG-O3 employs an explicit, iterative “coarse-to-fine” chain-of-thought (CoT). Each step successively narrows the temporal and spatial focus, guided by a learned policy that actively acquires and fuses visual evidence, and is optimized via a novel reward formulation that directly incorporates segmentation mask quality.
1. Multi-Turn Markov Decision Process and RL Objective
VideoSEG-O3 models RVOS as a Markov Decision Process (MDP), with each state consisting of the evolving set of visual observations, the language query, and the accumulated history encoding previous actions and reasoning traces. The action space alternates between:
- <select>[t_start, t_end, k]: specifies a new temporal window and a corresponding high-resolution keyframe, adding further evidence to the observation set.
- <answer>[SEG]: triggers the final segmentation mask output via the hidden state associated with the [SEG] token.
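As a minimal sketch, the two action types above could be parsed from the policy's text output as follows. The exact serialization, the units of the temporal window, and the names `parse_action`, `SelectAction`, and `AnswerAction` are illustrative assumptions, not details taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SelectAction:
    """Hypothetical container for a <select> action."""
    t_start: float  # start of the temporal window (units assumed)
    t_end: float    # end of the temporal window
    keyframe: int   # index of the high-resolution keyframe


@dataclass(frozen=True)
class AnswerAction:
    """Hypothetical marker for the terminal <answer>[SEG] action."""


# Assumed textual layout: <select>[t_start, t_end, k] and <answer>[SEG].
_SELECT_RE = re.compile(r"<select>\s*\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*(\d+)\s*\]")
_ANSWER_RE = re.compile(r"<answer>\s*\[SEG\]")


def parse_action(text: str) -> Union[SelectAction, AnswerAction]:
    """Return the first <select> action in `text`, else an <answer> action."""
    match = _SELECT_RE.search(text)
    if match:
        t_start = float(match.group(1))
        t_end = float(match.group(2))
        keyframe = int(match.group(3))
        if t_start > t_end:
            raise ValueError("t_start must not exceed t_end")
        return SelectAction(t_start, t_end, keyframe)
    if _ANSWER_RE.search(text):
        return AnswerAction()
    raise ValueError("no valid action found in policy output")
```

Rejecting malformed output with an exception mirrors the role of a format reward: an unparsable turn can be treated as a format failure rather than silently ignored.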
Transitions update the state with new visual content or terminate the episode. Training uses a composite episodic reward that combines four terms, which respectively reward format/correctness, temporal localization, segmentation mask accuracy, and progressive mask improvement.
Optimization employs a calibrated Group Relative Policy Optimization (GRPO) objective augmented by auxiliary cross-entropy losses on sampled frame and keyframe segmentations.
This approach supplies dense gradient signals both through policy steps and direct mask supervision (Dai et al., 5 Jun 2026).
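The following is a hedged sketch of how a four-term episodic reward and the group-relative advantage used by GRPO-style methods might be computed. The equal weights are an assumption, since the combination rule is not reproduced here; the function names are hypothetical; the paper's calibration and auxiliary cross-entropy losses are not modeled.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Sequence

# Assumed equal weighting of the four reward terms (not stated in the source).
DEFAULT_WEIGHTS = {"format": 1.0, "temporal": 1.0, "mask": 1.0, "progress": 1.0}


def composite_reward(terms: Dict[str, float],
                     weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of format, temporal, mask, and progressive-improvement terms."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    return sum(weights[name] * terms[name] for name in weights)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: three rollouts sampled for the same video-query pair.
group = [
    composite_reward({"format": 1.0, "temporal": 0.2, "mask": 0.3, "progress": 0.0}),
    composite_reward({"format": 1.0, "temporal": 0.6, "mask": 0.5, "progress": 0.1}),
    composite_reward({"format": 1.0, "temporal": 0.9, "mask": 0.8, "progress": 0.2}),
]
advantages = group_relative_advantages(group)
```

Because advantages are computed relative to the group, rollouts are ranked against each other rather than against a learned value function, which is the defining trait of GRPO-style training.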
2. Hierarchical Temporal-Spatial Chain-of-Thought Reasoning
The system enforces a multi-turn, “coarse-to-fine” reasoning protocol:
- Initialization: the initial observation set comprises globally sampled low-resolution frames and sparse spatial frames.
- At each step, the policy conditions on the current state to select a new temporal interval and keyframe.
- Selected frames and reasoning traces are incrementally accumulated.
- Once sufficient evidence is detected, or a termination criterion is reached, <answer>[SEG] outputs the final segmentation.
Each turn decomposes the reasoning via a Decoupled Thinking Trace comprising: (i) temporal understanding on low-res context, (ii) spatial analysis of high-res keyframes, and (iii) linguistic fusion for action selection or answer emission. Every turn must follow this enforced output schema, which enables the policy network to explicitly structure temporal-spatial exploration and leverage intermediate reasoning supervision.
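The coarse-to-fine protocol described in this section can be sketched as a simple rollout loop. This is an illustration under stated assumptions: frames are represented by placeholder strings, `max_turns` stands in for the unspecified termination criterion, and `State`, `rollout`, and `toy_policy` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# An action is ("select", t_start, t_end, keyframe) or ("answer",).
Action = Tuple


@dataclass
class State:
    query: str
    observations: List[str] = field(default_factory=list)
    history: List[Action] = field(default_factory=list)


def rollout(policy: Callable[[State], Action], query: str,
            max_turns: int = 4) -> State:
    """Run one multi-turn episode; max_turns is an assumed termination criterion."""
    # Initialization: placeholders for global low-res frames and sparse spatial frames.
    state = State(query=query, observations=["global_lowres", "sparse_spatial"])
    for _ in range(max_turns):
        action = policy(state)
        state.history.append(action)
        if action[0] == "answer":
            return state  # the policy chose to emit the final segmentation
        _, t_start, t_end, keyframe = action
        # Accumulate the newly selected clip and its high-res keyframe.
        state.observations.append(f"clip[{t_start}:{t_end}]")
        state.observations.append(f"keyframe[{keyframe}]")
    # Turn budget exhausted: force the terminal answer action.
    state.history.append(("answer",))
    return state


def toy_policy(state: State) -> Action:
    """Stub policy: narrow the window once, then answer."""
    if not state.history:
        return ("select", 10, 20, 15)
    return ("answer",)


final_state = rollout(toy_policy, "the dog that jumps over the fence")
```

In a real system the policy would be the multimodal model itself, and the terminal action would hand the [SEG] hidden state to a mask decoder; both are outside the scope of this sketch.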
3. SEG-Aware Logit Calibration for Mask-Sensitive RL
Standard text-based policy optimization gives no direct feedback on segmentation quality. VideoSEG-O3 introduces SEG-aware logit calibration in the policy head, using a mask-confidence term that averages the pixel-wise likelihoods of the segmentation mask, computed from the decoder's logits. This direct connection between token-level decisions and pixel-level segmentation fidelity ensures RL gradients reflect true mask accuracy, improving optimization stability and downstream performance (Dai et al., 5 Jun 2026).
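To make the idea concrete, here is a minimal sketch of a mask-confidence term and one possible way to fold it into the policy's log-probability. Both choices are assumptions: the confidence is taken as the mean likelihood of each pixel's predicted label, and the calibration simply adds its logarithm to the [SEG] token log-probability. The paper's actual formula is not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mask_confidence(mask_logits: Sequence[Sequence[float]]) -> float:
    """Mean per-pixel likelihood of the predicted label (assumed definition)."""
    probs = [sigmoid(z) for row in mask_logits for z in row]
    if not probs:
        raise ValueError("mask_logits must contain at least one pixel")
    # A pixel predicted as foreground has likelihood p, as background 1 - p.
    return sum(max(p, 1.0 - p) for p in probs) / len(probs)


def calibrated_seg_logprob(seg_token_logprob: float,
                           mask_logits: Sequence[Sequence[float]]) -> float:
    """Assumed calibration: add log mask confidence to the [SEG] log-probability."""
    return seg_token_logprob + math.log(mask_confidence(mask_logits))


confident = [[10.0, -10.0], [10.0, -10.0]]  # sharp decoder logits
uncertain = [[0.0, 0.0], [0.0, 0.0]]        # maximally ambiguous logits
```

Under this assumed form, an ambiguous mask lowers the effective log-probability of emitting [SEG], so the policy gradient becomes sensitive to mask quality rather than to text tokens alone.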
4. Decoupled Thinking Trace Architecture
The architecture decomposes the cognitive trace along three axes:
- Stage A: Temporal Understanding – Operates on up to 20 low-resolution frames (28×28), generating coarse temporal candidates.
- Stage B: Spatial Detail Capturing – Consumes moderate-resolution spatial frames (128×28×28) and a high-res keyframe (512×28×28) for fine-grained spatial evidence extraction.
- Stage C: Expression Parsing – Fuses temporal and spatial representations with the query via cross-attention, producing the next action or answer.
This division enables granular supervision, traceable reasoning, and stable long-horizon policy optimization in high-dimensional video RL settings.
5. VTS-CoT Dataset for Cold-Start Chain-of-Thought
A 6,000-sample “VTS-CoT” dataset provides cold-start Chain-of-Thought trajectories. Construction involves:
- Video curation from ReVOS, MeViS, Long-RVOS.
- Automated temporal labeling, mask-quality validation, and top-interval selection via LLM prompts.
- Multi-turn CoT synthesis producing 2–4 turn chains per example, annotated via strict global-to-verification prompt protocols.
- Annotation format: ordered JSON chain of steps with temporal and spatial selections per step.
VTS-CoT achieves diversity across video lengths, motion dynamics, and linguistic query types, seeding both SFT and RL stages with structured multi-hop reasoning signals (Dai et al., 5 Jun 2026).
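As an illustration of the annotation format described above, the sketch below shows what an ordered JSON chain might look like, together with a basic validity check. The field names (`query`, `chain`, `step`, `temporal_window`, `keyframe`), the example values, and the `validate_chain` helper are all hypothetical; the dataset's real schema may differ.

```python
import json

# Hypothetical record: each step narrows the temporal window and picks a keyframe.
EXAMPLE = """
{
  "query": "the dog that jumps over the fence",
  "chain": [
    {"step": 1, "temporal_window": [0, 60], "keyframe": 30},
    {"step": 2, "temporal_window": [20, 40], "keyframe": 32},
    {"step": 3, "temporal_window": [28, 36], "keyframe": 33}
  ]
}
"""


def validate_chain(record: dict) -> None:
    """Raise ValueError if the chain violates the assumed structural rules."""
    chain = record["chain"]
    # The source reports 2-4 turns per example.
    if not 2 <= len(chain) <= 4:
        raise ValueError("chain must contain 2 to 4 steps")
    for expected_index, step in enumerate(chain, start=1):
        if step["step"] != expected_index:
            raise ValueError("steps must be ordered and numbered from 1")
        start, end = step["temporal_window"]
        if start > end:
            raise ValueError("temporal window start must not exceed its end")
        if not isinstance(step["keyframe"], int):
            raise ValueError("keyframe must be an integer index")


record = json.loads(EXAMPLE)
validate_chain(record)  # passes silently for the example above
```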
6. Experimental Evaluation and Ablation Studies
VideoSEG-O3 demonstrates significant performance gains on standard RVOS and reasoning video benchmarks:
- Ref-MeViS (J&F): 60.0% (vs. 55.8% for UniPixel-7B)
- Ref-SAV: 65.5% (vs. 50.0%)
- Long-RVOS: 57.4% (vs. 51.3%)
- Reasoning ReVOS (overall): 67.7% (vs. 61.3% for Veason-R1)
- Zero-shot GroundMoRe: 31.96% (vs. 23.13%)
Ablation studies confirm the contribution of each module:
- CoT cold-start yields +2.11% average J&F.
- RL training provides +3.22% J&F and reduces the number of reasoning rounds by 20%.
- SEG-calibration improves MeViS by +2.59%, Ref-SAV by +1.49%.
- Auxiliary spatial and keyframe losses are additive, contributing up to +1.44% J&F improvement.
- Reward shaping for progressive improvement and keyframe localization is necessary to balance selected segment length against accuracy.
Bidirectional keyframe propagation at inference offers a further +3.12% improvement on MeViS (Dai et al., 5 Jun 2026).
7. Significance and Context Within Video Reasoning Research
VideoSEG-O3 expands the scope of multi-modal video understanding by tightly integrating chain-of-thought reasoning, RL with dense mask feedback, and explicit temporal-spatial action decomposition. Unlike prior frameworks such as Open-o3 Video, which fuses temporal and spatial evidence in structured reasoning for general video QA (Meng et al., 23 Oct 2025), VideoSEG-O3 targets pixel-level RVOS with an explicit multi-turn, selection-based process and a robust mask-aware reinforcement objective. The methods introduced here, particularly SEG-calibrated policy training and tri-modal trace decoupling, provide new paradigms for RL-driven, explainable video reasoning and for segmentation at scale.
The framework's evidence-centered multi-turn nature enables dynamic exploration of lengthy, ambiguous, or complex videos, supporting fine-grained reference resolution and enhancing explainability through explicit trace outputs at each step (Dai et al., 5 Jun 2026). This approach demonstrates that joint, iterative policy learning with direct reward signals for every step of reasoning and segmentation can close the gap between language-driven high-level interpretation and low-level visual pixel accuracy.