VideoSEG-O3: Multi-Turn RL for RVOS

Updated 8 June 2026

The paper introduces an iterative chain-of-thought reinforcement learning approach that progressively refines segmentation masks with pixel-level accuracy.
It models RVOS as a multi-turn Markov Decision Process, enhancing segmentation through a novel reward formulation and SEG-aware logit calibration.
Experimental results reveal significant performance gains in temporal localization and spatial detail over baseline methods, validating the framework's design.

VideoSEG-O3 is a multi-turn reinforcement learning framework developed for Reasoning Video Object Segmentation (RVOS), a task that integrates temporal, spatial, and linguistic reasoning for granular, pixel-level localization in videos. Unlike previous segmentation architectures that rely on a single fixed pass over input frames, VideoSEG-O3 employs an explicit, iterative “coarse-to-fine” chain-of-thought (CoT). Each step successively narrows the temporal and spatial focus, guided by a learned policy that actively acquires and fuses visual evidence, and is optimized via a novel reward formulation that directly incorporates segmentation mask quality.

1. Multi-Turn Markov Decision Process and RL Objective

VideoSEG-O3 models RVOS as a Markov Decision Process (MDP), with each state $s_t = (\mathcal{V}_t, Q, \mathcal{H}_{t-1})$ consisting of the evolving set of visual observations $\mathcal{V}_t$, the language query $Q$, and the accumulated history $\mathcal{H}_{t-1}$ encoding previous actions and reasoning traces. The action space alternates between:

<select>[$t_\text{start}$, $t_\text{end}$, $k$]: specifies a new temporal window and a corresponding high-resolution keyframe, adding further evidence to $\mathcal{V}_{t+1}$.
<answer>[SEG]: triggers the final segmentation mask output via the hidden state associated with the [SEG] token.
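
The state and the two action types can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class names, the string encoding of visual observations, and the check that the keyframe lies inside the selected window are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Select:
    """<select> action: temporal window [t_start, t_end] plus keyframe index k."""
    t_start: int
    t_end: int
    k: int


@dataclass(frozen=True)
class Answer:
    """<answer>[SEG] action: ends the episode and triggers mask decoding."""


Action = Union[Select, Answer]


@dataclass
class State:
    visual: List[str]                                     # V_t (hypothetical string encoding)
    query: str                                            # Q
    history: List[Action] = field(default_factory=list)  # H_{t-1}
    done: bool = False


def transition(state: State, action: Action) -> State:
    """Return the next state: add evidence on <select>, terminate on <answer>."""
    if state.done:
        raise ValueError("episode already terminated")
    history = state.history + [action]
    if isinstance(action, Answer):
        return State(state.visual, state.query, history, done=True)
    # Assumption for this sketch: the keyframe must fall inside the window.
    if not (action.t_start <= action.k <= action.t_end):
        raise ValueError("keyframe must lie inside the selected window")
    new_visual = state.visual + [
        f"window:{action.t_start}-{action.t_end}",
        f"keyframe:{action.k}",
    ]
    return State(new_visual, state.query, history)
```

Each call returns a fresh state rather than mutating the old one, which keeps earlier states available for inspection when replaying a trajectory.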

Transitions $T(s_t, a_t) \rightarrow s_{t+1}$ update the state with new visual content or terminate the episode. Training utilizes a composite episodic reward:

$\mathcal{R} = \sum_t [\mathcal{R}_f + \mathcal{R}_t + \mathcal{R}_m + \mathcal{R}_p]$

where terms respectively reward format/correctness, temporal localization, segmentation mask accuracy, and progressive mask improvement. Optimization employs a calibrated Group Relative Policy Optimization (GRPO) objective augmented by auxiliary cross-entropy losses on sampled frame and keyframe segmentations:

$\mathcal{L}(\theta) = \mathcal{L}_\text{GRPO}(\tilde\pi_\theta) + \lambda_\text{seg}(\mathcal{L}_\text{spatial} + \mathcal{L}_\text{key})$

This approach supplies dense gradient signals both through policy steps and direct mask supervision (Dai et al., 5 Jun 2026).
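
The reward sum and the combined loss above are simple enough to write out directly. The sketch below is a hedged illustration: the dictionary keys and function names are hypothetical, the reward values are made up, and the group-relative advantage follows the usual GRPO recipe of standardizing each rollout's reward within its sampling group (the use of population standard deviation and the `eps` constant are assumptions).

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical keys for R_f, R_t, R_m, R_p.
REWARD_TERMS = ("format", "temporal", "mask", "progress")


def episodic_reward(turns: List[Dict[str, float]]) -> float:
    """R = sum over turns of (R_f + R_t + R_m + R_p); missing terms count as 0."""
    return sum(turn.get(name, 0.0) for turn in turns for name in REWARD_TERMS)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each rollout's reward against the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def total_loss(l_grpo: float, l_spatial: float, l_key: float, lambda_seg: float) -> float:
    """L = L_GRPO + lambda_seg * (L_spatial + L_key)."""
    return l_grpo + lambda_seg * (l_spatial + l_key)


# Two-turn rollout with illustrative (made-up) reward values.
rollout = [
    {"format": 1.0, "temporal": 0.5},
    {"format": 1.0, "mask": 0.8, "progress": 0.1},
]
reward = episodic_reward(rollout)
advantages = group_relative_advantages([reward, 2.0, 1.0])
```

Because advantages are computed relative to the group, a rollout is only reinforced when it scores above its siblings for the same query, which is what lets the mask-quality terms shape the policy without a learned value function.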

2. Hierarchical Temporal-Spatial Chain-of-Thought Reasoning

The system enforces a multi-turn, “coarse-to-fine” reasoning protocol:

Initialization: $\mathcal{V}_0$ comprises globally sampled low-resolution frames and sparse spatial frames.
At each step $t$, the policy conditions on the current state $s_t = (\mathcal{V}_t, Q, \mathcal{H}_{t-1})$ to select a new temporal interval and keyframe.
Selected frames and reasoning traces are incrementally accumulated.
Once sufficient evidence is detected, or a termination criterion is reached, <answer>[SEG] outputs the final segmentation.
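
The protocol above amounts to a short control loop. The following is a minimal sketch under stated assumptions: `run_episode`, `toy_policy`, the string labels for observations, and the default budget of four turns are hypothetical, and the forced answer when the budget runs out is one possible termination criterion rather than the one used in the paper.

```python
from typing import Callable, List, Tuple, Union

Select = Tuple[int, int, int]   # (t_start, t_end, k)
Action = Union[Select, str]     # a Select tuple or the literal "answer"
Policy = Callable[[List[str], str, List[Action]], Action]


def run_episode(policy: Policy, query: str, max_turns: int = 4):
    """Accumulate evidence turn by turn until the policy answers."""
    visual = ["global_lowres_frames", "sparse_spatial_frames"]  # V_0
    history: List[Action] = []
    for _ in range(max_turns):
        action = policy(visual, query, history)
        history.append(action)
        if action == "answer":
            break
        t_start, t_end, k = action
        # Selected clip and keyframe are added to the observation set.
        visual = visual + [f"clip[{t_start}:{t_end}]", f"keyframe[{k}]"]
    else:
        # Turn budget exhausted without an answer: force termination.
        history.append("answer")
    return visual, history


def toy_policy(visual: List[str], query: str, history: List[Action]) -> Action:
    """Stand-in policy: narrow the window twice, then answer."""
    windows = [(0, 120, 60), (40, 80, 55)]
    return windows[len(history)] if len(history) < len(windows) else "answer"


visual, history = run_episode(toy_policy, "the dog that jumps over the fence")
```

In a real system the policy would be the multimodal model itself; the toy policy only demonstrates how the window narrows from coarse to fine before the answer is emitted.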

Each turn decomposes the reasoning via a Decoupled Thinking Trace comprising: (i) temporal understanding on low-res context, (ii) spatial analysis of high-res keyframes, and (iii) linguistic fusion for action selection or answer emission. Enforcing this output schema at every turn enables the policy network to explicitly structure temporal-spatial exploration and leverage intermediate reasoning supervision.

3. SEG-Aware Logit Calibration for Mask-Sensitive RL

Standard text-based policy optimization gives no direct feedback on segmentation quality. VideoSEG-O3 introduces SEG-aware logit calibration in the policy head, yielding the calibrated policy $\tilde\pi_\theta$ that the GRPO objective optimizes. The calibration term averages the pixel-wise likelihoods of the segmentation mask, computed from the mask decoder's logits.

This direct connection between token-level decisions and pixel-level segmentation fidelity lets RL gradients reflect mask accuracy, improving optimization stability and downstream performance (Dai et al., 5 Jun 2026).
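
To make the idea of averaging pixel-wise mask likelihoods concrete, here is a hedged sketch. The exact calibration formula is not reproduced here, so the combination rule below (adding the log of the mean mask likelihood to the [SEG] token's log-probability) is an assumption chosen for illustration, as are all function names.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_mask_likelihood(logits: Sequence[float], mask: Sequence[int]) -> float:
    """Average per-pixel likelihood of a binary mask under the decoder logits."""
    if len(logits) != len(mask) or len(logits) == 0:
        raise ValueError("logits and mask must be non-empty and the same length")
    probs = [sigmoid(z) if m else 1.0 - sigmoid(z) for z, m in zip(logits, mask)]
    return sum(probs) / len(probs)


def calibrated_seg_logprob(seg_token_logprob: float,
                           logits: Sequence[float],
                           mask: Sequence[int]) -> float:
    """ASSUMED form: [SEG] token log-prob plus log of the mean mask likelihood."""
    return seg_token_logprob + math.log(mean_mask_likelihood(logits, mask))


# A mask that agrees with confident logits is penalized far less than one that disagrees.
confident = calibrated_seg_logprob(-1.0, [10.0, -10.0], [1, 0])
mismatched = calibrated_seg_logprob(-1.0, [10.0, -10.0], [0, 1])
```

Under this assumed form, the likelihood ratio used in policy-gradient updates for the [SEG] token depends on the mask decoder's output, so a rollout with a poorly supported mask contributes differently from one with a well supported mask even when the generated text is identical.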

4. Decoupled Thinking Trace Architecture

The architecture decomposes the cognitive trace along three axes:

Stage A: Temporal Understanding – Operates on up to 20 low-resolution frames (28×28), generating coarse temporal candidates.
Stage B: Spatial Detail Capturing – Consumes moderate-resolution spatial frames (128×28×28) and a high-res keyframe (512×28×28) for fine-grained spatial evidence extraction.
Stage C: Expression Parsing – Fuses temporal and spatial representations with the query $Q$ via cross-attention, producing the next action or answer.

This division enables granular supervision, traceable reasoning, and stable long-horizon policy optimization in high-dimensional video RL settings.
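
The resolution figures quoted for Stages A and B can be compared with a little arithmetic. The sketch below assumes they are per-frame pixel budgets expressed in multiples of a 28×28 unit; that reading, along with every name in the code, is an assumption rather than something the source states.

```python
UNIT = 28 * 28  # pixels in one 28x28 unit, the factor shared by all three figures

# Assumed per-frame pixel budgets for each input type.
PIXEL_BUDGET = {
    "stage_a_lowres_frame": 1 * UNIT,
    "stage_b_spatial_frame": 128 * UNIT,
    "stage_b_keyframe": 512 * UNIT,
}

MAX_LOWRES_FRAMES = 20  # Stage A operates on up to 20 low-resolution frames


def stage_a_total_pixels(num_frames: int) -> int:
    """Total pixels Stage A sees for a given number of low-resolution frames."""
    if not 0 < num_frames <= MAX_LOWRES_FRAMES:
        raise ValueError("Stage A takes between 1 and 20 low-resolution frames")
    return num_frames * PIXEL_BUDGET["stage_a_lowres_frame"]
```

Under this reading, one keyframe (401,408 pixels) carries 25.6 times as many pixels as all 20 low-resolution frames combined (15,680), which illustrates why temporal search is done cheaply before high-resolution spatial evidence is requested.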

5. VTS-CoT Dataset for Cold-Start Chain-of-Thought

A 6,000-sample “VTS-CoT” dataset provides cold-start Chain-of-Thought trajectories. Construction involves:

Video curation from ReVOS, MeViS, Long-RVOS.
Automated temporal labeling, mask-quality validation, and top-interval selection via LLM prompts.
Multi-turn CoT synthesis producing 2–4 turn chains per example, annotated via strict global-to-verification prompt protocols.
Annotation format: ordered JSON chain of steps with temporal and spatial selections per step.

VTS-CoT achieves diversity across video lengths, motion dynamics, and linguistic query types, seeding both SFT and RL stages with structured multi-hop reasoning signals (Dai et al., 5 Jun 2026).
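
As an illustration of what an ordered JSON chain with per-step temporal and spatial selections might look like, the sketch below defines a hypothetical record and a small validator. The field names (`video_id`, `chain`, `turn`, `window`, `keyframe`), the example values, and the rule that the last step is an answer are assumptions; only the 2–4 turn length comes from the description above.

```python
import json

# Hypothetical record; not taken from the actual dataset.
EXAMPLE = """
{"video_id": "example_0001",
 "query": "the person who picks up the red cup",
 "chain": [
   {"turn": 1, "action": "select", "window": [0, 120], "keyframe": 64},
   {"turn": 2, "action": "select", "window": [48, 80], "keyframe": 70},
   {"turn": 3, "action": "answer"}
 ]}
"""


def validate_chain(record: dict) -> bool:
    """Check turn count, ordering, and per-step selections of one record."""
    chain = record.get("chain", [])
    if not 2 <= len(chain) <= 4:
        return False
    if [step.get("turn") for step in chain] != list(range(1, len(chain) + 1)):
        return False
    *selects, last = chain
    if last.get("action") != "answer":
        return False
    for step in selects:
        window, keyframe = step.get("window"), step.get("keyframe")
        if step.get("action") != "select":
            return False
        if not (isinstance(window, list) and len(window) == 2):
            return False
        if not isinstance(keyframe, int):
            return False
        # Temporal selection must be ordered and contain the spatial keyframe.
        if not (window[0] <= keyframe <= window[1]):
            return False
    return True


record = json.loads(EXAMPLE)
is_valid = validate_chain(record)
```

A validator of this kind is useful before supervised fine-tuning, since a single malformed chain can teach the model an invalid output schema.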

6. Experimental Evaluation and Ablation Studies

VideoSEG-O3 demonstrates significant performance gains on standard RVOS and reasoning video benchmarks:

Ref-MeViS (J&F): 60.0% (vs. 55.8% for UniPixel-7B)
Ref-SAV: 65.5% (vs. 50.0%)
Long-RVOS: 57.4% (vs. 51.3%)
Reasoning ReVOS (overall): 67.7% (vs. 61.3% for Veason-R1)
Zero-shot GroundMoRe: 31.96% (vs. 23.13%)
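
The margins implied by these figures can be computed directly. The snippet below uses only the numbers listed above; the dictionary layout and function name are illustrative.

```python
# (VideoSEG-O3 score, compared baseline score), in percent, as listed above.
RESULTS = {
    "Ref-MeViS (J&F)": (60.0, 55.8),
    "Ref-SAV": (65.5, 50.0),
    "Long-RVOS": (57.4, 51.3),
    "Reasoning ReVOS (overall)": (67.7, 61.3),
    "Zero-shot GroundMoRe": (31.96, 23.13),
}


def gains(results):
    """Map each benchmark to (absolute gain in points, relative gain in percent)."""
    return {
        name: (round(ours - base, 2), round(100.0 * (ours - base) / base, 1))
        for name, (ours, base) in results.items()
    }


for name, (absolute, relative) in gains(RESULTS).items():
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains are 4.2, 15.5, 6.1, 6.4, and 8.83 points respectively, with the largest margin on Ref-SAV.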

Ablation studies confirm the contribution of each module:

CoT cold-start yields +2.11% average J&F.
RL training provides +3.22% J&F while using 20% fewer reasoning rounds.
SEG-calibration improves MeViS by +2.59%, Ref-SAV by +1.49%.
Auxiliary spatial and keyframe losses are additive, contributing up to +1.44% J&F improvement.
Reward shaping for progressive refinement and keyframe localization is needed to balance the length of selected segments against accuracy.

Bidirectional keyframe propagation at inference offers a further +3.12% improvement on MeViS (Dai et al., 5 Jun 2026).

7. Significance and Context Within Video Reasoning Research

VideoSEG-O3 expands the scope of multi-modal video understanding by tightly integrating chain-of-thought reasoning, RL with dense mask feedback, and explicit temporal-spatial action decomposition. Unlike prior frameworks such as Open-o3 Video, which fuses temporal and spatial evidence in structured reasoning for general video QA (Meng et al., 23 Oct 2025), VideoSEG-O3 targets pixel-level RVOS with an explicit multi-turn, selection-based process and a mask-aware reinforcement objective. The methods introduced here, particularly SEG-calibrated policy training and tri-modal trace decoupling, offer a template for RL-driven, explainable video reasoning and segmentation.

The framework's evidence-centered multi-turn nature enables dynamic exploration of lengthy, ambiguous, or complex videos, supporting fine-grained reference resolution and enhancing explainability through explicit trace outputs at each step (Dai et al., 5 Jun 2026). This approach demonstrates that joint, iterative policy learning with direct reward signals for every step of reasoning and segmentation can close the gap between language-driven high-level interpretation and low-level visual pixel accuracy.


References (2)

VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation (2026)

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence (2025)
