VideoSEG-O3: Multi-Turn RL for RVOS

Updated 8 June 2026
  • The paper introduces an iterative chain-of-thought reinforcement learning approach that progressively refines segmentation masks with pixel-level accuracy.
  • It models RVOS as a multi-turn Markov Decision Process, enhancing segmentation through a novel reward formulation and SEG-aware logit calibration.
  • Reported results show gains over baseline methods in both temporal localization and spatial detail, for example 60.0% versus 55.8% J&F on Ref-MeViS.

VideoSEG-O3 is a multi-turn reinforcement learning framework developed for Reasoning Video Object Segmentation (RVOS), a task that integrates temporal, spatial, and linguistic reasoning for granular, pixel-level localization in videos. Unlike previous segmentation architectures that rely on a single fixed pass over input frames, VideoSEG-O3 employs an explicit, iterative “coarse-to-fine” chain-of-thought (CoT). Each step successively narrows the temporal and spatial focus, guided by a learned policy that actively acquires and fuses visual evidence, and is optimized via a novel reward formulation that directly incorporates segmentation mask quality.

1. Multi-Turn Markov Decision Process and RL Objective

VideoSEG-O3 models RVOS as a Markov Decision Process (MDP), with each state $s_t = (\mathcal{V}_t, Q, \mathcal{H}_{t-1})$ consisting of the evolving set of visual observations $\mathcal{V}_t$, the language query $Q$, and the accumulated history $\mathcal{H}_{t-1}$ encoding previous actions and reasoning traces. The action space alternates between:

  • <select>[$t_\text{start}, t_\text{end}, k$]: specifies a new temporal window and corresponding high-res keyframe, adding further evidence to $\mathcal{V}_{t+1}$.
  • <answer>[SEG]: triggers final segmentation mask output via the hidden state associated with the [SEG] token.
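The state and the two action types above can be sketched as plain data structures. This is a minimal illustration, not the paper's implementation: the class names, the use of frame indices as a stand-in for visual evidence, and the rule that the keyframe lies inside the selected window are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Select:
    """<select>[t_start, t_end, k]: a temporal window plus a keyframe index."""
    t_start: int
    t_end: int
    k: int


@dataclass(frozen=True)
class Answer:
    """<answer>[SEG]: terminate the episode and emit the final mask."""


Action = Union[Select, Answer]


@dataclass
class State:
    frames: List[int]  # V_t: frame indices stand in for visual observations (assumption)
    query: str         # Q: the language query
    history: List[Action] = field(default_factory=list)  # H_{t-1}: past actions


def step(state: State, action: Action) -> Tuple[State, bool]:
    """Toy transition T(s_t, a_t) -> s_{t+1}; returns (next_state, done)."""
    history = state.history + [action]
    if isinstance(action, Answer):
        # Terminal action: no new evidence is added.
        return State(list(state.frames), state.query, history), True
    # Assumption for this sketch: the keyframe must fall inside the window.
    if not (action.t_start <= action.k <= action.t_end):
        raise ValueError("keyframe must lie inside the selected window")
    window = range(action.t_start, action.t_end + 1)
    frames = sorted(set(state.frames) | set(window))
    return State(frames, state.query, history), False
```

A call such as `step(State([0, 50, 100], "the dog that jumps"), Select(40, 44, 42))` returns a non-terminal state whose frame set has grown by the selected window, while `Answer()` ends the episode.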

Transitions $T(s_t, a_t) \rightarrow s_{t+1}$ update the state with new visual content or terminate the episode. Training utilizes a composite episodic reward:

$$\mathcal{R} = \sum_t \left[\mathcal{R}_f + \mathcal{R}_t + \mathcal{R}_m + \mathcal{R}_p\right]$$

where the four terms reward, respectively, format correctness, temporal localization, segmentation mask accuracy, and progressive mask improvement. Optimization employs a calibrated Group Relative Policy Optimization (GRPO) objective augmented by auxiliary cross-entropy losses on sampled frame and keyframe segmentations:

$$\mathcal{L}(\theta) = \mathcal{L}_\text{GRPO}(\tilde\pi_\theta) + \lambda_\text{seg}\left(\mathcal{L}_\text{spatial} + \mathcal{L}_\text{key}\right)$$

This approach supplies dense gradient signals both through policy steps and direct mask supervision (Dai et al., 5 Jun 2026).
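To make the reward and loss composition concrete, the sketch below sums the four per-turn reward terms into an episodic reward, normalizes rewards within a sampled group (the group-relative step that gives GRPO its name), and combines the policy loss with the auxiliary mask losses. The dictionary keys, the population standard deviation, and the `eps` constant are assumptions for illustration; the source does not give the individual reward definitions or a value for $\lambda_\text{seg}$.

```python
from statistics import mean, pstdev
from typing import Dict, List


def episode_reward(turns: List[Dict[str, float]]) -> float:
    """Sum R_f + R_t + R_m + R_p over all turns of one episode."""
    return sum(
        t["format"] + t["temporal"] + t["mask"] + t["progress"] for t in turns
    )


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each episode reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


def total_loss(grpo: float, spatial: float, key: float, lambda_seg: float) -> float:
    """L = L_GRPO + lambda_seg * (L_spatial + L_key)."""
    return grpo + lambda_seg * (spatial + key)
```

Episodes that score above their group mean receive positive advantages and those below receive negative ones, so the policy update depends on relative rather than absolute reward.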

2. Hierarchical Temporal-Spatial Chain-of-Thought Reasoning

The system enforces a multi-turn, “coarse-to-fine” reasoning protocol:

  1. Initialization: $\mathcal{V}_0$ comprises globally sampled low-resolution frames and sparse spatial frames.
  2. At each step $t$, the policy conditions on the current state $(\mathcal{V}_t, Q, \mathcal{H}_{t-1})$ to select a new temporal interval and keyframe.
  3. Selected frames and reasoning traces are incrementally accumulated.
  4. Once sufficient evidence is detected, or a termination criterion is reached, <answer>[SEG] outputs the final segmentation.
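The four steps above amount to a bounded rollout loop. The following sketch uses a hypothetical `toy_policy` that simply narrows the window around its midpoint twice and then answers; a real policy would be the learned model. The tuple encoding of actions, the `max_turns` cap of 4, and the choice to add only the window endpoints and keyframe as new evidence are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Action = Tuple  # ("select", t_start, t_end, k) or ("answer",)
Policy = Callable[[List[int], str, List[Action]], Action]


def toy_policy(frames: List[int], query: str, history: List[Action]) -> Action:
    """Hypothetical policy: shrink the window twice, then answer."""
    selects = [a for a in history if a[0] == "select"]
    if len(selects) >= 2:
        return ("answer",)
    if selects:
        _, lo, hi, _ = selects[-1]
    else:
        lo, hi = frames[0], frames[-1]
    quarter = (hi - lo) // 4
    lo, hi = lo + quarter, hi - quarter
    return ("select", lo, hi, (lo + hi) // 2)


def rollout(policy: Policy, initial_frames: List[int], query: str,
            max_turns: int = 4) -> Tuple[List[int], List[Action]]:
    frames = sorted(initial_frames)      # step 1: coarse global sampling
    history: List[Action] = []
    for _ in range(max_turns):
        action = policy(frames, query, history)   # step 2: pick window + keyframe
        history.append(action)
        if action[0] == "answer":                 # step 4: emit final segmentation
            break
        _, t_start, t_end, k = action
        frames = sorted(set(frames) | {t_start, k, t_end})  # step 3: accumulate
    else:
        # Termination criterion: turn budget exhausted, force an answer.
        history.append(("answer",))
    return frames, history
```

Calling `rollout(toy_policy, [0, 100, 200, 300], "query")` yields two `select` turns with progressively narrower windows followed by an `answer` turn.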

Each turn decomposes the reasoning via a Decoupled Thinking Trace comprising: (i) temporal understanding on low-res context, (ii) spatial analysis of high-res keyframes, and (iii) linguistic fusion for action selection or answer emission. Every turn's output must follow this enforced schema, which lets the policy network explicitly structure temporal-spatial exploration and leverage intermediate reasoning supervision.

3. SEG-Aware Logit Calibration for Mask-Sensitive RL

Standard text-based policy optimization gives no direct feedback on segmentation quality. VideoSEG-O3 introduces SEG-aware logit calibration in the policy head: the policy's logits are calibrated with a mask-confidence term that averages the pixel-wise likelihoods of the segmentation mask, computed from the mask decoder's logits. This direct connection between token-level decisions and pixel-level segmentation fidelity lets RL gradients reflect mask quality, improving optimization stability and downstream performance (Dai et al., 5 Jun 2026).
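As a rough illustration of how a mask-confidence term could enter a token log-probability, the sketch below adds the log of the mean pixel-wise likelihood to the log-probability of a token. This is one plausible form assumed purely for the example and should not be read as the paper's calibration equation: the additive log form, the `alpha` weight, and the definition of pixel likelihood as the probability of each pixel's thresholded label are all assumptions.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mean_mask_likelihood(mask_logits: List[float]) -> float:
    """Average, over pixels, of the probability of the predicted label.

    Assumption: a pixel's likelihood is max(p, 1 - p) with p = sigmoid(logit),
    i.e. the confidence of its thresholded foreground/background decision.
    """
    probs = [sigmoid(z) for z in mask_logits]
    return sum(max(p, 1.0 - p) for p in probs) / len(probs)


def calibrated_logprob(token_logprob: float, mask_logits: List[float],
                       alpha: float = 1.0) -> float:
    """Hypothetical calibration: log pi + alpha * log(mean pixel likelihood)."""
    return token_logprob + alpha * math.log(mean_mask_likelihood(mask_logits))
```

Under this assumed form, a confident mask leaves the token log-probability almost unchanged, while an uncertain mask (logits near zero) lowers it, so the policy gradient becomes sensitive to mask quality.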

4. Decoupled Thinking Trace Architecture

The architecture decomposes the cognitive trace along three axes:

  • Stage A: Temporal Understanding – Operates on up to 20 low-resolution frames (28×28), generating coarse temporal candidates.
  • Stage B: Spatial Detail Capturing – Consumes moderate-resolution spatial frames (128×28×28) and a high-res keyframe (512×28×28) for fine-grained spatial evidence extraction.
  • Stage C: Expression Parsing – Fuses temporal and spatial representations with the query $Q$ via cross-attention, producing the next action or answer.

This division enables granular supervision, traceable reasoning, and stable long-horizon policy optimization in high-dimensional video RL settings.

5. VTS-CoT Dataset for Cold-Start Chain-of-Thought

A 6,000-sample “VTS-CoT” dataset provides cold-start Chain-of-Thought trajectories. Construction involves:

  • Video curation from ReVOS, MeViS, and Long-RVOS.
  • Automated temporal labeling, mask-quality validation, and top-interval selection via LLM prompts.
  • Multi-turn CoT synthesis producing 2–4 turn chains per example, annotated via strict global-to-verification prompt protocols.
  • Annotation format: ordered JSON chain of steps with temporal and spatial selections per step.
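A record in the annotation format described above might look like the following. The field names (`video`, `query`, `chain`, `turn`, `temporal`, `spatial`, `keyframe`) and all values are hypothetical, chosen only to show an ordered chain with one temporal and one spatial selection per step; the validator checks the 2–4 turn range stated above plus an assumed rule that each keyframe lies inside its interval.

```python
import json
from typing import Any, Dict

# Hypothetical record; names and values are illustrative, not from the dataset.
example: Dict[str, Any] = {
    "video": "example_video",
    "query": "the person who picks up the cup",
    "chain": [
        {"turn": 1, "temporal": {"start": 0, "end": 120}, "spatial": {"keyframe": 60}},
        {"turn": 2, "temporal": {"start": 40, "end": 80}, "spatial": {"keyframe": 55}},
        {"turn": 3, "temporal": {"start": 50, "end": 62}, "spatial": {"keyframe": 57}},
    ],
}


def validate_chain(record: Dict[str, Any], min_turns: int = 2,
                   max_turns: int = 4) -> bool:
    """Check turn count, ordering, and keyframe-inside-interval (assumed rule)."""
    chain = record.get("chain", [])
    if not (min_turns <= len(chain) <= max_turns):
        return False
    for index, step in enumerate(chain, start=1):
        if step["turn"] != index:
            return False
        interval = step["temporal"]
        if not (interval["start"] <= step["spatial"]["keyframe"] <= interval["end"]):
            return False
    return True


serialized = json.dumps(example, indent=2)  # ordered JSON chain of steps
```

Keeping the chain as an ordered list makes the coarse-to-fine narrowing visible in the data itself: each step's interval in this example sits inside the previous one.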

VTS-CoT achieves diversity across video lengths, motion dynamics, and linguistic query types, seeding both SFT and RL stages with structured multi-hop reasoning signals (Dai et al., 5 Jun 2026).

6. Experimental Evaluation and Ablation Studies

VideoSEG-O3 demonstrates significant performance gains on standard RVOS and reasoning video benchmarks:

  • Ref-MeViS (J&F): 60.0% (vs. 55.8% for UniPixel-7B)
  • Ref-SAV: 65.5% (vs. 50.0%)
  • Long-RVOS: 57.4% (vs. 51.3%)
  • Reasoning ReVOS (overall): 67.7% (vs. 61.3% for Veason-R1)
  • Zero-shot GroundMoRe: 31.96% (vs. 23.13%)
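The absolute and relative gains implied by these figures can be checked with a few lines of arithmetic. The numbers are copied from the list above; the dictionary layout and function name are only for illustration.

```python
from typing import Dict, Tuple

# (VideoSEG-O3 score, compared baseline score), in percent, from the list above.
results: Dict[str, Tuple[float, float]] = {
    "Ref-MeViS": (60.0, 55.8),
    "Ref-SAV": (65.5, 50.0),
    "Long-RVOS": (57.4, 51.3),
    "ReVOS (overall)": (67.7, 61.3),
    "GroundMoRe (zero-shot)": (31.96, 23.13),
}


def gains(table: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return (absolute gain in points, relative gain in percent) per benchmark."""
    out = {}
    for name, (ours, baseline) in table.items():
        absolute = round(ours - baseline, 2)
        relative = round(100.0 * (ours - baseline) / baseline, 1)
        out[name] = (absolute, relative)
    return out


for name, (absolute, relative) in gains(results).items():
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains are 4.2, 15.5, 6.1, 6.4, and 8.83 points respectively, with Ref-SAV showing the largest margin.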

Ablation studies confirm the contribution of each module:

  • CoT cold-start yields +2.11% average J&F.
  • RL training provides +3.22% J&F while using 20% fewer reasoning rounds.
  • SEG-calibration improves MeViS by +2.59%, Ref-SAV by +1.49%.
  • Auxiliary spatial and keyframe losses are additive, contributing up to +1.44% J&F improvement.
  • Reward shaping for progressive improvement and keyframe localization is needed to balance selected segment length against accuracy.

Bidirectional keyframe propagation at inference adds a further +3.12% on MeViS (Dai et al., 5 Jun 2026).

7. Significance and Context Within Video Reasoning Research

VideoSEG-O3 expands the scope of multi-modal video understanding by tightly integrating chain-of-thought reasoning, RL with dense mask feedback, and explicit temporal-spatial action decomposition. Unlike prior frameworks such as Open-o3 Video, which fuses temporal and spatial evidence in structured reasoning for general video QA (Meng et al., 23 Oct 2025), VideoSEG-O3 targets pixel-level RVOS with an explicit multi-turn, selection-based process and a mask-aware reinforcement objective. The methods introduced here, particularly SEG-calibrated policy training and tri-modal trace decoupling, offer a template for RL-driven, explainable video reasoning and segmentation at scale.

The framework's evidence-centered, multi-turn design enables dynamic exploration of lengthy, ambiguous, or complex videos, supporting fine-grained reference resolution and enhancing explainability through explicit trace outputs at each step (Dai et al., 5 Jun 2026). The results suggest that joint, iterative policy learning with direct reward signals for every step of reasoning and segmentation can narrow the gap between language-driven high-level interpretation and low-level pixel accuracy.
