ReVSeg: Reinforcement Video Segmentation
- ReVSeg is a video segmentation framework that decomposes VOS into three explicit steps: semantic interpretation, temporal evidence selection, and spatial grounding.
- It employs reinforcement learning to optimize a sequential decision process, achieving superior accuracy and robust zero-shot performance.
- The approach enhances interpretability by providing explicit, debuggable chain-of-thought outputs and natural language rationales at each step.
ReVSeg is a video object segmentation (VOS) framework that restructures reasoning-centric VOS as an explicit sequence of modular, reinforcement-learned operations executed with a pretrained vision–LLM (VLM). Unlike previous models that collapse semantic interpretation, temporal reasoning, and spatial grounding into a single latent embedding, ReVSeg decomposes VOS into three VLM-aligned primitives—semantic interpretation, temporal evidence selection, and spatial grounding—and optimizes this multi-step reasoning chain with end-to-end reinforcement learning (RL) to improve both interpretability and performance (Li et al., 2 Dec 2025).
1. Motivation and Problem Formulation
Traditional VOS models, when addressing queries involving complex dynamics, causality, or commonsense, typically encode the reasoning process into latent representations (e.g., a special <SEG> token). This approach introduces several limitations:
- Opaque and uninterpretable reasoning chains.
- High demand for fine-tuning data to map VLM outputs to dense segmentation masks.
- Distribution shift from VLM's pretrained textual interface to a spatial mask predictor.
ReVSeg addresses these issues by decomposing reasoning-centric VOS into a structured, sequential decision-making process, operating natively in the VLM interface. This structure allows for transparent intermediate outputs, better alignment with pretrained model capabilities, and reduced sample complexity (Li et al., 2 Dec 2025).
2. Decomposition-Driven Pipeline
The ReVSeg pipeline consists of three explicit reasoning operations, executed as a two-round dialogue with the VLM, with each step leaning on VLM-native strengths:
- Semantic Interpretation: The VLM receives the video frames and the textual query under a dedicated round-1 instruction. It outputs a chain-of-thought response containing both a concise object description and a keyframe index, which is parsed to extract a status flag ($S_1$), the object description, and the keyframe index.
- Temporal Evidence Selection: Occurring within round 1, this step selects the keyframe where the target object is most visible/unambiguous, reducing spatio-temporal localization to a spatial task.
- Spatial Grounding: In round 2, the VLM is prompted with the selected keyframe, the textual object description, and the round-1 dialogue history. The response is parsed for a status flag ($S_2$) and the predicted bounding box on the keyframe.
The resulting (keyframe, bounding box) tuple is then passed to an off-the-shelf video tracker (e.g., SAM2), which propagates it into per-frame segmentation masks. All reasoning runs in a single VLM session, so complete semantic context is preserved across rounds (Li et al., 2 Dec 2025).
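A minimal sketch of this two-round inference flow is given below. The `vlm.chat` and `tracker.propagate` interfaces and the JSON field names are illustrative assumptions standing in for whatever VLM client and SAM2 wrapper are actually used; they are not the paper's API.

```python
import json

def revseg_infer(vlm, tracker, frames, query):
    """Sketch of ReVSeg-style two-round reasoning followed by mask propagation.

    `vlm`, `tracker`, and all method/field names here are illustrative stand-ins.
    """
    # Round 1: semantic interpretation + temporal evidence selection.
    # The VLM sees the sampled frames and the query, and returns a JSON payload
    # with a status flag, an object description, and a keyframe index.
    r1 = vlm.chat(images=frames, prompt=f"<round-1 instruction>\nQuery: {query}")
    out1 = json.loads(r1)          # e.g. {"status": ..., "description": ..., "keyframe": ...}
    if out1["status"] != "succ":
        return None                # target judged absent / query unanswerable

    keyframe_idx = out1["keyframe"]
    description = out1["description"]

    # Round 2: spatial grounding on the selected keyframe, within the same session
    # so the full round-1 dialogue history remains in context.
    r2 = vlm.chat(images=[frames[keyframe_idx]],
                  prompt=f"<round-2 instruction>\nTarget: {description}")
    out2 = json.loads(r2)          # e.g. {"status": ..., "bbox": [x1, y1, x2, y2]}
    if out2["status"] != "succ":
        return None

    # Hand the (keyframe, box) prompt to an off-the-shelf tracker (e.g., SAM2)
    # to propagate per-frame segmentation masks over the whole video.
    return tracker.propagate(frames, keyframe_idx, out2["bbox"])
```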
3. Reinforcement Learning Optimization
ReVSeg optimizes the entire two-round rollout as a sequential decision policy $\pi_\theta$, parameterized by the VLM and trained using Group Relative Policy Optimization (GRPO). The policy is defined over the generation of the two rounds' token sequences: a trajectory $\tau$ comprises the round-1 response (status, object description, keyframe) followed by the round-2 response (status, bounding box), with the round-2 state conditioned on the round-1 output and the selected keyframe.
A composite, sparse reward function is introduced for robust and interpretable learning:
- Format reward $r_f$: Penalizes malformed outputs or missing fields.
- Temporal reward $r_t$: Encourages selection of frames where the target's ground-truth segmentation mask is large and clearly visible, defined as the normalized area of the ground-truth mask at the selected keyframe.
- Spatial reward $r_s$: Binary; 1 if the predicted box achieves IoU > 0.5 with the ground-truth box.
The total reward for a rollout is $R(\tau) = r_f + \mathbf{1}_{(S_1 = \mathrm{succ})}\, r_t + \mathbf{1}_{(S_1 = \mathrm{succ} \wedge S_2 = \mathrm{succ})}\, r_s$, so the temporal and spatial terms are granted only when the corresponding status flags report success.
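For concreteness, the gated composite reward above could be computed roughly as follows. The `iou` helper, the penalty constant for malformed outputs, and the choice to normalize the temporal term by the largest per-frame mask area are assumptions for illustration, not the paper's exact implementation.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def composite_reward(format_ok, s1_succ, s2_succ,
                     gt_mask_areas=None, keyframe_idx=None,
                     pred_box=None, gt_box=None):
    """Illustrative composite reward; constants and normalization are assumptions."""
    # Format reward: penalize malformed outputs or missing fields.
    r_f = 0.0 if format_ok else -1.0
    reward = r_f

    if s1_succ and gt_mask_areas is not None:
        # Temporal reward: ground-truth mask area at the chosen frame,
        # normalized here by the largest per-frame area (one plausible choice).
        r_t = gt_mask_areas[keyframe_idx] / max(max(gt_mask_areas), 1e-6)
        reward += r_t

    if s1_succ and s2_succ and pred_box is not None:
        # Spatial reward: binary, granted when the predicted box has IoU > 0.5.
        r_s = 1.0 if iou(pred_box, gt_box) > 0.5 else 0.0
        reward += r_s

    return reward
```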
The GRPO loss weights each rollout by a normalized within-group advantage, $\hat{A}_i = \big(R(\tau_i) - \mathrm{mean}(\{R(\tau_j)\}_{j=1}^{G})\big) / \mathrm{std}(\{R(\tau_j)\}_{j=1}^{G})$, computed over the $G$ rollouts sampled for the same prompt, and adds a KL-regularization term $\beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ against the frozen reference model (Li et al., 2 Dec 2025).
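A minimal sketch of the within-group advantage normalization that GRPO applies to each prompt's rollouts (the KL term enters the policy loss separately; function and variable names are illustrative):

```python
import numpy as np

def group_advantages(rewards):
    """Normalize rewards within one prompt's group of rollouts (GRPO-style)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g., 8 rollouts per prompt, matching the reported training setup
adv = group_advantages([0.0, 1.3, 1.8, 0.0, 0.9, 1.8, 0.4, 1.3])
```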
4. Experimental Results
ReVSeg demonstrates superior performance compared to contemporary vision–language reasoning models across multiple video object segmentation benchmarks. Notably, it achieves both high overall accuracy and robustness in zero-shot settings. The following summarizes results as reported (Li et al., 2 Dec 2025):
Table: Performance Metrics on ReasonVOS (zero-shot)
| Method | J | F | (J+F)/2 |
|---|---|---|---|
| VideoLISA (NeurIPS’24) | 45.1 | 49.9 | 47.5 |
| RGA-7B (ICCV’25) | 51.3 | 56.0 | 53.6 |
| CoT-RVS-7B (arXiv’25) | 49.5 | 54.5 | 52.0 |
| ReVSeg-7B | 61.8 | 67.7 | 64.8 |
On the ReVOS held-out benchmark, ReVSeg-7B attains a region similarity (J) of 59.3, contour accuracy (F) of 65.0, and mean (J+F)/2 of 62.1, outperforming VRS-HQ-7B and VISA-13B. For standard Ref-VOS datasets (DAVIS17/MeViS), ReVSeg achieves (J+F)/2 scores of 80.8 and 59.8, respectively.
Ablation studies reveal that both the explicit decomposition and RL optimization substantially improve performance: removing either drastically reduces segmentation quality (e.g., (J+F)/2 = 4.4 on DAVIS when both are absent, versus 80.8 for ReVSeg with both).
5. Qualitative Interpretability and Chain-of-Thought Reasoning
ReVSeg yields explicit, interpretable reasoning trajectories, with intermediate outputs including natural-language rationales and object descriptions. Semantic interpretation steps clarify VLM reasoning (e.g., choosing the leader of a herd based on commonsense), while temporal evidence selection visibly aligns with target visibility in the video. JSON-structured outputs at each step enable systematic debugging and verification, addressing a key limitation of prior latent-embedding approaches (Li et al., 2 Dec 2025).
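To illustrate why the structured outputs are easy to verify, the snippet below parses a plausible (not verbatim) round-1 response; the exact schema and field names in ReVSeg's prompts are assumptions here. A failed parse or missing field is exactly the kind of error the format reward penalizes.

```python
import json

# Illustrative round-1 output; schema/field names are assumed, not quoted from the paper.
raw = '{"status": "succ", "description": "the black goat leading the herd", "keyframe": 7}'

def parse_round1(text):
    """Validate a round-1 response; a failed parse maps to the format penalty."""
    try:
        out = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(out, dict) or not {"status", "description", "keyframe"} <= out.keys():
        return None
    return out

print(parse_round1(raw))
```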
6. Implementation and Design Details
- Vision-LLM (): Qwen2.5-VL-7B.
- Video Tracker (): SAM2 (Hiera-L).
- Training Data: 67,000 video-query pairs from Ref-YouTube-VOS, MeViS, Ref-DAVIS17, ReVOS, and LV-VIS.
- Frame Processing: First round on 16 uniformly sampled frames at 448×448 resolution; second round’s keyframe at 840×840; tracker operates at full resolution.
- Optimization: Batch size 128 videos, 8 rollouts per prompt, learning rate 1×10⁻⁶, KL weight β=10⁻³; chain-of-thought prompts enforce strict JSON output.
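The reported setup can be summarized as a configuration sketch; the dictionary keys are illustrative names, while the values follow the details listed above.

```python
# Illustrative config keys; values follow the reported ReVSeg training setup.
revseg_train_config = {
    "vlm": "Qwen2.5-VL-7B",
    "tracker": "SAM2 (Hiera-L)",
    "num_train_pairs": 67_000,
    "frames_round1": 16,              # uniformly sampled
    "res_round1": (448, 448),
    "res_round2_keyframe": (840, 840),
    "batch_size_videos": 128,
    "rollouts_per_prompt": 8,         # GRPO group size
    "learning_rate": 1e-6,
    "kl_weight_beta": 1e-3,
    "output_format": "strict JSON enforced by chain-of-thought prompts",
}
```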
Ablation on the temporal reward design demonstrates that soft, area-based rewards provide measurable gains over binary alternatives.
7. Significance and Extensions
ReVSeg reconceptualizes VOS as a multi-step, interpretable process aligned with VLM strengths, effectively bridging the gap between reasoning requirements and pretrained interface capabilities. This modular decomposition, coupled with RL optimization using sparse, task-aligned rewards, delivers both quantitative and qualitative improvements:
- State-of-the-art accuracy on both in-domain and zero-shot VOS tasks.
- Fully transparent, debuggable reasoning for each segmentation decision. A plausible implication is that similar decomposition and RL-based modularity can generalize to other multi-modal, reasoning-intensive video understanding tasks (Li et al., 2 Dec 2025).