Referring Video Object Segmentation
- Referring Video Object Segmentation (RVOS) is the task of segmenting objects in videos based on natural language expressions, emphasizing cross-modal alignment and temporal consistency.
- It integrates vision and language processing via dual-branch models, transformer architectures, and prompt-driven frameworks to achieve robust, real-time segmentation.
- Recent advancements address challenges such as disambiguation, efficient supervision, and temporal modeling, paving the way for scalable and interpretable applications.
Referring Video Object Segmentation (RVOS) is the task of segmenting objects in video sequences as specified by natural language descriptions. The objective is to generate temporally coherent binary masks for the target object(s) throughout a video, conditioning the segmentation on free-form, expressive queries. This paradigm extends video object segmentation from pixel-wise tracking to language-grounded spatial reasoning, presenting unique cross-modal and temporal challenges. RVOS has become a focal research area bridging computer vision and natural language processing, with substantial methodological innovations, tailored benchmarks, and the emergence of scalable, multimodal architectures.
1. Problem Formulation and Scope
RVOS is defined as follows: given a video $V = \{I_t\}_{t=1}^{T}$ and a referring expression (RE) $E$, produce a sequence of masks $\{M_t\}_{t=1}^{T}$, where $M_t$ segments the object in frame $I_t$ referred to by $E$. Unlike semi-supervised VOS, which assumes object masks in the first frame, RVOS requires grounding the object solely through language cues and the video sequence. The task encapsulates several nontrivial factors:
- Cross-modal alignment: RVOS models must bridge the semantic gap between textual reference and visual content, parsing language for object category, appearance, spatial location, and dynamic cues.
- Temporal consistency: The segmentation must remain robust despite object motion, occlusion, and appearance changes across frames.
- Disambiguation: When multiple similar objects are present, the system must resolve references using subtle cues (spatial, appearance, action) in the RE.
RVOS thus serves as a testbed for complex video-language reasoning, semantic grounding, and long-term spatiotemporal modeling.
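Stated compactly in the notation above, with $f_\theta$ introduced here purely as a stand-in for an arbitrary RVOS model, the task is

$$
M_t = f_\theta\left(I_t,\, V,\, E\right) \in \{0,1\}^{H \times W}, \qquad t = 1, \dots, T,
$$

subject to the requirement that the sequence $\{M_t\}_{t=1}^{T}$ consistently delineates the object(s) referred to by $E$ across all frames.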
2. Architectural Approaches and Model Design
RVOS models have evolved from dual-branch convolutional designs to unified transformer-based and prompt-driven frameworks:
- Early Dual-Branch Models: RefVOS (Bellver et al., 2020) exemplifies early architectures, using DeepLabv3 with a ResNet101 visual backbone, BERT language encoding, and element-wise multiplication for fusion. Temporal modeling is absent (frames are handled independently), yet the approach achieves competitive results via rich language understanding and simple, effective late fusion; a minimal fusion sketch follows this list.
- Unified Transformer Models: With the advent of transformer architectures, models such as ReferFormer, FTEA (Li et al., 2023), and SOC (Luo et al., 2023) integrate joint space learning for video-level visual-linguistic alignment. These models use multi-scale visual encoders (e.g., Video Swin Transformer), transformer-based language encoders (e.g., RoBERTa), and multi-modal cross-attention to produce instance-specific queries. Stacked attention and mask sequence learning in FTEA treat RVOS as a sequence classification problem, creating diverse candidate masks and matching via referring scores.
- Object Cluster and Aggregation Frameworks: SOC clusters object representations across the entire video, using a two-stage temporal modeling architecture and contrastive learning to maintain cross-frame consistency and semantic alignment. This enables effective reasoning about temporal actions or state variations mentioned in REs.
- Prompt-based and Modular Paradigms: Recent frameworks (e.g., Tenet (Lin et al., 8 Oct 2025), GroPrompt (Lin et al., 18 Jun 2024)) decouple the segmentation head from object grounding. They employ foundation segmentation models (e.g., SAM) guided by temporal prompts derived from object detectors/tracking pipelines, and use additional modules (e.g., Prompt Preference Learning, TAP-CL) to select or refine the prompts, demonstrating scalability and avoidance of expensive dense mask annotation.
- Visual Grounding Foundations: ReferDINO (Liang et al., 24 Jan 2025) leverages region-level vision-language alignment from pretrained detection models (GroundingDINO) and extends them with deformable mask decoding and temporal consistency modules. This fusion of spatially precise visual grounding and spatiotemporal reasoning yields marked performance gains and real-time capability.
- LLM-Driven and Training-Free Reasoning: PARSE-VOS (Zhao et al., 6 Sep 2025) introduces a hierarchical, LLM-powered, training-free inference process, parsing REs into semantic commands and combining modular object tracking/segmentation with LLM-based coarse-to-fine trajectory scoring, providing robust generalization without end-to-end video segmentation pretraining.
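To make the dual-branch design concrete, the sketch below (referenced from the RefVOS item above) illustrates late fusion in schematic form: per-pixel visual features are modulated by a projected sentence embedding through element-wise multiplication before a lightweight mask head. It is a minimal illustration under assumed feature shapes and module names, not the published RefVOS implementation.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Minimal dual-branch late fusion: visual features x language embedding.

    Assumes a visual backbone producing (B, C, H, W) feature maps and a
    sentence encoder producing (B, D) embeddings; names and sizes are
    illustrative, not those of the original implementation.
    """

    def __init__(self, visual_dim: int = 256, text_dim: int = 768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, visual_dim)   # project BERT-style embedding
        self.mask_head = nn.Conv2d(visual_dim, 1, kernel_size=1)  # binary mask logits

    def forward(self, visual_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Broadcast the projected sentence embedding over all spatial locations
        lang = self.text_proj(text_emb)[:, :, None, None]   # (B, C, 1, 1)
        fused = visual_feats * lang                          # element-wise late fusion
        return self.mask_head(fused)                         # (B, 1, H, W) logits


# Usage with dummy tensors standing in for backbone / encoder outputs.
feats = torch.randn(2, 256, 64, 64)       # e.g. DeepLabv3-style features
sentence = torch.randn(2, 768)            # e.g. pooled BERT embedding
logits = LateFusionHead()(feats, sentence)
masks = logits.sigmoid() > 0.5            # per-frame binary masks
```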
3. Challenges in Referring Expression Understanding and Benchmarking
A foundational insight from RefVOS (Bellver et al., 2020) is the predominance of “trivial” referring expressions in existing benchmarks such as DAVIS-2017 and A2D. These are cases where the object can be identified solely from its category name, as it is unique in the scene. Conversely, “non-trivial” REs arise when multiple instances of an object class are present and additional cues (appearance, location, motion, interactions) are needed for disambiguation. The authors introduce a detailed semantic taxonomy:
Category | Description |
---|---|
category | Object class or noun (e.g., "man," "dog") |
appearance | Visual attributes (e.g., "in a red shirt") |
location | Spatial context (e.g., "on the left") |
motion | Dynamic cues (e.g., "walking forward") |
obj-motion | Object the referent sets in motion (e.g., "riding a bike") |
static | Static action or status (e.g., "sitting") |
obj-static | Static interaction with another object (e.g., "holding a cup") |
Empirical studies (Bellver et al., 2020) reveal that existing datasets are dominated by category/appearance cues, with motion, location, or interaction-based REs severely underrepresented. Models that perform well overall may fail on these non-trivial, semantically demanding cases. This motivates new benchmarks (e.g., Long-RVOS (Liang et al., 19 May 2025)) and task variants that expose temporal occlusion, long video duration, and compositional language, and encourages evaluation protocols that stratify results across RE types.
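A lightweight way to realize such stratified evaluation is to tag each referring expression with its cue categories and report the metric per category. The sketch below assumes per-expression J&F scores are already computed and that category tags follow the taxonomy above; the record fields and example values are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Each record: a per-expression score plus its annotated cue categories
# (taxonomy from the table above); values here are illustrative only.
results = [
    {"expression": "the man in a red shirt", "categories": ["category", "appearance"], "jf": 0.71},
    {"expression": "the dog on the left",     "categories": ["category", "location"],   "jf": 0.58},
    {"expression": "person riding a bike",    "categories": ["category", "obj-motion"], "jf": 0.49},
]

def stratify_by_category(records):
    """Group per-expression J&F scores by RE cue type and average them."""
    buckets = defaultdict(list)
    for rec in records:
        for cat in rec["categories"]:
            buckets[cat].append(rec["jf"])
    return {cat: mean(scores) for cat, scores in buckets.items()}

print(stratify_by_category(results))  # mean J&F per cue type, exposing weaker categories
```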
4. Temporal Modeling and Consistency
A repeated challenge in RVOS is temporal consistency—preserving coherent segmentation across frames in the presence of object motion, occlusion, and dynamic scene evolution. Approaches include:
- Inter-Frame Reasoning: Models such as BIFIT (Lan et al., 2023) employ plug-and-play transformer modules inserted into the decoder to explicitly model inter-frame interactions, improving mask stability and temporal correspondence; a generic sketch of this idea follows the list below.
- Video-Level Coherence: SOC (Luo et al., 2023) and hybrid memory frameworks (HTR (Miao et al., 28 Mar 2024)) aggregate object-level embeddings across frames using temporal attention and clustering, ensuring that instance identity is preserved through complex video events.
- Propagation and Keyframe Strategies: Competition-winning solutions ensemble per-frame masks and apply propagation-based post-processing such as AOT (Hu et al., 2022), using temporally consistent keyframes as propagation seeds to reduce mask jitter and drift.
- Long-Term Consistency Benchmarks: Long-RVOS (Liang et al., 19 May 2025) introduces explicit temporal IoU (tIoU) and spatiotemporal volume IoU (vIoU) metrics, revealing that temporal inconsistency is a major failure mode for state-of-the-art systems trained only on short clips, and guiding the development of local-to-global architectures that integrate motion vectors and aligned keyframes.
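The inter-frame reasoning item above references the sketch below: a plug-in temporal module realized as self-attention over per-frame object queries, so that each frame's query can borrow evidence from the rest of the video. This is a generic illustration of the idea rather than the BIFIT module, and all dimensions are assumed.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """Plug-in temporal module: object queries attend across frames.

    Generic sketch of inter-frame reasoning (not the BIFIT implementation):
    input is one query vector per frame for a single object in a video.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_queries: torch.Tensor) -> torch.Tensor:
        # frame_queries: (B, T, C) -- one decoder query per frame
        attended, _ = self.attn(frame_queries, frame_queries, frame_queries)
        return self.norm(frame_queries + attended)  # residual temporal refinement


queries = torch.randn(2, 16, 256)          # batch of 2 videos, 16 frames each
refined = InterFrameAttention()(queries)   # temporally refined per-frame queries
```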
5. Training Paradigms, Annotation, and Supervision Strategies
Full pixel-level annotation for every frame is prohibitively expensive. Several strategies emerge:
- Weak Supervision and Efficient Annotation: SimRVOS (Zhao et al., 2023) proposes annotating a mask only in the frame where the object first appears, with subsequent frames marked by bounding boxes, reducing annotation effort by a factor of approximately eight. Cross-frame language-guided dynamic filters and bi-level contrastive learning compensate for the reduced supervision and narrow the performance gap to fully supervised models.
- Semi-supervised Learning: Cyclical learning rates, training with pseudo labels, and test-time augmentations enable strong improvements (cf. ReferFormer+ in (Cao et al., 2022)), with iterative pseudo-label refinement yielding marked boosts in standard benchmarks.
- Prompt Learning and Foundation Model Adaptation: GroPrompt (Lin et al., 18 Jun 2024) and Tenet (Lin et al., 8 Oct 2025) formulate RVOS as a prompt selection/optimization problem, sidestepping expensive mask supervision. Temporal prompt selection employs ranking losses and contrastive learning to align the temporal sequence of position prompts to both the referring sentence and the global context, leveraging pre-trained segmentation networks for the pixel-level mask task.
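The contrastive alignment of temporal prompts to the referring sentence can be sketched as an InfoNCE-style objective over candidate prompt trajectories, as below. This is a schematic under assumed encoders and feature shapes, not the actual GroPrompt or Tenet training objective.

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(traj_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            positive_idx: int,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: pull the referred trajectory toward the sentence.

    traj_emb: (N, D) embeddings of N candidate prompt trajectories
    text_emb: (D,)   embedding of the referring sentence
    positive_idx: index of the trajectory that matches the sentence
    (Schematic objective; encoders and shapes are assumptions.)
    """
    traj_emb = F.normalize(traj_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = traj_emb @ text_emb / temperature            # (N,) similarity scores
    target = torch.tensor(positive_idx)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))


# Candidate trajectories (e.g. pooled box-sequence features) vs. one sentence.
candidates = torch.randn(5, 256)
sentence = torch.randn(256)
loss = prompt_contrastive_loss(candidates, sentence, positive_idx=2)
```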
6. Empirical Results and Performance Metrics
Leading models achieve strong results on challenging datasets:
Model | Dataset | Metric | Performance |
---|---|---|---|
RefVOS | DAVIS-2017 | J&F (region J, contour F) | Outperforms Khoreva et al. by several points (Bellver et al., 2020) |
SOC | Ref-YouTube-VOS | J&F avg. | +3.0% over previous SOTA (Luo et al., 2023) |
ReferDINO | Ref-YouTube-VOS | J&F avg. | 69.3% (Swin-B backbone), +3.9% over prior SOTA (Liang et al., 24 Jan 2025) |
Tenet | Ref-YouTube-VOS | J&F | 65.5% with only 45M trainable parameters (Lin et al., 8 Oct 2025) |
HTR | Ref-YouTube-VOS | J&F avg. | 67.1% (Swin-B/L) (Miao et al., 28 Mar 2024) |
FTEA | Ref-YouTube-VOS | J&F avg. | 56.6% (Li et al., 2023) |
Performance is generally assessed via region similarity (J), contour accuracy (F), and per-frame IoU, as well as bespoke metrics for temporal consistency (MCS (Miao et al., 28 Mar 2024), tIoU and vIoU (Liang et al., 19 May 2025)). Model efficiency, real-time inference speed (e.g., 51 FPS in ReferDINO with query pruning), and robustness to occlusion, appearance variation, and corrupted data are increasingly reported.
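For reference, the region and contour measures above can be computed per frame and averaged. The sketch below implements region similarity J as mask IoU and a simplified boundary F-measure with a fixed pixel tolerance; it follows the general form of the standard DAVIS measures rather than any benchmark's official evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: IoU between predicted and ground-truth masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Simplified contour accuracy F: boundary precision/recall with a pixel
    tolerance `tol` (an approximation of the official DAVIS measure)."""
    def boundary(mask):
        return np.logical_xor(mask, binary_erosion(mask))  # inner boundary pixels
    pred_b, gt_b = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    if pred_b.sum() == 0 and gt_b.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = np.logical_and(pred_b, binary_dilation(gt_b, struct)).sum() / max(pred_b.sum(), 1)
    recall = np.logical_and(gt_b, binary_dilation(pred_b, struct)).sum() / max(gt_b.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def jf_score(pred_masks, gt_masks) -> float:
    """Video-level J&F: mean of per-frame J and per-frame F, averaged over frames."""
    j = np.mean([region_j(p, g) for p, g in zip(pred_masks, gt_masks)])
    f = np.mean([boundary_f(p, g) for p, g in zip(pred_masks, gt_masks)])
    return 0.5 * (j + f)
```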
7. Open Problems and Future Directions
Key avenues for advancing RVOS include:
- Richer Benchmarking and Dataset Diversity: Moving beyond trivial REs to evaluate motion, interaction, and long-video scenarios (Bellver et al., 2020, Liang et al., 19 May 2025).
- Temporal and Cross-modal Modeling: Improved fusion (e.g., graph, joint-global structures), adaptive token allocation and spatiotemporal compression (Zhang et al., 28 Sep 2025), and localized temporal reasoning to mitigate mask drift.
- Robustness and Negative Samples: Explicitly handling false (unpaired) video–text pairs and ensuring low false-positive rates (Li et al., 2022).
- Efficient Supervision and Annotation: Further reducing reliance on dense masks, exploring point-level or box-level labels, and leveraging language-vision pretraining for domain transfer (Zhao et al., 2023, Zhou et al., 17 May 2024).
- Leveraging LLMs and Hierarchical Reasoning: Training-free, compositional, and explainable RVOS by harnessing LLMs for semantic parsing and multi-stage candidate evaluation (Zhao et al., 6 Sep 2025).
- Scalable and Universal Frameworks: Unified models that adapt to variable-length, multi-object, and few-shot settings with robust generalization.
A plausible implication is that future RVOS systems will not only require advances in temporal and semantic reasoning but also hinge on scalable supervision, compositional language grounding, and the ability to operate efficiently under long-form video conditions and complex REs. The ongoing trend toward modular, prompt-based pipelines—disentangling localization, temporal association, and segmentation—suggests an increasing reliance on foundation models and flexible, interpretable reasoning strategies.