Video Instance Segmentation (VIS)
- Video Instance Segmentation (VIS) is a unified task requiring joint detection, segmentation, and tracking of object instances across video frames while handling occlusion, appearance changes, and reappearances.
- Benchmarks such as YouTube-VIS evaluate performance with spatio-temporal IoU and mean average precision (AP) metrics across diverse real-world challenges.
- Recent advances incorporate transformer architectures, box-supervised protocols, and contrastive learning to enhance accuracy and computational efficiency.
Video Instance Segmentation (VIS) is a unified video understanding task requiring joint detection, pixel-level segmentation, and trajectory-level tracking of all object instances over an entire video sequence. For each frame in a video, VIS algorithms must localize and segment every visible instance, assign semantic categories, and maintain correct instance IDs through complex temporal phenomena such as occlusion, appearance changes, new object arrivals, and long-term reappearances. Since its formal introduction and the release of the YouTube-VIS benchmark (Yang et al., 2019), the task has rapidly evolved in architectural methodology, optimization objectives, and annotation paradigms.
1. Problem Definition and Benchmark Foundations
The formal objective of Video Instance Segmentation is: given a video $\mathcal{V} = \{I_1, \dots, I_T\}$, segment and track each object instance $i$ over its time span $[t_s^i, t_e^i]$, yielding per-frame binary masks $\{m_t^i\}$ and a consistent category label $c^i$ per instance. Evaluation is governed by spatio-temporal intersection-over-union (IoU) and mean average precision (AP) across a range of thresholds. The YouTube-VIS dataset is the canonical benchmark (Yang et al., 2019), comprising 2,883 annotated videos, 40 object categories, and exhaustive per-instance tracks. Additional datasets (OVIS, Cityscapes-VPS, VIPSeg) provide coverage for heavier occlusion, panoptic classes, and long videos.
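To make the evaluation protocol concrete, the sketch below computes the spatio-temporal IoU between a predicted and a ground-truth instance track. The dict-of-frame-masks representation and function name are illustrative assumptions, not the official YouTube-VIS evaluation code.

```python
import numpy as np

def spatiotemporal_iou(pred_masks, gt_masks):
    """Spatio-temporal IoU between two instance tracks.

    Each argument maps frame index -> binary mask (H x W numpy array).
    Frames where a track is absent are simply missing from its dict;
    intersection and union are accumulated over the union of frames,
    mirroring the track-level IoU used by the YouTube-VIS metric.
    """
    inter, union = 0, 0
    for t in set(pred_masks) | set(gt_masks):
        p = pred_masks.get(t)
        g = gt_masks.get(t)
        if p is None:            # predicted track absent in this frame
            union += int(g.sum())
        elif g is None:          # ground-truth track absent in this frame
            union += int(p.sum())
        else:
            inter += int(np.logical_and(p, g).sum())
            union += int(np.logical_or(p, g).sum())
    return inter / union if union > 0 else 0.0

# Example: two short tracks on 4x4 frames.
pred = {0: np.ones((4, 4), bool), 1: np.zeros((4, 4), bool)}
gt   = {0: np.ones((4, 4), bool), 2: np.ones((4, 4), bool)}
print(spatiotemporal_iou(pred, gt))  # intersection 16, union 32 -> 0.5
```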
2. Supervision Regimes: Fully-Supervised vs. Box-Supervised VIS
Traditional VIS methods rely on full pixel-level masks, but the annotation burden has driven interest in weakly- (e.g., box-) supervised protocols (Yang et al., 22 Apr 2024). Box-supervised VIS leverages frame-level bounding box labels, which, although much cheaper to collect, lack direct mask supervision, thus requiring mask generation from box cues.
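The core trick behind most box-supervised pipelines is to supervise masks only through quantities a bounding box can constrain. Below is a minimal sketch of a BoxInst-style projection loss of the kind referenced for IDOL-BoxInst; the tensor shapes, dice formulation, and weighting are assumptions for illustration, not the exact published loss.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # Soft dice between two 1-D projection profiles.
    inter = (pred * target).sum(dim=-1)
    denom = pred.sum(dim=-1) + target.sum(dim=-1)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def box_projection_loss(mask_logits, box_masks):
    """Projection loss: the predicted soft mask, max-projected onto the
    x- and y-axes, should match the projections of the box mask.

    mask_logits: (N, H, W) raw mask predictions for N instances.
    box_masks:   (N, H, W) binary masks filled inside each GT box.
    """
    prob = mask_logits.sigmoid()
    loss_x = dice_loss(prob.max(dim=1).values, box_masks.max(dim=1).values)
    loss_y = dice_loss(prob.max(dim=2).values, box_masks.max(dim=2).values)
    return (loss_x + loss_y).mean()

# Toy usage: one instance, 8x8 frame, box covering rows/cols 2..5.
logits = torch.randn(1, 8, 8, requires_grad=True)
box = torch.zeros(1, 8, 8)
box[:, 2:6, 2:6] = 1.0
print(box_projection_loss(logits, box))
```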
PM-VIS: High-Performance Box-Supervised Video Instance Segmentation presents a two-stage pipeline for mask generation and refinement from boxes (Yang et al., 22 Apr 2024). Key stages:
- Pseudo Mask Generation: Three independent mask sources—HQ-SAM (Segment Anything Model with a quality token), IDOL-BoxInst (box-supervised model with projection and affinity losses), and DeAOT (VOS model for track-masks)—generate candidate instance masks.
- Mask Selection and Refinement: Optimization mechanisms (SCM, DOOB, SHQM) select, filter, and fuse the highest quality masks, producing a robust pseudo-label set.
- Data Filtering: Ground-truth instances with low pseudo-label correlation or missing pseudo-masks are removed, yielding a cleaner training set for fully supervised training (a schematic selection-and-filtering sketch follows this list).
- Architecture: PM-VIS augments IDOL-BoxInst with high-quality pseudo masks and incorporates hybrid losses that combine box-derived supervision with pseudo-mask supervision, closing the gap between box- and pixel-supervised VIS to 1–2 AP points.
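The following sketch illustrates the general selection-and-filtering idea behind such pipelines: candidate masks from several sources are scored against the ground-truth box and the best one is kept, while instances with no sufficiently consistent candidate are dropped. The scoring rule and function names are hypothetical simplifications, not the SCM/DOOB/SHQM procedures of PM-VIS.

```python
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def box_to_mask(box, shape):
    """Binary mask filled inside an (x1, y1, x2, y2) ground-truth box."""
    m = np.zeros(shape, bool)
    x1, y1, x2, y2 = box
    m[y1:y2, x1:x2] = True
    return m

def select_pseudo_mask(candidates, gt_box, shape, min_score=0.5):
    """Pick the candidate mask that best agrees with the GT box.

    `candidates` is a list of binary masks proposed by different sources
    (e.g. HQ-SAM, IDOL-BoxInst, DeAOT in PM-VIS). Agreement is scored by
    IoU against the box mask; instances whose best score falls below
    `min_score` are filtered out of the training set (returned as None).
    """
    box_mask = box_to_mask(gt_box, shape)
    scored = [(mask_iou(m, box_mask), m) for m in candidates]
    best_score, best_mask = max(scored, key=lambda s: s[0])
    return best_mask if best_score >= min_score else None

# Toy usage: two candidate masks for one instance in a 16x16 frame.
rng = np.random.default_rng(0)
cands = [rng.random((16, 16)) > 0.5, np.zeros((16, 16), bool)]
cands[1][4:12, 4:12] = True
print(select_pseudo_mask(cands, (4, 4, 12, 12), (16, 16)) is cands[1])  # True
```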
3. Core Methods and Algorithmic Strategies
Foundational methods have evolved from extensions of Mask R-CNN:
- MaskTrack R-CNN (Yang et al., 2019): Adds a tracking branch and online memory to Mask R-CNN, propagating instance IDs by combining appearance-embedding similarity with detection confidence, box IoU, and semantic-consistency cues (a simplified scoring sketch follows this list).
- Sequence-Level Paradigms: "Propose-Reduce" (Lin et al., 2021) avoids error accumulation in frame/clip-level merging by generating full-sequence proposals from multiple keyframes and then suppressing redundant sequences via sequence-level NMS.
- Patch Matching and Mask Selection: The Mask Selection Network (MSN) (Goel et al., 2021) employs a patch-based CNN discriminator to select between per-frame segmentation and mask-propagation candidates, improving mask quality in an online pipeline.
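The sketch below shows how online trackers of the MaskTrack R-CNN family can combine several cues into a single association score; the specific weights and the cosine-similarity formulation are assumptions about the general recipe, not the exact published scoring function.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def association_score(det, track, w_emb=1.0, w_iou=1.0, w_conf=0.5, w_cls=1.0):
    """Combine embedding similarity, box IoU, detection confidence, and
    class consistency into one matching score (MaskTrack R-CNN-style)."""
    emb_sim = float(np.dot(det["emb"], track["emb"]) /
                    (np.linalg.norm(det["emb"]) * np.linalg.norm(track["emb"])))
    iou = box_iou(det["box"], track["box"])
    cls_match = 1.0 if det["label"] == track["label"] else 0.0
    return w_emb * emb_sim + w_iou * iou + w_conf * det["score"] + w_cls * cls_match

# Toy usage: match one detection against two memory tracks.
det = {"emb": np.array([1.0, 0.0]), "box": (0, 0, 10, 10), "label": 3, "score": 0.9}
tracks = [
    {"emb": np.array([0.9, 0.1]), "box": (1, 1, 11, 11), "label": 3},
    {"emb": np.array([0.0, 1.0]), "box": (50, 50, 60, 60), "label": 7},
]
best = max(range(len(tracks)), key=lambda i: association_score(det, tracks[i]))
print("matched track:", best)  # expected: 0
```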
Transformer-Based Advancements:
- Inter-Frame Communication Transformers (IFC) (Hwang et al., 2021): Decomposes space-time attention with efficient memory tokens for inter-frame context propagation, sustaining high AP with reduced computation (a memory-token sketch follows this list).
- Deformable Transformer VIS (DeVIS) (Caelles et al., 2022): Uses temporal multi-scale deformable attention and instance-aware queries, integrating spatio-temporal reasoning with multi-scale mask heads for joint detection/segmentation/tracking.
- InstanceFormer (Koner et al., 2022): Efficient online transformer architecture with prior-propagation (representation/location/class), memory cross-attention for occlusion recovery, and temporal contrastive loss for embedding coherence.
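A minimal sketch of the memory-token idea: per-frame features communicate only through a small set of shared tokens rather than through full space-time attention. The module below uses standard PyTorch attention layers and is an illustrative approximation under stated assumptions, not the released IFC implementation.

```python
import torch
import torch.nn as nn

class MemoryTokenExchange(nn.Module):
    """Frames exchange information only through a few memory tokens:
    (1) each frame's memory tokens read from that frame's features,
    (2) tokens are averaged across frames to form a shared context,
    (3) each frame's features read the shared context back."""

    def __init__(self, dim=256, num_tokens=8, heads=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (T, N, C) -- T frames, N spatial tokens, C channels.
        T, N, C = frame_feats.shape
        mem = self.tokens.unsqueeze(0).expand(T, -1, -1)          # (T, M, C)
        mem, _ = self.read(mem, frame_feats, frame_feats)         # tokens read frames
        shared = mem.mean(dim=0, keepdim=True).expand(T, -1, -1)  # fuse across time
        out, _ = self.write(frame_feats, shared, shared)          # frames read context
        return frame_feats + out

# Toy usage: 5 frames with 196 spatial tokens of dimension 256.
feats = torch.randn(5, 196, 256)
print(MemoryTokenExchange()(feats).shape)  # torch.Size([5, 196, 256])
```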
4. Temporal Modeling and Contrastive Optimization
Accurate VIS requires learned representations that capture the temporal dependencies of object appearance and semantics:
- Temporal Pyramid Routing (TPR) (Li et al., 2021): Dynamically aligns multi-scale FPN features between frames using deformable convolutions and gating, aggregating pixel-level cues for robust mask and identity propagation.
- Decoupled VIS Framework (DVIS) (Zhang et al., 2023): Splits VIS into segmentation, tracking (via a referring denoising transformer), and refinement (temporal convolution, long-term self-attention, cross-attention), achieving superior accuracy and low compute overhead.
- Spatio-Temporal Contrastive Learning (STC) (Jiang et al., 2022) and Memory-Contrastive Methods: Bi-directional contrastive losses and temporal consistency regularize instance embeddings, ensuring robust identity tracking and smooth mask evolution.
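A minimal sketch of a temporal contrastive objective of this kind is given below: embeddings of the same instance in neighboring frames are pulled together while other instances in the clip act as negatives. The InfoNCE formulation, batching convention, and temperature are illustrative assumptions rather than any specific paper's loss.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(emb_t, emb_tp1, temperature=0.1):
    """InfoNCE over instance embeddings from two frames of the same clip.

    emb_t, emb_tp1: (K, D) embeddings of the same K instances, aligned by
    identity (row i in both tensors is the same object). Row i of frame t
    should be most similar to row i of frame t+1.
    """
    a = F.normalize(emb_t, dim=-1)
    b = F.normalize(emb_tp1, dim=-1)
    logits = a @ b.t() / temperature          # (K, K) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: match t -> t+1 and t+1 -> t.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 6 instances with 128-D embeddings.
e1 = torch.randn(6, 128, requires_grad=True)
e2 = e1.detach() + 0.05 * torch.randn(6, 128)   # slightly perturbed "next frame"
print(temporal_contrastive_loss(e1, e2))
```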
5. Appearance Modeling and Multi-Modal Fusion
Spatial cues alone are often insufficient for instance association under occlusion, intersection, or fast motion:
- VISAGE (Kim et al., 2023): Integrates pixel-level appearance embeddings (masked pooling over backbone features) with a memory bank and appearance-guided matching, yielding superior accuracy on both real and synthetic benchmarks that stress appearance reliance (a matching sketch follows this list).
- Robust Context Fusion (RCF) (Li et al., 2022): Fuses compressed reference context, target frame, and (optionally) audio embeddings via transformer encoder, enabling order-preserving instance codes for matching-free identity association and low-latency inference.
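The sketch below illustrates appearance-guided matching in this spirit: an instance embedding is obtained by mask-weighted pooling over backbone features and matched against a memory bank of track embeddings by cosine similarity. The pooling and matching details are simplifying assumptions, not the exact VISAGE design.

```python
import torch
import torch.nn.functional as F

def masked_pool(features, mask):
    """Mask-weighted average of backbone features.

    features: (C, H, W) feature map; mask: (H, W) soft or binary mask.
    Returns a (C,) appearance embedding for the instance.
    """
    w = mask.flatten().clamp(min=0)                       # (H*W,)
    feats = features.flatten(1)                           # (C, H*W)
    return (feats * w).sum(dim=1) / (w.sum() + 1e-6)

def match_to_memory(embedding, memory_bank):
    """Cosine-similarity matching against per-track embeddings.

    memory_bank: (num_tracks, C). Returns (best_track_id, similarity)."""
    sims = F.cosine_similarity(embedding.unsqueeze(0), memory_bank, dim=1)
    best = int(sims.argmax())
    return best, float(sims[best])

# Toy usage: one detected instance vs. a 3-track memory bank.
feat_map = torch.randn(256, 32, 32)
mask = torch.zeros(32, 32)
mask[8:20, 8:20] = 1.0
emb = masked_pool(feat_map, mask)
bank = torch.stack([emb + 0.01 * torch.randn(256),   # same object, slightly drifted
                    torch.randn(256), torch.randn(256)])
print(match_to_memory(emb, bank))                     # expected track 0
```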
6. Video Pre-Training, Temporal Consistency, and Open-World VIS
Temporal modeling has been pushed further through video-centric pre-training (Zhong et al., 22 Mar 2025):
- Pseudo-Video Augmentation: Simulates temporally consistent object motion on annotated images, using transformations and morph-splice operations to yield "video-like" supervision (see the sketch after this list).
- Multi-Scale Temporal Modules: Self/cross-attention and ConvGRU blocks capture short/long-term frame correlations, boosting temporal generalization and occlusion robustness.
- Open-World VIS (OW-VISFormer) (Thawakar et al., 2023): Introduces feature enrichment and a spatio-temporal objectness module to mine unknown objects, leveraging contrastive losses for flexible category discovery and incremental learning.
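The sketch below conveys the pseudo-video idea in its simplest form: one annotated image is turned into a short clip by applying consistent transformations to both the image and its instance mask, so identity labels across synthetic frames come for free. The translation-only augmentation and function name are toy assumptions; the cited work uses richer transformations and morph-splice operations.

```python
import numpy as np

def make_pseudo_video(image, inst_mask, num_frames=5, max_shift=4, seed=0):
    """Turn one annotated image into a short 'pseudo-video'.

    Each synthetic frame applies the same cumulative random translation to
    the image and its instance mask, so the object moves consistently and
    its identity is trivially known across frames, giving free track-level
    supervision from image-level annotations.
    """
    rng = np.random.default_rng(seed)
    frames, masks = [image], [inst_mask]
    dy, dx = 0, 0
    for _ in range(num_frames - 1):
        dy += rng.integers(-max_shift, max_shift + 1)
        dx += rng.integers(-max_shift, max_shift + 1)
        frames.append(np.roll(image, (dy, dx), axis=(0, 1)))
        masks.append(np.roll(inst_mask, (dy, dx), axis=(0, 1)))
    return np.stack(frames), np.stack(masks)

# Toy usage: a 64x64 grayscale image with one square instance.
img = np.zeros((64, 64), np.float32)
msk = np.zeros((64, 64), bool)
img[20:40, 20:40] = 1.0
msk[20:40, 20:40] = True
video, tracks = make_pseudo_video(img, msk)
print(video.shape, tracks.shape)  # (5, 64, 64) (5, 64, 64)
```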
7. Performance, Limitations, and Future Directions
Recent methods have steadily advanced the state of VIS:
| Method | AP (YTVIS19) | AP (YTVIS21) | AP (OVIS) | Supervision | Comment |
|---|---|---|---|---|---|
| MaskTrack R-CNN | 30.3 | 31.7 | 10.9 | Pixel | Two-stage, tracking head |
| MaskFreeVIS | 46.6 | 40.9 | 15.7 | Box | Box-supervised, mask-free training |
| PM-VIS (pseudo) | 48.7 | 44.6 | 27.8 | Box | High-quality pseudo masks |
| PM-VIS (filtered GT) | 50.0 | 47.7 | 29.9 | Pixel | Filtered ground-truth, SOTA pixel |
| DVIS (Swin-L, offline) | 64.9 | - | 49.9 | Pixel | Decoupled, light-weight |
| VISAGE | 55.1 | 51.6 | 36.2 | Pixel | Appearance-guided enhancement |
Current limitations include computational overhead for pseudo-label generation (Yang et al., 22 Apr 2024), dependence on mask quality, multi-stage training for refinement, challenges with extremely rapid motion and occlusion, and the difficulty of open-world or zero-shot instance adaptation. Future research is directed at fully end-to-end pseudo-mask generation, efficient spatio-temporal architectures, appearance-guided fusion with other modalities, extension to panoptic/3D segmentation tasks, and the design of pre-training regimes that explicitly model temporal coherence and track identity.
A plausible implication is that, as box-supervised VIS approaches such as PM-VIS reach accuracy parity with pixel-supervised methods, large-scale, cost-efficient video annotation will become practical, enabling further expansion of VIS benchmarks and application domains.