Video Panoptic Segmentation
- Video Panoptic Segmentation (VPS) is a comprehensive video analysis technique that assigns each pixel a semantic label and a temporally consistent instance ID for both objects and background.
- It employs advanced modules like pixel-level feature fusion and learned object tracking to maintain accuracy and robust cross-frame association.
- VPS drives innovations in autonomous driving, augmented reality, and video editing, supported by benchmark datasets such as VIPER and Cityscapes-VPS.
Video Panoptic Segmentation (VPS) is a unified video understanding task that extends panoptic segmentation to the temporal domain by jointly addressing semantic segmentation, instance segmentation, and tracking of all “thing” and “stuff” categories across video frames. The objective is to assign each video pixel a class label and a temporally consistent instance identifier, thereby segmenting both foreground objects (things) and background regions (stuff) and tracking object instances across time. VPS provides a holistic, temporally coherent, and fine-grained scene decomposition especially relevant to autonomous driving, augmented reality, video editing, and robotics.
1. Definition and Formulation
In VPS, for each frame in a sequence, the output is a panoptic segmentation map where every pixel is assigned:
- a semantic class label (for both things and stuff)
- an instance ID (for each “thing,” with IDs consistent across frames)
Let the prediction for frame $t$ be $\hat{P}_t = \{(c_i, z_i)\}_i$, where $c_i$ is the class label at pixel $i$ and $z_i$ is the instance identifier, with the constraint that for any object instance, the ID $z$ remains consistent across all frames in which the physical object appears.
The key challenge distinguishing VPS from image-based panoptic segmentation is the explicit requirement for temporal consistency of instance IDs over an entire video, forming object “tubes” in the spatiotemporal volume. This means each segmented object needs to be matched and tracked across frames, with identity maintained even under occlusion, deformation, or scene transformation.
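As a concrete illustration of this formulation, the minimal NumPy sketch below represents each frame's prediction as a semantic-label map plus an instance-ID map and verifies a necessary consistency condition: every "thing" instance ID maps to a single class across the clip. The array layout, class indices, and helper name are illustrative assumptions, not any dataset's or method's actual format.

```python
import numpy as np

# Hypothetical VPS output for a 2-frame clip: one (H, W) semantic map and one
# (H, W) instance-ID map per frame. In this toy convention, "stuff" pixels
# carry instance ID 0 and "thing" pixels carry a video-level ID that must
# refer to the same physical object in every frame where it appears.
T, H, W = 2, 4, 6
semantic = np.zeros((T, H, W), dtype=np.int32)   # class label per pixel
instance = np.zeros((T, H, W), dtype=np.int32)   # temporally consistent ID

ROAD, CAR = 7, 13             # assumed class indices, for illustration only
semantic[:] = ROAD            # background ("stuff") everywhere
semantic[0, 1:3, 1:4] = CAR
instance[0, 1:3, 1:4] = 5     # a car appears in frame 0 with video ID 5
semantic[1, 1:3, 2:5] = CAR
instance[1, 1:3, 2:5] = 5     # the same car keeps ID 5 after moving in frame 1

def ids_consistent(semantic, instance, thing_classes):
    """Necessary consistency check: each thing ID maps to one class across frames."""
    id_to_class = {}
    for t in range(semantic.shape[0]):
        mask = instance[t] > 0
        for cls, iid in zip(semantic[t][mask], instance[t][mask]):
            if int(cls) in thing_classes:
                if id_to_class.setdefault(int(iid), int(cls)) != int(cls):
                    return False
    return True

print(ids_consistent(semantic, instance, thing_classes={CAR}))  # True
```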
2. Benchmark Datasets
To encourage research in VPS, two benchmark datasets, one synthetic and one real, were introduced (Kim et al., 2020):
| Dataset | Source/Type | Categories | Annotation Scope | Unique Features | 
|---|---|---|---|---|
| VIPER | Synthetic (GTA-V engine) | 10 things, 13 stuff | 254K frames, high-res | Reorganized for VPS; dense COCO-style meta | 
| Cityscapes-VPS | Real urban street scenes | 8 things, 11 stuff | 500 videos, 1024×2048 | Extends Cityscapes to video; new semi-dense panoptic labels | 
VIPER was reorganized into video tubes using consistent instance mapping. Cityscapes-VPS generated new annotations not only for the original reference frames but also for every fifth frame, providing high-quality temporally linked pixel-wise labels for both dynamic (thing) and static (stuff) categories.
These datasets enable training and robust benchmarking for VPS on both synthetic and challenging real-world urban environments.
3. Network Architectures and Temporal Association
The canonical architecture for VPS is VPSNet (Kim et al., 2020). VPSNet extends image panoptic segmentation networks (such as UPSNet and those built on Mask R-CNN/FPN backbones) by jointly addressing both spatial and temporal cues:
- Pixel-Level Feature Fusion (Fuse module):
  - Extracts multi-scale image features using an FPN.
  - Aligns and warps reference-frame features to the target frame using optical flow (FlowNet2), producing an initial alignment.
  - Aggregates the temporally aligned, multi-scale features via a spatial-temporal attention (attend) block, producing enhanced fused features.
 
- Object-Level Tracking (Track module):
  - Region-of-interest (RoI) features from the target and reference frames are embedded with a Siamese head.
  - Cosine similarity between these embeddings builds an affinity matrix over object proposals, enabling learned matching of instance IDs across frames (a minimal matching sketch follows below).
  - During inference, matches are further validated using IoU between panoptic-head mask logits.
 
Training uses combined losses over semantic, instance, and tracking heads. The architecture allows both improved framewise segmentation and robust cross-frame instance association.
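The cross-frame association step of the Track module can be sketched as follows. The code assumes Siamese RoI embeddings are already available (random placeholders here), builds a cosine-similarity affinity matrix, and propagates instance IDs via Hungarian matching; the similarity threshold, function names, and handling of unmatched proposals are illustrative assumptions, and VPSNet's additional IoU validation against panoptic-head logits is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_affinity(ref_emb, tgt_emb, eps=1e-8):
    """Pairwise cosine similarity between reference and target RoI embeddings."""
    ref = ref_emb / (np.linalg.norm(ref_emb, axis=1, keepdims=True) + eps)
    tgt = tgt_emb / (np.linalg.norm(tgt_emb, axis=1, keepdims=True) + eps)
    return ref @ tgt.T                            # shape (num_ref, num_tgt)

def propagate_ids(ref_ids, affinity, min_sim=0.5, next_id=100):
    """Give each target proposal the ID of its matched reference proposal,
    or a fresh ID if no sufficiently similar reference exists (new object)."""
    row, col = linear_sum_assignment(-affinity)   # maximize total similarity
    tgt_ids = [-1] * affinity.shape[1]
    for r, c in zip(row, col):
        if affinity[r, c] >= min_sim:
            tgt_ids[c] = ref_ids[r]
    for c in range(len(tgt_ids)):
        if tgt_ids[c] == -1:                      # unmatched -> new instance
            tgt_ids[c] = next_id
            next_id += 1
    return tgt_ids

# Toy usage: 3 reference proposals with known IDs, 4 target proposals.
rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(3, 16))
tgt_emb = np.vstack([ref_emb + 0.05 * rng.normal(size=(3, 16)),  # same objects
                     rng.normal(size=(1, 16))])                  # a new object
print(propagate_ids([5, 8, 9], cosine_affinity(ref_emb, tgt_emb)))
# expected: the first three targets inherit IDs 5, 8, 9; the last gets a new ID
```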
Alternative architectures: Other lines of work explored set prediction transformer decoders (Ryu et al., 2021), object-centric panoptic slot learning (Zhou et al., 2021), and hybrid pixel/instance trackers (Ye et al., 2022), each adopting different temporal association and tracking strategies. Recent transformer-based decoupled pipelines, notably Mask2Former+DVIS/DVIS++ with ViT adapter backbones, have become dominant for large-scale VPS challenges.
4. Evaluation Metrics
Evaluating VPS requires metrics that reflect both frame-level segmentation quality and temporal consistency of instance assignments.
The paper introduces Video Panoptic Quality (VPQ) (Kim et al., 2020). For a temporal window of size $k$, VPQ is defined as:

$$\mathrm{VPQ}^k = \frac{1}{N_{\mathrm{classes}}} \sum_{c} \frac{\sum_{(u,\hat{u}) \in TP_c} \mathrm{IoU}(u,\hat{u})}{|TP_c| + \tfrac{1}{2}|FP_c| + \tfrac{1}{2}|FN_c|}$$

- $TP_c$, $FP_c$, and $FN_c$ are the true positive, false positive, and false negative tube pairs for category $c$, with matches requiring tube IoU $> 0.5$.
- A tube is the path of matched segments for one object or region across the $k$ consecutive frames of the window.
- VPQ is averaged over multiple window sizes $k$; a single-frame window recovers regular image PQ.
VPQ is stringent: a single missed or mismatched frame disrupts the tube, appropriately penalizing short-term tracking errors and enforcing high standards for both segmentation and cross-frame association.
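To make the metric concrete, the simplified sketch below scores one category over tubes represented as boolean (k, H, W) mask stacks, using a PQ-style greedy match at tube IoU > 0.5. It is an illustrative approximation rather than the official evaluation code; the full metric additionally averages over all categories and several window sizes.

```python
import numpy as np

def tube_iou(pred_tube, gt_tube):
    """IoU of two spatiotemporal tubes given as boolean arrays of shape (k, H, W)."""
    inter = np.logical_and(pred_tube, gt_tube).sum()
    union = np.logical_or(pred_tube, gt_tube).sum()
    return inter / union if union > 0 else 0.0

def vpq_for_class(pred_tubes, gt_tubes, thresh=0.5):
    """PQ-style score for one category: sum of matched IoUs over TP + FP/2 + FN/2.
    pred_tubes / gt_tubes: lists of (k, H, W) boolean masks for this category."""
    matched_gt, iou_sum, tp = set(), 0.0, 0
    for p in pred_tubes:
        best_iou, best_g = 0.0, None
        for g, gt in enumerate(gt_tubes):
            if g in matched_gt:
                continue
            iou = tube_iou(p, gt)
            if iou > best_iou:
                best_iou, best_g = iou, g
        if best_iou > thresh:            # PQ-style match criterion
            matched_gt.add(best_g)
            iou_sum += best_iou
            tp += 1
    fp = len(pred_tubes) - tp
    fn = len(gt_tubes) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom > 0 else 0.0

# Toy example: one class, a 2-frame window, 4x4 frames.
gt = np.zeros((2, 4, 4), dtype=bool); gt[:, 1:3, 1:3] = True
pred = np.zeros((2, 4, 4), dtype=bool); pred[:, 1:3, 1:4] = True  # slightly too wide
print(round(vpq_for_class([pred], [gt]), 3))  # tube IoU = 8/12, so ~0.667
```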
5. Experimental Findings and Design Tradeoffs
On Cityscapes-VPS and VIPER, VPSNet outperformed frame-based panoptic baselines in both PQ and VPQ (Kim et al., 2020).
- Fuse module (temporal feature fusion) increased PQ by ≈1%.
- Pretraining on VIPER further improved accuracy on Cityscapes-VPS.
- The full model (Fuse + learned Track) improved VPQ by up to 17% over per-frame baselines.
- Naive tracking strategies (classification scores, IoU-only) performed substantially worse than the learning-based Track module, which leverages both appearance and geometric cues.
Ablation showed:
- Temporal fusion benefits even per-frame segmentation, suggesting effective multi-frame context aggregation.
- Precise tracking affinity learning is more critical as temporal window size increases.
The modular design allows tradeoffs:
- Adding Fuse/Track modules increases computational cost but is essential for temporal stability.
- More aggressive temporal windows improve long-range consistency but can amplify error propagation if instance tracking is weak.
6. Applications and Public Contributions
VPS has significant impact on several domains where both spatial and temporal scene understanding are required:
- Autonomous driving: Enables temporally consistent labeling of infrastructure, vehicles, and pedestrians.
- Augmented reality: Secures object identities throughout a video for persistent overlays.
- Video editing: Facilitates object-level tracking and replacement over arbitrary video durations.
The public release of VIPER and Cityscapes-VPS, with established metrics and code (Kim et al., 2020), provides a standardized foundation for comparison and further research, which is critical for collaborative progress in the field.
7. Future Directions
Several avenues have emerged:
- Scaling architectures: Accommodating increased video lengths and higher-resolution streams without losing temporal coherence.
- Self-supervised/weak supervision: Reducing annotation cost by leveraging unlabeled or sparsely labeled video.
- Unified 3D scene recovery: Integrating VPS with geometric perception (e.g., monocular depth, visual odometry) for full 4D scene understanding (Qiao et al., 2020; Ye et al., 2022; Yang et al., 2025).
- Robustness: Handling occlusions, rare events, and camera artifacts in real-world video.
- Multi-modal fusion: Combining RGB, LiDAR, stereo, and other sensor inputs to further increase performance in challenging domains (Ayar et al., 2024).
Continued research explores end-to-end transformer architectures, scalable panoptic slots, and decoupled pipelines with strong backbone features—even up to foundation model scale (e.g., DINOv2 ViT-g + adapters)—reinforcing VPS as a frontier for holistic, temporally stable video scene interpretation.