PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

Published 26 Mar 2026 in cs.CV | (2603.25398v1)

Abstract: Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.

Summary

  • The paper presents a novel Plain Mask Transformer that leverages frozen Vision Transformer encoders paired with a lightweight decoder for efficient segmentation.
  • Empirical results reveal that PMT matches state-of-the-art accuracy with only a 0.3 PQ drop while delivering up to 3× faster inference for image and video segmentation.
  • The study highlights that decoupling task adaptation from the backbone enables multi-task scalability and practical deployment without retraining the frozen encoder.

PMT: A Plain Mask Transformer for Efficient Image and Video Segmentation with Frozen Vision Encoders

Introduction and Motivation

Vision Foundation Models (VFMs), particularly those based on the Vision Transformer (ViT) architecture and pre-trained on massive and diverse datasets (e.g., DINOv2, DINOv3), have become the standard backbone for a wide range of computer vision tasks. The modularity and representational power of these encoders suggest an ideal scenario in which a single frozen feature extractor can serve multiple tasks concurrently, enabling scalable deployment in real-world systems.

However, leading encoder-only segmentation architectures, most notably the Encoder-only Mask Transformer (EoMT) and its video extension VidEoMT, require end-to-end finetuning of the entire ViT encoder, because their query-injection mechanism is fundamentally incompatible with a frozen pre-trained backbone. This need for encoder finetuning prevents multi-task sharing and undermines the practical appeal of pre-trained VFMs.

The paper introduces the Plain Mask Transformer (PMT), a segmentation architecture that preserves the inference speed and architectural minimalism of encoder-only approaches but is fully compatible with frozen vision encoder backbones. This is accomplished through a lightweight Plain Mask Decoder (PMD), which processes segmentation queries on top of frozen encoder features, leveraging the rich representations produced by modern VFMs.

Methodology

Architectural Overview

PMT consists of two main components:

  1. Frozen VFM Encoder: A ViT pre-trained at scale (e.g., DINOv2, DINOv3), kept completely frozen during downstream task training.
  2. Plain Mask Decoder (PMD): A minimal stack of Transformer layers that mimics the attention and query-patch feature interaction of EoMT's last encoder blocks, but is applied entirely after the frozen encoder, never perturbing or augmenting the encoder's weights or behavior.

Segmentation queries (learnable tokens) are concatenated with the frozen encoder features and processed through this compact Transformer decoder. The decoder applies positional encoding via RoPE and aggregates multi-scale features through lateral connections from several encoder blocks, further improving segmentation accuracy without significant compute or parameter overhead.
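To make the query-patch interaction concrete, the following is a minimal NumPy sketch of the decoder idea described above: learnable queries are concatenated with frozen patch features, refined by a shallow stack of self-attention layers, and turned into per-query mask logits by a dot product with the patch tokens. This is an illustrative sketch only (single-head attention, shared weights across layers, no MLP, LayerNorm, RoPE, or lateral connections), not the authors' implementation; all names and sizes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pmd_layer(tokens, w_q, w_k, w_v):
    """One self-attention step over the joint [queries; patches] sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return tokens + attn @ v  # residual connection

rng = np.random.default_rng(0)
d, n_patches, n_queries = 64, 100, 8
patches = rng.normal(size=(n_patches, d))   # stand-in for frozen encoder output
queries = rng.normal(size=(n_queries, d))   # learnable segmentation queries
w_q, w_k, w_v = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))

tokens = np.concatenate([queries, patches])  # joint sequence; encoder untouched
for _ in range(4):                           # shallow decoder stack
    tokens = pmd_layer(tokens, w_q, w_k, w_v)

q_out, p_out = tokens[:n_queries], tokens[n_queries:]
mask_logits = q_out @ p_out.T                # one mask over patches per query
print(mask_logits.shape)  # (8, 100)
```

The key point the sketch captures is that all trainable interaction happens after the encoder: `patches` is treated as a constant, so the same frozen backbone could feed any number of such decoders.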

Compatibility with Image and Video Segmentation

PMT naturally generalizes from images to untrimmed videos. For video segmentation, temporal context is handled by propagating query representations across frames, using a query fusion mechanism akin to VidEoMT. No explicit tracking modules, temporal context aggregation, or re-identification layers are needed, preserving simplicity and speed.
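One plausible reading of this query-propagation scheme can be sketched as follows: per frame, the queries read the frozen frame features, predict masks, and a blended version of the refined queries is carried to the next frame so that object identity persists without a tracking module. The convex blend, the `alpha` weight, and all shapes are hypothetical illustrations, not the fusion rule actually used by PMT or VidEoMT.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_queries(prev_q, learned_q, alpha=0.5):
    """Hypothetical fusion: blend propagated queries with the learned ones."""
    return alpha * prev_q + (1.0 - alpha) * learned_q

rng = np.random.default_rng(0)
n_queries, d, n_frames = 8, 64, 5
learned_q = rng.normal(size=(n_queries, d))    # shared learnable queries

q = learned_q
per_frame_masks = []
for t in range(n_frames):
    frame_feats = rng.normal(size=(100, d))    # frozen per-frame features
    # stand-in for the PMD: one cross-attention read of the frame features
    attn = softmax(q @ frame_feats.T / np.sqrt(d))
    q_refined = q + attn @ frame_feats
    per_frame_masks.append(q_refined @ frame_feats.T)  # masks for frame t
    q = fuse_queries(q_refined, learned_q)     # carry identity to next frame

print(len(per_frame_masks), per_frame_masks[0].shape)  # 5 (8, 100)
```

Because the per-frame computation is just the image decoder plus a cheap query blend, the video path adds essentially no latency over the image path, which is consistent with the speedups reported below.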

Overcoming Encoder-Only Limitations

A critical contribution is the empirical demonstration that naive application of the encoder-only paradigm with a frozen VFM encoder yields catastrophic failure: frozen ViTs do not recognize injected queries, resulting in negligible segmentation performance. The introduction of the PMD compensates for this by moving all task-specific adaptation outside the frozen encoder, thereby restoring robustness, accuracy, and modularity.

Experimental Results

Image Segmentation Benchmarks

  • Datasets: COCO (panoptic/instance), ADE20K (semantic)
  • Backbones: DINOv3-Large (ViT-L), DINOv2-Large, ImageNet-21K/1K supervised ViTs
  • Key Results:
    • With a DINOv3 ViT-L backbone at 640×640 resolution, PMT matches the state-of-the-art frozen-encoder approach (ViT-Adapter + Mask2Former) within 0.3 PQ, while achieving up to 3× higher FPS (141 vs. 48).
    • As pretraining scale and model size grow, PMT consistently outpaces architectures with heavier task-specific decoders in both accuracy and speed.

Ablation Analysis

  • Decoder Layers: Increasing PMD depth beyond 6 layers yields negligible improvements, confirming the sufficiency of a shallow decoder stack.
  • Lateral Connections and RoPE: Both contribute minor but nontrivial boosts in segmentation accuracy, especially for weaker pre-trained backbones.
  • Model and Pretrain Scale: PMT is most effective when paired with large VFMs (DINOv3-L, etc.). Performance drops as backbone size and pretraining scale decrease, substantiating the hypothesis that the frozen encoder must output high-quality, generic features.
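The RoPE component from the ablations above rotates pairs of feature dimensions by position-dependent angles before attention, so relative position enters through the dot product. Below is a minimal 1-D sketch of the standard rotary encoding; the paper presumably uses a 2-D variant over the patch grid, and the `base` value and shapes here are generic defaults, not PMT's settings.

```python
import numpy as np

def rope_1d(x, base=10000.0):
    """Rotate consecutive dimension pairs of each token by per-position angles."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per dim pair
    angles = np.outer(np.arange(n), freqs)      # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.ones((4, 8))                             # 4 tokens, 8 dims
y = rope_1d(x)
# rotations are orthogonal, so token norms are preserved
print(np.allclose(np.linalg.norm(y, axis=-1), np.linalg.norm(x, axis=-1)))
```

Norm preservation is why RoPE can be bolted onto a decoder over frozen features without distorting their scale, which fits the ablation finding that it adds a small but consistent accuracy gain.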

Video Segmentation Benchmarks

  • VIS (YouTube-VIS): PMT achieves 69.2 mAP on YouTube-VIS 2019 with DINOv3, matching or exceeding CAVIS and VidEoMT, while being 8× faster than frozen-encoder baselines.
  • VPS (VIPSeg): PMT trails CAVIS slightly (VPQ lower by up to 1.3), but with large gains in inference speed and a favorable accuracy-efficiency trade-off.
  • VSS (VSPW): PMT sets a new state of the art for video semantic segmentation at 65.7 mIoU, surpassing all finetuned and frozen-encoder baselines despite using a frozen encoder.

Implications and Theoretical Perspectives

PMT demonstrates that, given sufficiently rich and generic representations from large pre-trained ViTs, nearly all architectural complexity required for segmentation can be isolated to a lightweight, external post-processing module. This has several notable implications:

  • Multi-task and Multi-dataset Scalability: A single VFM encoder can be reused for diverse segmentation tasks without task-specific finetuning, facilitating deployment in resource-constrained or latency-critical applications.
  • Decoupling Downstream Task Adaptation: Rapid downstream adaptation is possible by altering or augmenting only the PMD, leaving the frozen backbone untouched, further streamlining model management.
  • Efficiency and Pareto Frontier Shifts: Across both image and video domains, PMT shifts the accuracy-latency Pareto frontier toward faster models, with no meaningful loss in segmentation quality provided the encoder is sufficiently large.

Theoretically, these results reinforce the notion that the inductive biases and knowledge distilled into VFMs at pre-training time are now powerful enough to make downstream task specialization almost entirely decoupled from the encoder. As VFMs continue to scale and improve, one can expect the auxiliary segmentation head (PMD) to become even more compact and universally applicable.

Future Directions

  • Scaling Laws and Pretraining: Systematic exploration of PMT performance as a function of VFM scale, diversity of pretraining, and domain shift is warranted.
  • Extension Beyond Segmentation: The PMT paradigm may generalize to other dense prediction tasks (e.g., keypoint/pose estimation, depth prediction) that similarly benefit from strong frozen visual backbones.
  • Dynamic and Efficient Decoders: Investigating methods for further compressing or dynamically allocating capacity in the PMD, possibly enabling even higher throughput or lower memory footprints.

Conclusion

The Plain Mask Transformer provides a practical, efficient, and general framework for both image and video segmentation on top of frozen Vision Foundation Models. By isolating task-specific adaptation to a shallow Transformer decoder and leveraging lateral connections with multi-scale frozen features, PMT simultaneously achieves state-of-the-art accuracy and unprecedented inference speed. Importantly, it enables true multi-task sharing and amortization of pretraining cost for downstream vision applications. The results signal a paradigm shift in segmentation modeling, driven by the maturing representational quality of VFMs and the move towards maximally reusable visual backbones.

Reference: "PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders" (2603.25398)
