- The paper presents a novel Plain Mask Transformer that leverages frozen Vision Transformer encoders paired with a lightweight decoder for efficient segmentation.
- Empirical results reveal that PMT matches state-of-the-art accuracy to within 0.3 PQ while delivering up to 3× faster inference for image and video segmentation.
- The study highlights that decoupling task adaptation from the backbone enables multi-task scalability and practical deployment without retraining the frozen encoder.
Introduction and Motivation
Vision Foundation Models (VFMs), particularly those based on the Vision Transformer (ViT) architecture and pre-trained on massive and diverse datasets (e.g., DINOv2, DINOv3), have become the standard backbone for a wide range of computer vision tasks. The modularity and representational power of these encoders suggest an ideal scenario in which a single frozen feature extractor can serve multiple tasks concurrently, enabling scalable deployment in real-world systems.
However, leading encoder-only segmentation architectures, most notably the Encoder-only Mask Transformer (EoMT) and its video extension VidEoMT, require end-to-end finetuning of the entire ViT encoder, as their query-injection mechanism is fundamentally incompatible with a frozen pre-trained backbone. This need for encoder finetuning impedes multi-task sharing and hinders practical deployment of pre-trained VFMs.
The paper introduces the Plain Mask Transformer (PMT), a segmentation architecture that preserves the inference speed and architectural minimalism of encoder-only approaches but is fully compatible with frozen vision encoder backbones. This is accomplished through a lightweight Plain Mask Decoder (PMD), which processes segmentation queries on top of frozen encoder features, leveraging the rich representations produced by modern VFMs.
Methodology
Architectural Overview
PMT consists of two main components:
- Frozen VFM Encoder: A ViT pre-trained at scale (e.g., DINOv2, DINOv3), kept completely frozen during downstream task training.
- Plain Mask Decoder (PMD): A minimal stack of Transformer layers that mimics the attention and query-patch feature interaction of EoMT's last encoder blocks, but applied entirely after the frozen encoder, never perturbing or augmenting encoder weights or behavior.
Segmentation queries (learnable tokens) are concatenated with frozen encoder features and processed through this compact Transformer decoder. The architecture supports positional encoding via RoPE in the decoder and aggregates multi-scale features using lateral connections from several encoder blocks, further enhancing segmentation accuracy without incurring significant computation or parameter overhead.
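The query-feature interaction described above can be sketched as follows. This is a toy, untrained NumPy illustration under assumed sizes (100 queries, 196 patch tokens, dimension 64) with single-head attention and random weights; it is not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(x, d):
    # Single-head self-attention with random, untrained weights,
    # purely to illustrate the joint query/patch token mixing.
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return x + attn @ v  # residual connection

def plain_mask_decoder(patch_feats, queries, num_layers=6):
    # Queries and frozen patch features interact jointly, mimicking
    # EoMT's final blocks, but entirely outside the frozen encoder.
    n_q, d = queries.shape
    x = np.concatenate([queries, patch_feats], axis=0)
    for _ in range(num_layers):
        x = attention_block(x, d)
    return x[:n_q], x[n_q:]  # refined queries, refined patch features

# Stand-in for frozen encoder output: 196 patch tokens of dimension 64.
patch_feats = np.random.default_rng(1).standard_normal((196, 64))
queries = np.random.default_rng(2).standard_normal((100, 64))
q_out, p_out = plain_mask_decoder(patch_feats, queries)
# Mask logits via dot product between each query and each patch token.
mask_logits = q_out @ p_out.T  # shape (100, 196)
```

A real PMD would add feed-forward sublayers, normalization, RoPE, and trained weights; the point here is only that all task-specific computation happens after the frozen encoder.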
Compatibility with Image and Video Segmentation
PMT naturally generalizes from images to untrimmed videos. For video segmentation, temporal context is handled by propagating query representations across frames, using a query fusion mechanism akin to VidEoMT. No explicit tracking modules, temporal context aggregation, or re-identification layers are needed, preserving simplicity and speed.
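The query propagation described above can be sketched as a simple blend between shared learnable queries and the previous frame's refined queries. The blending weight `alpha` and the per-frame refinement stand-in below are hypothetical simplifications, not taken from the paper:

```python
import numpy as np

def fuse_queries(learned_queries, prev_queries, alpha=0.5):
    """Blend shared learnable queries with queries propagated from the
    previous frame (a stand-in for VidEoMT-style query fusion)."""
    if prev_queries is None:          # first frame: no temporal context yet
        return learned_queries.copy()
    return alpha * learned_queries + (1.0 - alpha) * prev_queries

rng = np.random.default_rng(0)
learned = rng.standard_normal((100, 64))   # shared learnable query tokens
prev = None
video_queries = []
for frame_feats in rng.standard_normal((5, 196, 64)):  # 5-frame toy clip
    q = fuse_queries(learned, prev)
    # Stand-in for the Plain Mask Decoder refining q against this frame's
    # frozen encoder features; a real model would run full attention here.
    q = q + 0.1 * np.tanh(frame_feats.mean(axis=0))
    prev = q                                # propagate to the next frame
    video_queries.append(q)
```

Because identity is carried implicitly by the propagated queries, no explicit tracking or re-identification module is required.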
Overcoming Encoder-Only Limitations
A critical contribution is the empirical demonstration that naive application of the encoder-only paradigm with a frozen VFM encoder yields catastrophic failure: frozen ViTs do not recognize injected queries, resulting in negligible segmentation performance. The introduction of the PMD compensates for this by moving all task-specific adaptation outside the frozen encoder, thereby restoring robustness, accuracy, and modularity.
Experimental Results
Image Segmentation Benchmarks
- Datasets: COCO (panoptic/instance), ADE20K (semantic)
- Backbones: DINOv3-Large (ViT-L), DINOv2-Large, ImageNet-21K/1K supervised ViTs
- Key Results:
  - With DINOv3-L/ViT-L at 640×640, PMT matches the state-of-the-art frozen-encoder approach (ViT-Adapter + Mask2Former) with a mere 0.3 PQ drop, while achieving up to 3× higher FPS (141 vs. 48).
  - PMT consistently outpaces architectures with task-specific decoders in both accuracy and speed, with the gap widening as pretraining scale and model size grow.
Ablation Analysis
- Decoder Layers: Increasing PMD depth beyond 6 layers yields negligible improvements, confirming the sufficiency of a shallow decoder stack.
- Lateral Connections and RoPE: Both contribute minor but nontrivial boosts in segmentation accuracy, especially for weaker pre-trained backbones.
- Model and Pretrain Scale: PMT is most effective when paired with large VFMs (DINOv3-L, etc.). Performance drops as backbone size and pretraining scale decrease, substantiating the hypothesis that the frozen encoder must output high-quality, generic features.
Video Segmentation Benchmarks
- VIS (YouTube-VIS): PMT achieves an mAP of 69.2 on YouTube-VIS 2019 with DINOv3, matching or exceeding CAVIS and VidEoMT, while being 8× faster than frozen-encoder baselines.
- VPS (VIPSeg): PMT shows a small performance gap vs. CAVIS (VPQ lower by up to 1.3) but delivers massive gains in inference speed and a favorable accuracy-efficiency trade-off.
- VSS (VSPW): PMT reaches a new state-of-the-art mIoU of 65.7 for video semantic segmentation, even while using a frozen encoder, surpassing all finetuned and frozen-encoder baselines.
Implications and Theoretical Perspectives
PMT demonstrates that, given sufficiently rich and generic representations from large pre-trained ViTs, nearly all architectural complexity required for segmentation can be isolated to a lightweight, external post-processing module. This has several notable implications:
- Multi-task and Multi-dataset Scalability: A single VFM encoder can be reused for diverse segmentation tasks without task-specific finetuning, facilitating deployment in resource-constrained or latency-critical applications.
- Decoupling Downstream Task Adaptation: Rapid downstream adaptation is possible by altering or augmenting only the PMD, leaving the frozen backbone untouched, further streamlining model management.
- Efficiency and Pareto Frontier Shifts: Across both image and video domains, PMT consistently shifts the accuracy-latency Pareto frontier in favor of faster models, with no meaningful loss in segmentation quality provided encoder scale is sufficient.
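The multi-task sharing pattern described above can be sketched as one frozen encoder pass reused by several lightweight task heads. All names, sizes, and class counts below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def frozen_encoder(image_tokens):
    # Stand-in for a frozen ViT: a fixed projection that is never updated.
    W = np.random.default_rng(42).standard_normal((64, 64)) / 8.0
    return image_tokens @ W

def make_decoder_head(out_dim, seed):
    # Each task gets its own tiny trainable head on shared frozen features.
    W = np.random.default_rng(seed).standard_normal((64, out_dim)) / 8.0
    return lambda feats: feats @ W

tokens = np.random.default_rng(0).standard_normal((196, 64))
feats = frozen_encoder(tokens)        # computed once, shared by all tasks

panoptic_head = make_decoder_head(133, seed=1)  # e.g. COCO panoptic classes
semantic_head = make_decoder_head(150, seed=2)  # e.g. ADE20K classes
panoptic_logits = panoptic_head(feats)
semantic_logits = semantic_head(feats)
```

Swapping or adding a task then means training and deploying only a new head, while the expensive encoder pass is amortized across all tasks.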
Theoretically, these results reinforce the notion that the inductive biases and knowledge distilled into VFMs during pre-training are now powerful enough that downstream task specialization can be almost entirely decoupled from the encoder. As VFMs continue to scale and improve, the auxiliary segmentation head (PMD) can be expected to become even more compact and universally applicable.
Future Directions
- Scaling Laws and Pretraining: Systematic exploration of PMT performance as a function of VFM scale, diversity of pretraining, and domain shift is warranted.
- Extension Beyond Segmentation: The PMT paradigm may generalize to other dense prediction tasks (e.g., keypoint/pose estimation, depth prediction) that similarly benefit from strong frozen visual backbones.
- Dynamic and Efficient Decoders: Investigating methods for further compressing or dynamically allocating capacity in the PMD, possibly enabling even higher throughput or lower memory footprints.
Conclusion
The Plain Mask Transformer provides a practical, efficient, and general framework for both image and video segmentation on top of frozen Vision Foundation Models. By isolating task-specific adaptation to a shallow Transformer decoder and leveraging lateral connections with multi-scale frozen features, PMT simultaneously achieves state-of-the-art accuracy and unprecedented inference speed. Importantly, it enables true multi-task sharing and amortization of pretraining cost for downstream vision applications. The results signal a paradigm shift in segmentation modeling, driven by the maturing representational quality of VFMs and the move towards maximally reusable visual backbones.
Reference: "PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders" (2603.25398)