- The paper presents a novel Plain Mask Transformer that leverages frozen Vision Transformer encoders paired with a lightweight decoder for efficient segmentation.
- Empirical results reveal that PMT matches state-of-the-art accuracy to within 0.3 PQ while delivering up to 3× faster inference for image and video segmentation.
- The study highlights that decoupling task adaptation from the backbone enables multi-task scalability and practical deployment without retraining the frozen encoder.
Introduction and Motivation
Vision Foundation Models (VFMs), particularly those based on the Vision Transformer (ViT) architecture and pre-trained on massive and diverse datasets (e.g., DINOv2, DINOv3), have become the standard backbone for a wide range of computer vision tasks. The modularity and representational power of these encoders suggest an ideal scenario in which a single frozen feature extractor can serve multiple tasks concurrently, enabling scalable deployment in real-world systems.
However, leading encoder-only segmentation architectures, most notably the Encoder-only Mask Transformer (EoMT) and its video extension VidEoMT, require end-to-end finetuning of the entire ViT encoder, as their query-injection mechanism is fundamentally incompatible with a frozen pre-trained backbone. This need for encoder finetuning impedes multi-task sharing and hinders practical deployment of pre-trained VFMs.
The paper introduces the Plain Mask Transformer (PMT), a segmentation architecture that preserves the inference speed and architectural minimalism of encoder-only approaches but is fully compatible with frozen vision encoder backbones. This is accomplished through a lightweight Plain Mask Decoder (PMD), which processes segmentation queries on top of frozen encoder features, leveraging the rich representations produced by modern VFMs.
Methodology
Architectural Overview
PMT consists of two main components:
- Frozen VFM Encoder: A ViT pre-trained at scale (e.g., DINOv2, DINOv3), kept completely frozen during downstream task training.
- Plain Mask Decoder (PMD): A minimal stack of Transformer layers that mimics the attention and query-patch feature interaction of EoMT's last encoder blocks, but applied entirely after the frozen encoder, never perturbing or augmenting encoder weights or behavior.
Segmentation queries (learnable tokens) are concatenated with frozen encoder features and processed through this compact Transformer decoder. The architecture supports positional encoding via RoPE in the decoder and aggregates multi-scale features using lateral connections from several encoder blocks, further enhancing segmentation accuracy without incurring significant computation or parameter overhead.
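The query-feature interaction described above can be sketched as follows. This is a toy, untrained NumPy illustration under assumed sizes (100 queries, 196 patch tokens, dimension 64) with single-head attention and random weights; it is not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(x, d):
    # Single-head self-attention with random, untrained weights,
    # purely to illustrate the joint query/patch token mixing.
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return x + attn @ v  # residual connection

def plain_mask_decoder(patch_feats, queries, num_layers=6):
    # Queries and frozen patch features interact jointly, mimicking
    # EoMT's final blocks, but entirely outside the frozen encoder.
    n_q, d = queries.shape
    x = np.concatenate([queries, patch_feats], axis=0)
    for _ in range(num_layers):
        x = attention_block(x, d)
    return x[:n_q], x[n_q:]  # refined queries, refined patch features

# Stand-in for frozen encoder output: 196 patch tokens of dimension 64.
patch_feats = np.random.default_rng(1).standard_normal((196, 64))
queries = np.random.default_rng(2).standard_normal((100, 64))
q_out, p_out = plain_mask_decoder(patch_feats, queries)
# Mask logits via dot product between each query and each patch token.
mask_logits = q_out @ p_out.T  # shape (100, 196)
```

A real PMD would add feed-forward sublayers, normalization, RoPE, and trained weights; the point here is only that all task-specific computation happens after the frozen encoder.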
Compatibility with Image and Video Segmentation
PMT naturally generalizes from images to untrimmed videos. For video segmentation, temporal context is handled by propagating query representations across frames, using a query fusion mechanism akin to VidEoMT. No explicit tracking modules, temporal context aggregation, or re-identification layers are needed, preserving simplicity and speed.
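The query propagation described above can be sketched as a simple blend between shared learnable queries and the previous frame's refined queries. The blending weight `alpha` and the per-frame refinement stand-in below are hypothetical simplifications, not taken from the paper:

```python
import numpy as np

def fuse_queries(learned_queries, prev_queries, alpha=0.5):
    """Blend shared learnable queries with queries propagated from the
    previous frame (a stand-in for VidEoMT-style query fusion)."""
    if prev_queries is None:          # first frame: no temporal context yet
        return learned_queries.copy()
    return alpha * learned_queries + (1.0 - alpha) * prev_queries

rng = np.random.default_rng(0)
learned = rng.standard_normal((100, 64))   # shared learnable query tokens
prev = None
video_queries = []
for frame_feats in rng.standard_normal((5, 196, 64)):  # 5-frame toy clip
    q = fuse_queries(learned, prev)
    # Stand-in for the Plain Mask Decoder refining q against this frame's
    # frozen encoder features; a real model would run full attention here.
    q = q + 0.1 * np.tanh(frame_feats.mean(axis=0))
    prev = q                                # propagate to the next frame
    video_queries.append(q)
```

Because identity is carried implicitly by the propagated queries, no explicit tracking or re-identification module is required.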
Overcoming Encoder-Only Limitations
A critical contribution is the empirical demonstration that naive application of the encoder-only paradigm with a frozen VFM encoder yields catastrophic failure: frozen ViTs do not recognize injected queries, resulting in negligible segmentation performance. The introduction of the PMD compensates for this by moving all task-specific adaptation outside the frozen encoder, thereby restoring robustness, accuracy, and modularity.
Experimental Results
Image Segmentation Benchmarks
- Datasets: COCO (panoptic/instance), ADE20K (semantic)
- Backbones: DINOv3-Large (ViT-L), DINOv2-Large, ImageNet-21K/1K supervised ViTs
- Key Results:
  - With DINOv3-L/ViT-L at 640×640, PMT matches the state-of-the-art frozen-encoder approach (ViT-Adapter + Mask2Former) with a mere 0.3 PQ drop, while achieving up to 3× higher FPS (141 vs. 48).
  - PMT consistently outpaces architectures with task-specific decoders in both accuracy and speed, with the gap widening as pretraining scale and model size grow.
Ablation Analysis
- Decoder Layers: Increasing PMD depth beyond 6 layers yields negligible improvements, confirming the sufficiency of a shallow decoder stack.
- Lateral Connections and RoPE: Both contribute minor but nontrivial boosts in segmentation accuracy, especially for weaker pre-trained backbones.
- Model and Pretrain Scale: PMT is most effective when paired with large VFMs (DINOv3-L, etc.). Performance drops as backbone size and pretraining scale decrease, substantiating the hypothesis that the frozen encoder must output high-quality, generic features.
Video Segmentation Benchmarks
- VIS (YouTube-VIS): PMT achieves an mAP of 69.2 on YouTube-VIS 2019 with DINOv3, matching or exceeding CAVIS and VidEoMT, while being 8× faster than frozen-encoder baselines.
- VPS (VIPSeg): PMT shows a small performance gap vs. CAVIS (VPQ lower by up to 1.3) but delivers massive gains in inference speed and a favorable accuracy-efficiency trade-off.
- VSS (VSPW): PMT reaches a new state-of-the-art mIoU of 65.7 for video semantic segmentation, even while using a frozen encoder, surpassing all finetuned and frozen-encoder baselines.
Implications and Theoretical Perspectives
PMT demonstrates that, given sufficiently rich and generic representations from large pre-trained ViTs, nearly all architectural complexity required for segmentation can be isolated to a lightweight, external post-processing module. This has several notable implications:
- Multi-task and Multi-dataset Scalability: A single VFM encoder can be reused for diverse segmentation tasks without task-specific finetuning, facilitating deployment in resource-constrained or latency-critical applications.
- Decoupling Downstream Task Adaptation: Rapid downstream adaptation is possible by altering or augmenting only the PMD, leaving the frozen backbone untouched, further streamlining model management.
- Efficiency and Pareto Frontier Shifts: Across both image and video domains, PMT consistently shifts the accuracy-latency Pareto frontier in favor of faster models, with no meaningful loss in segmentation quality provided encoder scale is sufficient.
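The multi-task sharing pattern described above can be sketched as one frozen encoder pass reused by several lightweight task heads. All names, sizes, and class counts below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def frozen_encoder(image_tokens):
    # Stand-in for a frozen ViT: a fixed projection that is never updated.
    W = np.random.default_rng(42).standard_normal((64, 64)) / 8.0
    return image_tokens @ W

def make_decoder_head(out_dim, seed):
    # Each task gets its own tiny trainable head on shared frozen features.
    W = np.random.default_rng(seed).standard_normal((64, out_dim)) / 8.0
    return lambda feats: feats @ W

tokens = np.random.default_rng(0).standard_normal((196, 64))
feats = frozen_encoder(tokens)        # computed once, shared by all tasks

panoptic_head = make_decoder_head(133, seed=1)  # e.g. COCO panoptic classes
semantic_head = make_decoder_head(150, seed=2)  # e.g. ADE20K classes
panoptic_logits = panoptic_head(feats)
semantic_logits = semantic_head(feats)
```

Swapping or adding a task then means training and deploying only a new head, while the expensive encoder pass is amortized across all tasks.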
Theoretically, these results reinforce the notion that the inductive biases and knowledge distilled into VFMs during pre-training are now powerful enough that downstream task specialization can be almost entirely decoupled from the encoder. As VFMs continue to scale and improve, the auxiliary segmentation head (PMD) can be expected to become even more compact and universally applicable.
Future Directions
- Scaling Laws and Pretraining: Systematic exploration of PMT performance as a function of VFM scale, diversity of pretraining, and domain shift is warranted.
- Extension Beyond Segmentation: The PMT paradigm may generalize to other dense prediction tasks (e.g., keypoint/pose estimation, depth prediction) that similarly benefit from strong frozen visual backbones.
- Dynamic and Efficient Decoders: Investigating methods for further compressing or dynamically allocating capacity in the PMD, possibly enabling even higher throughput or lower memory footprints.
Conclusion
The Plain Mask Transformer provides a practical, efficient, and general framework for both image and video segmentation on top of frozen Vision Foundation Models. By isolating task-specific adaptation to a shallow Transformer decoder and leveraging lateral connections with multi-scale frozen features, PMT simultaneously achieves state-of-the-art accuracy and unprecedented inference speed. Importantly, it enables true multi-task sharing and amortization of pretraining cost for downstream vision applications. The results signal a paradigm shift in segmentation modeling, driven by the maturing representational quality of VFMs and the move towards maximally reusable visual backbones.
Reference: "PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders" (2603.25398)