Encoder-only Mask Transformer (EoMT)
- Encoder-only Mask Transformer (EoMT) is an image segmentation architecture that repurposes a plain Vision Transformer with minimal modifications and extensive masked-image-modeling pre-training.
- It eliminates the need for convolutional adapters, pixel decoders, and separate transformer decoders, drastically reducing computational overhead and boosting inference speed.
- Empirical results show that with sufficient MIM pre-training, EoMT achieves segmentation accuracy nearly on par with complex decoder models while significantly enhancing throughput.
The Encoder-only Mask Transformer (EoMT) is an architectural paradigm for image segmentation that reuses a plain Vision Transformer (ViT) backbone with minimal additional components, eschewing the traditional reliance on convolutional adapters, pixel decoders, and separate Transformer decoders. EoMT demonstrates that, provided sufficiently large models and extensive masked-image-modeling (MIM) pre-training, the ViT architecture itself can learn the inductive biases required for dense prediction tasks such as semantic, panoptic, and instance segmentation. This design achieves segmentation accuracies on par with state-of-the-art models while significantly reducing computational complexity and increasing inference speed.
1. Architectural Description
EoMT modifies the standard ViT architecture in a systematic but minimal fashion. The input image is divided into patches, each embedded with a linear projection and positional encoding. The ViT stack is partitioned into two segments: the initial encoder blocks process patch tokens only; learnable object query tokens are then concatenated to the patch tokens, and the augmented token set is processed jointly by the final ViT blocks using multi-head self-attention (MHSA) over all tokens.
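As a concrete illustration, the following minimal PyTorch sketch (not the reference implementation) captures this two-segment token flow. It uses `nn.TransformerEncoderLayer` as a stand-in for a ViT block; the class name `EoMTSketch` and all sizes are illustrative.

```python
# Minimal sketch of the EoMT token flow (not the reference implementation).
# nn.TransformerEncoderLayer stands in for a ViT block; sizes are illustrative.
import torch
import torch.nn as nn

class EoMTSketch(nn.Module):
    def __init__(self, dim=256, depth=8, last_blocks=2, num_queries=100,
                 num_heads=8, patch=16, img_size=224):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.queries = nn.Parameter(torch.zeros(1, num_queries, dim))
        make_block = lambda: nn.TransformerEncoderLayer(
            dim, num_heads, dim * 4, activation="gelu",
            batch_first=True, norm_first=True)
        self.blocks_patch_only = nn.ModuleList(
            [make_block() for _ in range(depth - last_blocks)])
        self.blocks_joint = nn.ModuleList(
            [make_block() for _ in range(last_blocks)])

    def forward(self, images):
        # Patch embedding + positional encoding -> [B, N, D]
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_embed
        for blk in self.blocks_patch_only:   # initial blocks: patch tokens only
            x = blk(x)
        q = self.queries.expand(x.size(0), -1, -1)
        x = torch.cat([q, x], dim=1)         # prepend learnable object queries
        for blk in self.blocks_joint:        # final blocks: joint MHSA over all tokens
            x = blk(x)
        return x[:, :q.size(1)], x[:, q.size(1):]   # (query tokens, patch tokens)

queries, patches = EoMTSketch()(torch.randn(2, 3, 224, 224))  # example usage
```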
For dense prediction, each query token $\mathbf{q}_i$ generates a class prediction and a mask prediction. Class logits are computed via a linear layer:

$$\mathbf{c}_i = \mathrm{Linear}(\mathbf{q}_i) \in \mathbb{R}^{K+1},$$

where $K$ is the number of classes (plus one "no object" class). The mask prediction utilizes an MLP to obtain a mask embedding $\mathbf{m}_i = \mathrm{MLP}(\mathbf{q}_i) \in \mathbb{R}^{D}$. The patch tokens are upsampled to a resolution of $\tfrac{H}{4} \times \tfrac{W}{4}$, denoted $\mathbf{X}_{\mathrm{up}} \in \mathbb{R}^{D \times \frac{H}{4} \times \frac{W}{4}}$, and the predicted mask logits are given by the dot product:

$$\mathbf{M}_i = \mathbf{m}_i^{\top} \mathbf{X}_{\mathrm{up}} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4}}.$$
Significantly, query-to-patch cross-attention and query-to-query self-attention both occur within the standard ViT MHSA, obviating the need for separate cross-attention modules. During training, EoMT can mimic Mask2Former's masked cross-attention by blocking attention from each query to spatial positions where its intermediate mask prediction is low; at inference, it uses standard unmasked MHSA for efficiency.
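A hedged sketch of the prediction heads described above follows, assuming a simple two-layer MLP for the mask embedding and two transposed convolutions as the patch upsampler (the exact upsampling module and the class count are illustrative assumptions):

```python
# Hedged sketch of the prediction heads: a linear class head and an MLP mask
# head whose embedding is dotted with upsampled patch features. The two
# transposed convolutions used as the upsampler are an illustrative choice.
import torch
import torch.nn as nn

class SegmentationHeadsSketch(nn.Module):
    def __init__(self, dim=256, num_classes=133):   # 133 = COCO panoptic classes (illustrative)
        super().__init__()
        self.class_head = nn.Linear(dim, num_classes + 1)          # +1 for "no object"
        self.mask_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.upsample = nn.Sequential(                             # 1/16 -> 1/4 resolution
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2))

    def forward(self, query_tokens, patch_tokens, grid_hw):
        B, N, D = patch_tokens.shape
        h, w = grid_hw                                             # patch grid, e.g. (H/16, W/16)
        class_logits = self.class_head(query_tokens)               # [B, Q, K+1]
        mask_embed = self.mask_mlp(query_tokens)                   # [B, Q, D]
        feat = patch_tokens.transpose(1, 2).reshape(B, D, h, w)    # [B, D, H/16, W/16]
        feat = self.upsample(feat)                                 # [B, D, H/4, W/4]
        mask_logits = torch.einsum("bqd,bdhw->bqhw", mask_embed, feat)
        return class_logits, mask_logits
```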
2. Pre-training and Inductive Bias Acquisition
The capability of EoMT to obviate dedicated segmentation-specific architectural components depends critically on large-scale self-supervised or masked-image-modeling pre-training. Comparisons in the source work include DINOv2 (self-supervised, MIM-style), EVA-02 (weakly supervised, MIM-style), and the supervised DeiT-III variants (ImageNet-21K and ImageNet-1K). Only the MIM-based Vision Foundation Models (VFMs), specifically DINOv2 and EVA-02, impart sufficient “segmentation bias” for EoMT to match the performance of heavily engineered decoder architectures; supervised-only pre-training leaves a significant performance deficit.
Training Losses
EoMT leverages the composite loss formulation from Mask2Former, comprising:
- Binary Cross-Entropy (BCE) Mask Loss: $\mathcal{L}_{\mathrm{bce}}$, applied pixel-wise to the predicted mask logits.
- Dice Loss: $\mathcal{L}_{\mathrm{dice}}$, measuring overlap between predicted and ground-truth masks.
- Cross-Entropy Class Loss: $\mathcal{L}_{\mathrm{cls}}$, applied to the per-query class logits.
The total training loss is a weighted sum:

$$\mathcal{L} = \lambda_{\mathrm{bce}} \mathcal{L}_{\mathrm{bce}} + \lambda_{\mathrm{dice}} \mathcal{L}_{\mathrm{dice}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}},$$

with the loss weights adopted from Mask2Former ($\lambda_{\mathrm{bce}} = 5$, $\lambda_{\mathrm{dice}} = 5$, $\lambda_{\mathrm{cls}} = 2$).
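The sketch below illustrates how these terms could be combined. It assumes Hungarian matching between queries and ground-truth segments has already been performed (as in Mask2Former), and uses the Mask2Former weights quoted above; function names are illustrative.

```python
# Sketch of the composite loss on already-matched (prediction, target) pairs;
# Hungarian matching between queries and ground-truth segments is assumed to
# have been done beforehand, as in Mask2Former. Weights follow Mask2Former.
import torch
import torch.nn.functional as F

def dice_loss(mask_logits, target_masks, eps=1.0):
    p = mask_logits.sigmoid().flatten(1)              # [M, H*W] predicted probabilities
    t = target_masks.flatten(1)                       # [M, H*W] binary targets
    numerator = 2 * (p * t).sum(-1)
    denominator = p.sum(-1) + t.sum(-1)
    return (1 - (numerator + eps) / (denominator + eps)).mean()

def eomt_loss(class_logits, mask_logits, target_classes, target_masks,
              w_bce=5.0, w_dice=5.0, w_cls=2.0):
    l_bce = F.binary_cross_entropy_with_logits(mask_logits, target_masks.float())
    l_dice = dice_loss(mask_logits, target_masks.float())
    l_cls = F.cross_entropy(class_logits, target_classes)
    return w_bce * l_bce + w_dice * l_dice + w_cls * l_cls
```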
3. Computational Complexity and Throughput
The removal of adapters, pixel decoders, multi-scale feature pyramids, and Transformer decoders eliminates considerable computational overhead and enables the use of highly optimized MHSA implementations, such as FlashAttention 2. The following table presents representative results on COCO val2017 (640×640 input):
| Model Configuration | GFLOPs | FPS | PQ |
|---|---|---|---|
| ViT-Adapter + Mask2Former (ViT-L) | 830 | 29 | 57.1 |
| EoMT (ViT-L, mask annealing, inference) | 669 | 128 | 56.0 |
| ViT-Adapter + Mask2Former (ViT-B) | 347 | 32 | 54.4 |
| EoMT (ViT-B) | 216 | 261 | 50.6 |
| ViT-Adapter + Mask2Former (ViT-S) | 165 | 33 | 50.5 |
| EoMT (ViT-S) | 68 | 330 | 44.7 |
With ViT-L, EoMT achieves roughly a 4.4× increase in speed over ViT-Adapter + Mask2Former with only a 1.1 PQ drop. Across model scales, the FLOP savings translate to consistently higher FPS, with the accuracy gap shrinking as model capacity increases.
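For context, throughput figures of this kind are typically obtained with a synchronized GPU timing loop; the following is a minimal, hedged measurement sketch (assumes a CUDA device; not the benchmarking code used in the source work):

```python
# Minimal, hedged throughput-measurement sketch (assumes a CUDA device); not
# the benchmarking code used in the source work.
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 640, 640), warmup=10, iters=50):
    model = model.eval().cuda()
    x = torch.randn(*input_size, device="cuda")
    for _ in range(warmup):            # warm-up iterations (kernel selection, caches)
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```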
The asymptotic complexity is as follows (a numerical check follows the list):
- Standard ViT: $O(N^2 d)$ self-attention cost for $N$ patch tokens of dimension $d$.
- EoMT: $O((N+Q)^2 d)$ in the final joint blocks, where the number of queries $Q \ll N$, so the overhead over a plain ViT is small.
- Mask2Former: incurs additional cost on top of the backbone from the multi-scale pixel decoder and the masked cross-attention layers of its Transformer decoder.
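As a numerical check of the EoMT overhead, under the illustrative assumption of 200 query tokens for a 640×640 input with 16×16 patches:

```python
# Back-of-the-envelope check of the query-token overhead in the final blocks,
# with illustrative values: a 640x640 input, 16x16 patches, 200 queries.
patch_size = 16
N = (640 // patch_size) ** 2        # 1600 patch tokens
Q = 200                             # object queries (illustrative count)
print(f"relative attention cost: {((N + Q) ** 2) / (N ** 2):.2f}x")
# -> about 1.27x, and only in the final joint blocks; earlier blocks are unchanged.
```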
4. Segmentation Quality and Scaling
EoMT narrows the segmentation performance gap with heavily engineered decoders as model scale and pre-training robustness increase. The following trends are observed:
| ViT Size | Adapter+M2F PQ | EoMT PQ | EoMT Speed-up (vs. Adapter+M2F) |
|---|---|---|---|
| S | 50.5 | 44.7 | ×10 |
| B | 54.4 | 50.6 | ×8 |
| L | 57.1 | 56.0 | ×4.4 |
| g | 57.7 | 57.0 | ×2.7 |
A similar pattern is documented for semantic mIoU (Cityscapes / ADE20K) and instance AP (COCO). As model size increases, EoMT's drop in segmentation quality versus state-of-the-art decoder-heavy methods diminishes to 1–2 PQ points, while throughput remains substantially higher.
5. Practical Recommendations and Design Guidelines
Empirical analysis yields several critical recommendations:
- The ViT backbone, given sufficient MIM pre-training (e.g., DINOv2), is inherently capable of learning dense semantic representations, rendering complex task-specific decoding machinery largely superfluous.
- The minimal architectural changes required are:
  - Partition the ViT stack into an initial group of blocks and a small final group of blocks.
  - Append learnable object query tokens after the initial group.
  - Jointly process patches and queries through the final blocks.
  - Add a compact MLP + dot-product mask head and a linear class head.
- During training, employ “mask annealing”: use masked attention for the queries in the final blocks initially, gradually reduce the masking probability block by block over training, and disable mask-based attention entirely at inference for optimal speed (an illustrative schedule is sketched after this list).
- The training setup includes the AdamW optimizer with layer-wise learning-rate decay of $0.8$, polynomial learning-rate decay (power $0.9$), staged warm-ups, data augmentations (random flip, scale jitter, color jitter), and query counts and final-block counts chosen per task (panoptic/instance vs. semantic) and per ViT size (ViT-B, ViT-L, ViT-g).
- Compute resources are best allocated to scaling the ViT backbone and increasing the scale and quality of self-supervised pre-training, not to architectural complexity in the decoder pipeline.
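As referenced above, the mask-annealing schedule can be sketched as follows; this is an illustrative staggered schedule, not the authors' exact recipe:

```python
# Illustrative mask-annealing schedule (not the authors' exact recipe): each
# final joint block starts with fully masked attention for the queries, and its
# masking probability is annealed to zero at a staggered point in training.
def mask_probability(step, total_steps, block_idx, num_final_blocks):
    """Probability of applying masked attention in final block `block_idx`."""
    anneal_len = total_steps / num_final_blocks
    start = block_idx * anneal_len          # later blocks keep masking longer
    progress = (step - start) / anneal_len
    return float(min(1.0, max(0.0, 1.0 - progress)))

# Example: with 4 final blocks and 100k steps, block 0 stops masking at 25k
# steps and block 3 only at 100k; at inference, all blocks use plain MHSA.
```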
6. Implications for Segmentation Model Design
The findings associated with EoMT establish that modern, large-scale MIM-pretrained ViT backbones eliminate the necessity for intricate decoder stacks in segmentation tasks. The empirical evidence supports a paradigm in which architectural simplicity, end-to-end trainability, and reliance on robust pre-training yield models that are significantly faster and nearly as accurate as their more complex counterparts. This prompts a reconsideration of the prevailing practice of architectural proliferation in segmentation models, highlighting the value of backbone-centric design and pre-training resource allocation.