Encoder-only Mask Transformer (EoMT)

Updated 16 November 2025
  • Encoder-only Mask Transformer (EoMT) is an image segmentation architecture that repurposes a plain Vision Transformer with minimal modifications and extensive masked-image-modeling pre-training.
  • It eliminates the need for convolutional adapters, pixel decoders, and separate transformer decoders, drastically reducing computational overhead and boosting inference speed.
  • Empirical results show that with sufficient MIM pre-training, EoMT achieves segmentation accuracy nearly on par with complex decoder models while significantly enhancing throughput.

The Encoder-only Mask Transformer (EoMT) is an architectural paradigm for image segmentation that reuses a plain Vision Transformer (ViT) backbone with minimal additional components, eschewing the traditional reliance on convolutional adapters, pixel decoders, and separate Transformer decoders. EoMT demonstrates that, provided sufficiently large models and extensive masked-image-modeling (MIM) pre-training, the ViT architecture itself can learn the inductive biases required for dense prediction tasks such as semantic, panoptic, and instance segmentation. This design achieves segmentation accuracies on par with state-of-the-art models while significantly reducing computational complexity and increasing inference speed.

1. Architectural Description

EoMT modifies the standard ViT architecture in a systematic but minimal fashion. The input image is divided into patches, each embedded with a linear projection and positional encoding. The model is partitioned into two segments: the initial $L_1$ ViT encoder blocks, followed by the concatenation of $K$ learnable object query tokens to the $N$ patch tokens. The augmented token set is processed jointly in the final $L_2$ ViT blocks, utilizing multi-head self-attention (MHSA) over all $N+K$ tokens.

For dense prediction, each query token $q^i$ generates a class prediction and a mask prediction. Class logits are computed via a linear layer:

$$c^i = W_p q^i \in \mathbb{R}^C$$

where $C$ is the number of classes. The mask prediction utilizes an MLP to obtain a mask embedding $\hat{q}^i = \mathrm{MLP}(q^i) \in \mathbb{R}^d$. The patch tokens are upsampled to a resolution of $H/4 \times W/4$, denoted $F_4 \in \mathbb{R}^{d \times (H/4 \cdot W/4)}$, and the predicted mask logits are given by the dot product:

$$m^i(x, y) = \langle \hat{q}^i, F_4(:, x, y) \rangle = (\hat{q}^i)^\top F_4(:, x, y)$$

Significantly, query-to-patch cross-attention and query-to-query self-attention both occur within the standard ViT MHSA, obviating the need for separate cross-attention modules. During training, EoMT can mimic Mask2Former's masked cross-attention by selectively blocking attention from query $i$ to spatial positions $(x, y)$ with low $m^i(x, y)$ activations, but at inference it uses standard unmasked MHSA for efficiency.
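
To make the minimal-modification claim concrete, the following is a hedged PyTorch-style sketch of the EoMT forward pass. The class name `EoMTSketch`, the bilinear upsampling of patch tokens, and the two-layer mask MLP are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EoMTSketch(nn.Module):
    """Hedged sketch of EoMT: a plain ViT split into L1 + L2 blocks plus two small heads."""

    def __init__(self, vit_blocks, embed_dim=1024, num_queries=200, num_classes=133,
                 num_l2_blocks=4, patch_grid=(40, 40)):
        super().__init__()
        self.blocks_l1 = nn.ModuleList(vit_blocks[:-num_l2_blocks])  # first L1 blocks: patches only
        self.blocks_l2 = nn.ModuleList(vit_blocks[-num_l2_blocks:])  # last L2 blocks: patches + queries
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)  # K learnable queries
        self.class_head = nn.Linear(embed_dim, num_classes + 1)      # +1 "no object" logit
        self.mask_mlp = nn.Sequential(                               # mask-embedding MLP
            nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim))
        self.patch_grid = patch_grid                                 # (H/16, W/16), assumed patch size

    def forward(self, patch_tokens):
        """patch_tokens: (B, N, D) embedded patches with positional encoding already applied."""
        B, N, D = patch_tokens.shape
        x = patch_tokens
        for blk in self.blocks_l1:                    # standard ViT encoding of patch tokens
            x = blk(x)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([q, x], dim=1)                  # queries occupy the first K token slots
        for blk in self.blocks_l2:                    # joint MHSA over all N + K tokens
            x = blk(x)
        K = self.queries.shape[0]
        q, patches = x[:, :K], x[:, K:]

        class_logits = self.class_head(q)             # c^i = W_p q^i
        mask_embed = self.mask_mlp(q)                 # \hat{q}^i = MLP(q^i)

        h, w = self.patch_grid
        feat = patches.transpose(1, 2).reshape(B, D, h, w)
        feat = F.interpolate(feat, scale_factor=4.0,  # upsample patch features to H/4 x W/4
                             mode="bilinear", align_corners=False)
        mask_logits = torch.einsum("bqd,bdhw->bqhw", mask_embed, feat)  # dot-product mask logits
        return class_logits, mask_logits
```

With ViT-L, the defaults `num_l2_blocks=4` and `num_queries=200` correspond to the configuration listed in Section 5.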

2. Pre-training and Inductive Bias Acquisition

The capability of EoMT to obviate dedicated segmentation-specific architectural components depends critically on large-scale self-supervised or masked-image-modeling-based pre-training. Comparisons in the source work include the use of DINOv2 (self-supervised, MIM-style), EVA-02 (weakly supervised, MIM-style), and the supervised DeiT-III variants (ImageNet-21K and ImageNet-1K). Only the MIM-based Vision Foundation Models (VFMs), specifically DINOv2 and EVA-02, impart sufficient “segmentation bias” for EoMT to match the performance of heavily engineered decoder architectures. Supervised-only pre-training leaves a significant performance deficit.
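
As a concrete, hedged illustration of the initialization step, a DINOv2 backbone can be pulled from `torch.hub` and its plain transformer blocks reused as the $L_1 + L_2$ stack; the attribute names below follow the public DINOv2 repository, and pairing it with `EoMTSketch` above is an assumption for illustration.

```python
import torch

# Load a MIM-style pre-trained ViT-L/14 backbone (DINOv2) from the public hub repository.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")

# Reuse its plain transformer blocks as the L1 + L2 stack of the sketch above.
# (Pairing with EoMTSketch and the default patch-grid setting are illustrative assumptions;
# the actual grid depends on input resolution and the backbone's patch size.)
model = EoMTSketch(vit_blocks=list(backbone.blocks), embed_dim=backbone.embed_dim)
```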

Training Losses

EoMT leverages the composite loss formulation from Mask2Former, comprising:

  • Binary Cross-Entropy (BCE) Mask Loss:

$$\mathcal{L}_{\mathrm{bce}} = -\frac{1}{HW} \sum_{x, y} \Bigl[ y_i(x, y) \log \hat{m}^i(x, y) + \bigl(1 - y_i(x, y)\bigr) \log\bigl(1 - \hat{m}^i(x, y)\bigr) \Bigr]$$

  • Dice Loss:

$$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{2 \sum_{x, y} y_i(x, y)\, \hat{m}^i(x, y)}{\sum_{x, y} y_i(x, y) + \sum_{x, y} \hat{m}^i(x, y) + \epsilon}$$

  • Cross-Entropy Class Loss:

$$\mathcal{L}_{\mathrm{ce}} = - \sum_c y^c \log \mathrm{Softmax}(c^i)_c$$

The total training loss is a weighted sum:

$$\mathcal{L}_{\mathrm{tot}} = \lambda_{\mathrm{bce}} \mathcal{L}_{\mathrm{bce}} + \lambda_{\mathrm{dice}} \mathcal{L}_{\mathrm{dice}} + \lambda_{\mathrm{ce}} \mathcal{L}_{\mathrm{ce}}$$

with $\lambda_{\mathrm{bce}} = 5.0$, $\lambda_{\mathrm{dice}} = 5.0$, and $\lambda_{\mathrm{ce}} = 2.0$.
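
The following is a minimal sketch of this composite loss for a single matched query-target pair, assuming sigmoid mask probabilities and the weights stated above; the function name and tensor shapes are illustrative. In the full pipeline the terms are computed per query after Hungarian matching between the $K$ predictions and the ground-truth segments, as in Mask2Former.

```python
import torch
import torch.nn.functional as F

def eomt_loss_sketch(mask_logits, class_logits, target_mask, target_class,
                     w_bce=5.0, w_dice=5.0, w_ce=2.0, eps=1.0):
    """Composite Mask2Former-style loss for one matched query (illustrative shapes).

    mask_logits:  (H, W) raw mask logits m^i
    class_logits: (C+1,) raw class logits c^i, including a "no object" category
    target_mask:  (H, W) binary float ground-truth mask y_i
    target_class: scalar long tensor holding the ground-truth class index
    """
    mask_prob = mask_logits.sigmoid()

    # Binary cross-entropy mask loss, averaged over all H*W positions (the 1/HW factor).
    l_bce = F.binary_cross_entropy(mask_prob, target_mask.float())

    # Dice loss: 1 - 2*|y ∩ m| / (|y| + |m| + eps).
    inter = (mask_prob * target_mask).sum()
    l_dice = 1.0 - 2.0 * inter / (target_mask.sum() + mask_prob.sum() + eps)

    # Cross-entropy class loss over the C classes plus "no object".
    l_ce = F.cross_entropy(class_logits.unsqueeze(0), target_class.view(1))

    return w_bce * l_bce + w_dice * l_dice + w_ce * l_ce
```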

3. Computational Complexity and Throughput

The removal of adapters, pixel decoders, multi-scale pyramids, and transformer decoders eliminates considerable computational overhead and enables the use of highly optimized MHSA implementations, such as FlashAttention 2. The following table presents representative results on COCO val 2017 (640$\times$640 input):

| Model Configuration | GFLOPs | FPS | PQ |
|---|---|---|---|
| ViT-Adapter + Mask2Former (ViT-L) | 830 | 29 | 57.1 |
| EoMT (ViT-L, mask annealing, inference) | 669 | 128 | 56.0 |
| ViT-Adapter + Mask2Former (ViT-B) | 347 | 32 | 50.5 |
| EoMT (ViT-B) | 216 | 261 | 50.6 |
| ViT-Adapter + Mask2Former (ViT-S) | 165 | 33 | 50.5 |
| EoMT (ViT-S) | 68 | 330 | 44.7 |

With ViT-L, EoMT achieves a $4\times$ increase in speed over Mask2Former with only a 1.1 PQ drop. Across model scales, the FLOP savings translate into consistently higher FPS, and the accuracy gap relative to decoder-heavy baselines narrows as model capacity increases.

The asymptotic complexity is as follows:

  • Standard ViT: $\mathcal{O}(N^2 d)$ for $N$ patch tokens.
  • EoMT: $\mathcal{O}((N+K)^2 d)$, where $K \ll N$.
  • Mask2Former: incurs an additional $\mathcal{O}(NCH)$ due to pixel decoding and masked cross-attention.
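
A quick back-of-the-envelope check of the $K \ll N$ claim (the 640$\times$640 input above and a patch size of 16 are assumptions for illustration):

```python
# Rough attention-cost ratio for joint MHSA with queries vs. patches only.
N = (640 // 16) ** 2            # 1600 patch tokens at 640x640 with 16x16 patches (assumed)
K = 200                         # learnable object queries (panoptic/instance setting)
ratio = (N + K) ** 2 / N ** 2   # quadratic self-attention cost grows by this factor
print(f"joint MHSA costs ~{ratio:.2f}x the patch-only attention")  # -> ~1.27x
```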

4. Segmentation Quality and Scaling

EoMT narrows the segmentation performance gap with heavily engineered decoders as model scale and pre-training robustness increase. The following trends are observed:

| ViT Size | Adapter+M2F PQ | EoMT PQ | EoMT speedup (FPS) |
|---|---|---|---|
| S | 50.5 | 44.7 | $\times$10 |
| B | 54.4 | 50.6 | $\times$8 |
| L | 57.1 | 56.0 | $\times$4.4 |
| g | 57.7 | 57.0 | $\times$2.7 |

A similar pattern is documented for semantic mIoU (Cityscapes / ADE20K) and instance AP (COCO). As model size increases, EoMT's drop in segmentation quality versus state-of-the-art decoder-heavy methods diminishes to 1–2 PQ points, while throughput remains substantially higher.

5. Practical Recommendations and Design Guidelines

Empirical analysis yields several critical recommendations:

  1. The ViT backbone, given sufficient MIM pre-training (e.g., DINOv2), is inherently capable of learning dense semantic representations, rendering complex task-specific decoding machinery largely superfluous.
  2. The minimal architectural changes required are:
    • Partition the ViT stack into $L_1 + L_2$ blocks.
    • Append $K$ learnable object query tokens after block $L_1$.
    • Jointly process patches and queries through the last $L_2$ blocks.
    • Add a compact MLP + dot-product mask head and a linear class head.
  3. During training, employ “mask annealing”: utilize masked self-attention in the $L_2$ blocks initially, gradually reduce the masking probability per block, and disable mask-based attention at inference for optimal speed (see the sketch after this list).
  4. Training setup includes the AdamW optimizer (lr $1 \times 10^{-4}$, layer-wise decay $0.8$), polynomial lr decay (power $0.9$), staged warm-ups, data augmentations (random flip, scale jitter, color jitter), and specific query and final-block counts per configuration ($K=200$ panoptic/instance, $K=100$ semantic; $L_2=4$ for ViT-L, $L_2=3$ for ViT-B, $L_2=5$ for ViT-g).
  5. Compute resources are best allocated to scaling the ViT backbone and increasing the scale and quality of self-supervised pre-training, not to architectural complexity in the decoder pipeline.
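
A minimal sketch of how the mask-annealing attention mask could be constructed, assuming an additive attention mask applied in the $L_2$ blocks and a single linear annealing schedule; the function name, the schedule shape, and thresholding at zero logits are illustrative simplifications rather than the paper's exact recipe.

```python
import torch

def annealed_attention_mask(mask_logits, step, total_steps, num_patches, threshold=0.0):
    """Additive attention mask for the L2 blocks during training (illustrative sketch).

    mask_logits: (B, K, h, w) intermediate mask predictions at patch resolution (h*w == N)
    Returns a (B, 1, K+N, K+N) float mask to broadcast over heads, or None once annealed away.
    """
    # Single linear schedule, a simplification of the per-block annealing described above.
    keep_prob = max(0.0, 1.0 - step / total_steps)
    if float(torch.rand(())) > keep_prob:
        return None                                     # plain unmasked MHSA, matching inference

    B, K, h, w = mask_logits.shape
    N = num_patches
    assert h * w == N
    blocked = mask_logits.flatten(2) <= threshold       # (B, K, N): low m^i(x, y) positions
    neg_inf = torch.finfo(mask_logits.dtype).min
    attn_mask = torch.zeros(B, K + N, K + N,
                            dtype=mask_logits.dtype, device=mask_logits.device)
    # Queries occupy the first K token slots (as in the forward-pass sketch above);
    # block attention from query i to patch positions with low mask activation.
    attn_mask[:, :K, K:] = blocked.to(mask_logits.dtype) * neg_inf
    return attn_mask.unsqueeze(1)
```

Once the schedule has annealed to zero, the function always returns `None`, so training matches the unmasked MHSA used at inference and benefits from the optimized attention kernels noted in Section 3.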

6. Implications for Segmentation Model Design

The findings associated with EoMT establish that modern, large-scale MIM-pretrained ViT backbones eliminate the necessity for intricate decoder stacks in segmentation tasks. The empirical evidence supports a paradigm in which architectural simplicity, end-to-end trainability, and reliance on robust pre-training yield models that are significantly faster and nearly as accurate as their more complex counterparts. This prompts a reconsideration of the prevailing practice of architectural proliferation in segmentation models, highlighting the value of backbone-centric design and pre-training resource allocation.
