OneFormer: Unified Transformer for Segmentation

Updated 7 December 2025
  • The paper introduces a unified transformer model that outperforms individually trained, task-specialized Mask2Former models and achieves state-of-the-art results on semantic, instance, and panoptic segmentation.
  • OneFormer conditions its queries on a task text prompt, so a single trained model can switch output domains at inference; its pixel decoder fuses backbone features with multi-scale deformable attention.
  • The framework reduces computational resources by consolidating three separate models into one 219M-parameter architecture validated on ADE20K, Cityscapes, and COCO.

OneFormer is a single-model, transformer-based framework for universal image segmentation that achieves state-of-the-art performance on semantic, instance, and panoptic segmentation tasks within a unified architecture and training process. Unlike prior methods that require specialized models and separately trained architectures for each segmentation paradigm, OneFormer introduces a train-once, task-conditioned approach capable of dynamically switching its output domain simply by varying a text prompt at inference. OneFormer is the first segmentation model to simultaneously outperform three specialized Mask2Former models trained individually on ADE20K, Cityscapes, and COCO, using a single set of model parameters and substantially reduced computational resources (Jain et al., 2022).

1. Unified Architecture and Segmentation Pipeline

OneFormer uses an interchangeable backbone (Swin-L, ConvNeXt-L, or DiNAT-L) to extract multi-scale features at 1/4, 1/8, 1/16, and 1/32 of the input resolution. These features are fused by a pixel decoder based on multi-scale deformable attention, which upsamples them to a 1/4-scale feature map $F_{1/4}$. The framework is conditioned on a text prompt, “the task is {panoptic/instance/semantic}”, which is tokenized and mapped by a small transformer to a $D$-dimensional task token $\mathbf{Q}_{\text{task}}$.
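
The task-conditioning step amounts to a small text mapper: the tokenized prompt is embedded, passed through a lightweight transformer encoder, and pooled into a single task token. The sketch below is a minimal illustration; the vocabulary size, depth, head count, and mean pooling are assumptions for demonstration, not the paper's exact text mapper.

```python
import torch
import torch.nn as nn

class TaskTokenEncoder(nn.Module):
    """Minimal sketch: map the tokenized prompt "the task is {task}" to a
    D-dimensional task token Q_task. Vocabulary size, depth, heads, and the
    mean-pooling step are illustrative assumptions."""
    def __init__(self, vocab_size=30522, seq_len=77, dim=256, depth=2, heads=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, token_ids):                 # (B, seq_len) integer token ids
        x = self.token_emb(token_ids) + self.pos_emb
        x = self.encoder(x)                       # (B, seq_len, dim)
        return x.mean(dim=1)                      # pooled task token Q_task, shape (B, dim)

# toy usage: a batch of two tokenized prompts
tokens = torch.randint(0, 30522, (2, 77))
q_task = TaskTokenEncoder()(tokens)               # (2, 256)
```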

Object queries are initialized by repeating $\mathbf{Q}_{\text{task}}$ and refined with self- and cross-attention, yielding image-aware, task-conditioned queries. These are concatenated with $\mathbf{Q}_{\text{task}}$ itself to form $N$ task-conditioned object queries $\mathbf{Q}$. A multi-layer transformer decoder alternates masked cross-attention, self-attention, and MLP operations, with the queries attending to the multi-scale pixel features. Each decoder output head predicts $(K+1)$-class scores (with $K$ the number of dataset classes and “+1” for no-object) and binary masks, computed as an einsum between the learned query embeddings and $F_{1/4}$. The inference process mirrors Mask2Former, including thresholding and post-processing, and yields panoptic, instance, or semantic segmentation outputs depending on the prompt (Jain et al., 2022).
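
The prediction heads can be made concrete with a short sketch: each decoder query produces $K+1$ class logits through a linear layer, and binary mask logits through an einsum of the query embedding with the 1/4-scale pixel features. The helper below uses hypothetical names and toy shapes.

```python
import torch
import torch.nn as nn

def predict_masks_and_classes(query_embed, pixel_feat, class_head):
    """Minimal sketch of the per-query prediction heads (names are hypothetical).

    query_embed: (B, N, D) task-conditioned query embeddings from the decoder
    pixel_feat:  (B, D, H, W) 1/4-scale pixel-decoder features F_{1/4}
    class_head:  nn.Linear(D, K + 1) giving K+1 class logits per query
    """
    class_logits = class_head(query_embed)                          # (B, N, K+1)
    mask_logits = torch.einsum("bnd,bdhw->bnhw", query_embed, pixel_feat)
    return class_logits, mask_logits

# toy usage with illustrative shapes
B, N, D, K, H, W = 2, 150, 256, 133, 64, 64
cls_logits, mask_logits = predict_masks_and_classes(
    torch.randn(B, N, D), torch.randn(B, D, H, W), nn.Linear(D, K + 1))
print(cls_logits.shape, mask_logits.shape)   # (2, 150, 134) and (2, 150, 64, 64)
```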

2. Task-Conditioned Multi-Task Training

OneFormer eschews the standard paradigm of training a separate model for each segmentation type. Instead, it is trained once using only panoptic annotations. For each training iteration, a segmentation task is uniformly sampled ($p = 1/3$ each): panoptic, semantic, or instance. From the panoptic ground truth, the method derives masks corresponding to each task specification:

  • Semantic: single mask per class, including “stuff” and “thing” categories
  • Instance: per-instance masks for “things” only
  • Panoptic: both “thing” masks and amorphous “stuff” masks

A text list $\mathbf{T}_{\text{list}}$ is built with one entry per ground-truth mask, rendered as “a photo with a {CLS}”, and padded to a fixed length $N_{\text{text}}$ with “a/an {task} photo” entries representing no-object; these entries are tokenized and encoded into text queries $\mathbf{Q}_{\text{text}}$. The instruction prompt “the task is {task}” is encoded as $\mathbf{Q}_{\text{task}}$, so both query initialization and decoder operation are conditioned on the sampled task. At inference, changing this prompt directly selects the segmentation domain, producing instance, semantic, or panoptic outputs (Jain et al., 2022).
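
A minimal sketch of this task-sampled label derivation is shown below. The segment dictionary fields, class-name mapping, and padding rule are illustrative assumptions, not the reference implementation.

```python
import random
import torch

CLASS_NAMES = {0: "wall", 1: "person", 2: "car"}   # toy id-to-name mapping

def derive_task_targets(panoptic_segments, n_text, task=None):
    """Minimal sketch of the task-sampled label derivation described above.
    Assumes each segment is a dict {"mask": BoolTensor, "category": int,
    "is_thing": bool}; these field names are hypothetical."""
    task = task or random.choice(["panoptic", "instance", "semantic"])  # p = 1/3 each
    if task == "instance":                       # "thing" instances only
        targets = [s for s in panoptic_segments if s["is_thing"]]
    elif task == "semantic":                     # one merged mask per class
        merged = {}
        for s in panoptic_segments:
            if s["category"] in merged:
                merged[s["category"]] |= s["mask"]
            else:
                merged[s["category"]] = s["mask"].clone()
        targets = [{"mask": m, "category": c, "is_thing": False}
                   for c, m in merged.items()]
    else:                                        # panoptic: thing + stuff masks as-is
        targets = list(panoptic_segments)
    # One text entry per GT mask, padded to N_text with the no-object template
    # (article choice "a"/"an" is simplified to "a" here).
    texts = [f"a photo with a {CLASS_NAMES[s['category']]}" for s in targets]
    texts += [f"a {task} photo"] * max(0, n_text - len(texts))
    return task, targets, texts

# toy usage: two "person" instances and one "wall" stuff region on a 4x4 grid
segs = [{"mask": torch.zeros(4, 4, dtype=torch.bool), "category": 1, "is_thing": True},
        {"mask": torch.ones(4, 4, dtype=torch.bool), "category": 1, "is_thing": True},
        {"mask": torch.ones(4, 4, dtype=torch.bool), "category": 0, "is_thing": False}]
print(derive_task_targets(segs, n_text=6, task="semantic")[2])
```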

3. Query–Text Contrastive Loss and Optimization

To enforce discrimination between objects and across task modes, OneFormer introduces a bidirectional query–text contrastive loss. For matched object/text query pairs in a batch of size $B$, the forward and backward contrastive losses are:

$$\mathcal{L}_{Q\to Q_t} = -\frac{1}{B}\sum_{i=1}^{B} \log\frac{\exp(q_i^{\text{obj}}\cdot q_i^{\text{txt}}/\tau)}{\sum_{j=1}^{B}\exp(q_i^{\text{obj}}\cdot q_j^{\text{txt}}/\tau)}$$

$$\mathcal{L}_{Q_t\to Q} = -\frac{1}{B}\sum_{i=1}^{B} \log\frac{\exp(q_i^{\text{txt}}\cdot q_i^{\text{obj}}/\tau)}{\sum_{j=1}^{B}\exp(q_i^{\text{txt}}\cdot q_j^{\text{obj}}/\tau)}$$

$$\mathcal{L}_{\text{contra}} = \mathcal{L}_{Q\to Q_t} + \mathcal{L}_{Q_t\to Q}$$

Here, $\tau$ denotes a learnable temperature parameter. The total loss combines the classification loss $\mathcal{L}_{\text{cls}}$, the mask losses $\mathcal{L}_{\text{bce}} + \mathcal{L}_{\text{dice}}$, and the weighted contrastive term $\lambda_{\text{contra}}\mathcal{L}_{\text{contra}}$, with empirical weights $\lambda_{\text{contra}} = 0.5$, $\lambda_{\text{cls}} = 2$, and $\lambda_{\text{bce}} = \lambda_{\text{dice}} = 5$ (Jain et al., 2022).
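
Because the contrastive term reduces to a symmetric cross-entropy over a $B \times B$ similarity matrix, it can be sketched in a few lines. Whether the queries are L2-normalized before the dot product, and the exact parameterization of the learnable temperature, are assumptions here rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def query_text_contrastive_loss(q_obj, q_txt, log_tau):
    """Minimal sketch of the bidirectional query-text contrastive loss above.
    q_obj, q_txt: (B, D) matched object/text query pairs; log_tau: learnable
    scalar with tau = exp(log_tau). L2 normalization is an assumption."""
    q_obj = F.normalize(q_obj, dim=-1)
    q_txt = F.normalize(q_txt, dim=-1)
    logits = q_obj @ q_txt.t() / log_tau.exp()       # (B, B) scaled similarities
    labels = torch.arange(q_obj.size(0), device=q_obj.device)
    loss_q2t = F.cross_entropy(logits, labels)       # L_{Q -> Q_t}
    loss_t2q = F.cross_entropy(logits.t(), labels)   # L_{Q_t -> Q}
    return loss_q2t + loss_t2q

# toy usage
B, D = 8, 256
log_tau = torch.nn.Parameter(torch.zeros(()))
loss = query_text_contrastive_loss(torch.randn(B, D), torch.randn(B, D), log_tau)
```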

4. Training Protocols and Datasets

Training utilizes standard segmentation datasets:

  • ADE20K: 20k train, 2k val, 150 classes
  • Cityscapes: 2,975 train, 500 val, 19 classes
  • COCO: 118k train, 5k val, 133 classes

Augmentation uses random short-edge resizing, cropping, color jittering, and flipping on ADE20K and Cityscapes, and Large-Scale Jittering (LSJ, scale range 0.1–2.0) with 1024×1024 crops on COCO. The optimizer is AdamW with a learning rate of 1e-4 and weight decay of 0.1 (ADE20K/Cityscapes) or 0.05 (COCO); ADE20K and Cityscapes use a poly learning-rate schedule, while COCO uses a step schedule with a 10-iteration warmup. The batch size is 16, and training runs for 160k iterations on ADE20K, 90k iterations on Cityscapes, and 100 epochs on COCO (Jain et al., 2022).
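
For reference, the per-dataset optimization settings above can be captured in a small configuration sketch; the dict keys and builder function below are hypothetical, with the values taken from the text.

```python
import torch

# Minimal sketch of the per-dataset optimization settings listed above.
TRAIN_CONFIGS = {
    "ade20k":     {"lr": 1e-4, "weight_decay": 0.1,  "schedule": "poly",
                   "batch_size": 16, "iterations": 160_000},
    "cityscapes": {"lr": 1e-4, "weight_decay": 0.1,  "schedule": "poly",
                   "batch_size": 16, "iterations": 90_000},
    "coco":       {"lr": 1e-4, "weight_decay": 0.05, "schedule": "step + 10-iter warmup",
                   "batch_size": 16, "epochs": 100},
}

def build_optimizer(model, dataset):
    cfg = TRAIN_CONFIGS[dataset]
    return torch.optim.AdamW(model.parameters(), lr=cfg["lr"],
                             weight_decay=cfg["weight_decay"])
```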

5. Quantitative and Resource Efficiency Results

OneFormer demonstrates competitive or superior accuracy compared to task-specialized Mask2Former models on several benchmarks:

| Dataset | Model | Panoptic PQ | Instance AP | Semantic mIoU |
|---|---|---|---|---|
| ADE20K | Mask2Former (P/I/S) | 48.7 / 34.9 / — | 34.2 / 34.9 / — | 54.5 / — / 56.1 |
| ADE20K | OneFormer (joint) | 49.8 | 35.9 | 57.0 |
| Cityscapes | Mask2Former (panoptic) | 66.6 | 43.6 | 82.9 |
| Cityscapes | OneFormer (joint) | 67.2 | 45.6 | 83.0 |
| COCO | Mask2Former (panoptic) | 57.8 | 48.7 | 67.4 |
| COCO | OneFormer (joint) | 57.9 | 49.0 | 67.4 |

OneFormer reduces resource usage roughly threefold: on ADE20K, a single 219M-parameter model and one 160k-iteration training run replace three separately trained 216M-parameter Mask2Former models, yielding substantial savings in training time and storage (~⅔ reduction) (Jain et al., 2022).

6. Ablation Studies and Model Analysis

Ablation results assess the architectural and loss components:

  • Task conditioning: Removing $\mathbf{Q}_{\text{task}}$ reduces AP by 2.3 and PQ by 0.7; omitting the learnable text context $\mathbf{Q}_{\text{ctx}}$ reduces PQ by 4.5; zero initialization of the queries reduces PQ by 1.4 and AP by 1.1.
  • Contrastive loss: Excluding it drops PQ from 67.2 to 58.8 (–8.4) and AP by 3.2; replacing it with a query classification loss yields PQ 66.4 (–0.8) and AP 44.7 (–0.9).
  • Text templates: Using “a photo with a {CLS}” outperforms alternatives such as adding task-type suffixes or using class names alone (Jain et al., 2022).

7. Qualitative Insights and Failure Modes

Reduced category confusions—such as “wall” vs “fence” and “vegetation” vs “terrain”—are observed, attributed to the query–text contrastive loss. The model produces task-dynamic outputs: instance segmentation yields “thing” masks, semantic segmentation produces single masks per class, and panoptic segmentation recovers both instance and “stuff” regions. Noted failure modes include missed small objects (notably in COCO), imperfect mask boundaries in cluttered areas, and GT annotation mismatches between panoptic and instance modes (COCO). OneFormer predictions align more closely with panoptic GT, reflecting its primary training source (Jain et al., 2022).

References (1)

  • Jain, J., Li, J., Chiu, M., Hassani, A., Orlov, N., and Shi, H. (2022). OneFormer: One Transformer to Rule Universal Image Segmentation. arXiv:2211.06220.