OneFormer: Unified Transformer for Segmentation
- The paper introduces a unified transformer model that achieves state-of-the-art results on semantic, instance, and panoptic segmentation, outperforming specialized Mask2Former models trained individually for each task.
- OneFormer employs task-conditioned queries and multi-scale deformable attention to flexibly switch output domains with a single text prompt during inference.
- The framework reduces computational resources by consolidating three separate models into one 219M-parameter architecture validated on ADE20K, Cityscapes, and COCO.
OneFormer is a single-model, transformer-based framework for universal image segmentation that achieves state-of-the-art performance on semantic, instance, and panoptic segmentation tasks within a unified architecture and training process. Unlike prior methods that require specialized models and separately trained architectures for each segmentation paradigm, OneFormer introduces a train-once, task-conditioned approach capable of dynamically switching its output domain simply by varying a text prompt at inference. OneFormer is the first segmentation model to simultaneously outperform three specialized Mask2Former models trained individually on ADE20K, Cityscapes, and COCO, using a single set of model parameters and substantially reduced computational resources (Jain et al., 2022).
1. Unified Architecture and Segmentation Pipeline
OneFormer uses a backbone (Swin-L, ConvNeXt-L, or DiNAT-L) to extract multi-scale features at 1/4, 1/8, 1/16, and 1/32 of the input resolution. These features are fused by a pixel decoder based on Multi-Scale Deformable Attention, which upsamples them to a 1/4-scale feature map. The framework is conditioned on a text prompt, "the task is {panoptic/instance/semantic}", which is tokenized and mapped via a small transformer to produce a D-dimensional task token Q_task.
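For concreteness, here is a minimal PyTorch sketch of mapping the tokenized task prompt to a single D-dimensional task token. The class name, vocabulary size, pooling choice, and layer counts are illustrative assumptions rather than the paper's exact text mapper.

```python
import torch
import torch.nn as nn

class TaskTokenMapper(nn.Module):
    """Hypothetical stand-in for OneFormer's task-prompt encoder: token ids of
    "the task is {task}" -> one D-dimensional task token per image."""

    def __init__(self, vocab_size=49408, embed_dim=256, max_len=16):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):              # token_ids: (B, L)
        x = self.word_embed(token_ids) + self.pos_embed[: token_ids.size(1)]
        x = self.encoder(x)                    # (B, L, D) contextualized tokens
        return x.mean(dim=1)                   # pool to a single (B, D) task token

# Toy usage: two images, prompts already tokenized to 8 ids each.
task_token = TaskTokenMapper()(torch.randint(0, 49408, (2, 8)))  # (2, 256)
```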
Object queries are initialized by repeating the task token Q_task (N−1 times) and processed by self-/cross-attention over the flattened 1/4-scale features, yielding image-aware, task-conditioned queries. These are concatenated with Q_task itself to form the N task-conditioned object queries Q. A multi-layer transformer decoder alternates masked cross-attention, self-attention, and MLP operations, attending these queries to the multi-scale pixel features. Each decoder output head predicts K+1 class scores (with K the number of dataset classes, and "+1" for "no-object") and binary masks, computed as an einsum between learned query embeddings and the 1/4-scale pixel features. The inference process mirrors Mask2Former, including thresholding and post-processing, to yield panoptic, instance, or semantic segmentation outputs depending on the prompt (Jain et al., 2022).
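The query construction and mask prediction described above can be summarized with the sketch below; the attention-based refinement of the repeated queries against image features is omitted, and all shapes are toy values.

```python
import torch

def build_task_conditioned_queries(task_token, num_queries):
    """Repeat the task token N-1 times and prepend the task token itself,
    giving N task-conditioned object queries (B, N, D). The attention-based
    refinement of the repeated queries against image features is omitted."""
    B, D = task_token.shape
    repeats = task_token.unsqueeze(1).expand(B, num_queries - 1, D)
    return torch.cat([task_token.unsqueeze(1), repeats], dim=1)

def predict_masks(query_embed, pixel_features):
    """Binary mask logits as an einsum between per-query embeddings (B, N, D)
    and the 1/4-scale per-pixel features (B, D, H, W), giving (B, N, H, W)."""
    return torch.einsum("bqd,bdhw->bqhw", query_embed, pixel_features)

# Toy shapes: B=2 images, N=150 queries, D=256 channels, 64x64 feature map.
queries = build_task_conditioned_queries(torch.randn(2, 256), num_queries=150)
masks = predict_masks(queries, torch.randn(2, 256, 64, 64))
print(queries.shape, masks.shape)  # (2, 150, 256) and (2, 150, 64, 64)
```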
2. Task-Conditioned Multi-Task Training
OneFormer eschews the standard paradigm of training a separate model for each segmentation type. Instead, it is trained once using only panoptic annotations. For each training iteration, a segmentation task is uniformly sampled (probability 1/3 each): panoptic, semantic, or instance. From the panoptic ground truth, the method derives masks corresponding to each task specification (a minimal sketch follows the list below):
- Semantic: single mask per class, including “stuff” and “thing” categories
- Instance: per-instance masks for “things” only
- Panoptic: both “thing” masks and amorphous “stuff” masks
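The sketch below illustrates this derivation, assuming each panoptic segment is available as a dict holding a binary mask, a category_id, and an isthing flag; the actual annotation format used by the released code may differ.

```python
import numpy as np

def derive_task_targets(panoptic_segments, task):
    """Derive task-specific ground truth from panoptic segments (sketch)."""
    if task == "panoptic":
        # Keep every segment: "thing" instances plus amorphous "stuff" regions.
        return [(s["category_id"], s["mask"]) for s in panoptic_segments]
    if task == "instance":
        # Keep only "thing" segments, one mask per instance.
        return [(s["category_id"], s["mask"]) for s in panoptic_segments if s["isthing"]]
    if task == "semantic":
        # Merge all segments of a class (stuff and things) into one mask per class.
        merged = {}
        for s in panoptic_segments:
            cid = s["category_id"]
            merged[cid] = merged.get(cid, np.zeros_like(s["mask"])) | s["mask"]
        return list(merged.items())
    raise ValueError(f"unknown task: {task}")
```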
A text entry list T_text is rendered with one entry per ground-truth mask of the form "a photo with a {CLS}", padded to a fixed length N_text with "a/an {task} photo" entries for "no-object"; these entries are tokenized and encoded into the text queries Q_text. The instruction prompt "the task is {task}" is encoded as the task token Q_task, ensuring that both query initialization and decoder operation are conditioned appropriately. At inference, changing this prompt directly modulates the segmentation domain, yielding instance, semantic, or panoptic outputs (Jain et al., 2022).
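A short sketch of the text-list construction; the tokenizer and text encoder that turn these strings into the text queries Q_text are omitted, and the helper name and article handling are illustrative.

```python
def build_text_list(gt_class_names, task, num_text):
    """One "a photo with a {CLS}" entry per ground-truth mask, padded with
    "a/an {task} photo" no-object entries to a fixed length num_text."""
    entries = [f"a photo with a {cls}" for cls in gt_class_names]
    article = "an" if task[0] in "aeiou" else "a"
    entries = entries[:num_text] + [f"{article} {task} photo"] * max(0, num_text - len(entries))
    return entries

# Example: two ground-truth masks (car, road) for the panoptic task, padded to length 6.
print(build_text_list(["car", "road"], task="panoptic", num_text=6))
```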
3. Query–Text Contrastive Loss and Optimization
To enforce discrimination between objects and task modes, OneFormer introduces a bidirectional query–text contrastive loss. For matched object/text query pairs {(q_i^obj, q_i^txt)} in a batch of size B, the forward and backward contrastive losses are:

$$
\mathcal{L}_{Q \to Q_{\text{text}}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(q_i^{\text{obj}} \cdot q_i^{\text{txt}} / \tau\big)}{\sum_{j=1}^{B} \exp\!\big(q_i^{\text{obj}} \cdot q_j^{\text{txt}} / \tau\big)}, \qquad
\mathcal{L}_{Q_{\text{text}} \to Q} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(q_i^{\text{txt}} \cdot q_i^{\text{obj}} / \tau\big)}{\sum_{j=1}^{B} \exp\!\big(q_i^{\text{txt}} \cdot q_j^{\text{obj}} / \tau\big)}
$$

Here, τ denotes a learnable temperature parameter. The total loss combines classification (L_cls), mask losses (L_bce and L_dice), and the weighted contrastive term (L_Q→Q_text + L_Q_text→Q), with empirical weights λ_cls = 2, λ_bce = λ_dice = 5, and λ_contrastive = 0.5 (Jain et al., 2022).
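A compact PyTorch sketch of this bidirectional loss follows; the L2 normalization of the queries and the log-parameterized temperature are assumptions about details not spelled out above.

```python
import torch
import torch.nn.functional as F

def query_text_contrastive_loss(obj_q, txt_q, log_tau):
    """Bidirectional query-text contrastive loss over B matched pairs.
    obj_q, txt_q: (B, D); log_tau: learnable scalar so that tau = exp(log_tau)."""
    obj = F.normalize(obj_q, dim=-1)             # assumed cosine-similarity logits
    txt = F.normalize(txt_q, dim=-1)
    logits = obj @ txt.t() / log_tau.exp()       # (B, B); diagonal = matched pairs
    targets = torch.arange(obj.size(0), device=obj.device)
    loss_q2t = F.cross_entropy(logits, targets)      # object -> text direction
    loss_t2q = F.cross_entropy(logits.t(), targets)  # text -> object direction
    return loss_q2t + loss_t2q

# Toy example: B=8 matched pairs of 256-d object/text queries.
log_tau = torch.nn.Parameter(torch.zeros(()))        # tau starts at exp(0) = 1
loss = query_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256), log_tau)
```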
4. Training Protocols and Datasets
Training utilizes standard segmentation datasets:
- ADE20K: 20k train, 2k val, 150 classes
- Cityscapes: 2,975 train, 500 val, 19 classes
- COCO: 118k train, 5k val, 133 classes
Augmentation on ADE20K/Cityscapes uses random short-edge resizing, cropping, color jitter, and horizontal flipping; COCO uses Large Scale Jittering (LSJ) multi-scale augmentation (scale range 0.1–2.0) with 1024×1024 crops. The optimizer is AdamW with learning rate 1e-4 and weight decay 0.1 (ADE20K/Cityscapes) or 0.05 (COCO), a poly learning-rate schedule for ADE20K/Cityscapes, and a step schedule with a 10-iteration warmup for COCO. Batch size is 16; training runs for 160k iterations (ADE20K), 90k iterations (Cityscapes), and 100 epochs (COCO) (Jain et al., 2022).
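A brief sketch of the reported optimizer settings for ADE20K/Cityscapes; the placeholder model and the poly power of 0.9 are assumptions for illustration.

```python
import torch

model = torch.nn.Linear(256, 256)  # placeholder standing in for the full OneFormer model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

max_iters = 160_000  # ADE20K schedule; 90k for Cityscapes
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** 0.9)  # poly decay to 0

# Per training iteration:
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```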
5. Quantitative and Resource Efficiency Results
OneFormer demonstrates competitive or superior accuracy compared to task-specialized Mask2Former models on several benchmarks. In the ADE20K row below, slash-separated entries report the individually trained panoptic/instance/semantic (P/I/S) Mask2Former models, respectively:
| Dataset | Model | Panoptic PQ | Instance AP | Semantic mIoU |
|---|---|---|---|---|
| ADE20K | Mask2Former (P/I/S) | 48.7/34.9/— | 34.2/34.9/— | 54.5/—/56.1 |
| ADE20K | OneFormer (joint) | 49.8 | 35.9 | 57.0 |
| Cityscapes | Mask2Former (panoptic) | 66.6 | 43.6 | 82.9 |
| Cityscapes | OneFormer (joint) | 67.2 | 45.6 | 83.0 |
| COCO | Mask2Former (panoptic) | 57.8 | 48.7 | 67.4 |
| COCO | OneFormer (joint) | 57.9 | 49.0 | 67.4 |
OneFormer reduces resource usage by roughly a factor of three: a single 219M-parameter model and a single training run (e.g., 160k iterations on ADE20K), versus three separate 216M-parameter Mask2Former models trained individually. This yields substantial savings in both training time and storage (~⅔ reduction) (Jain et al., 2022).
6. Ablation Studies and Model Analysis
Ablation results assess the architectural and loss components:
- Task conditioning: Removing the task token Q_task reduces AP by 2.3 and PQ by 0.7; omitting the learnable text context embeddings reduces PQ by 4.5; initializing queries to zero instead of repeating Q_task reduces PQ by 1.4 and AP by 1.1.
- Contrastive loss: Excluding it drops PQ from 67.2 to 58.8 (–8.4) and AP by 3.2; replacing it with a query classification loss results in PQ 66.4 (–0.8) and AP 44.7 (–0.9).
- Text templates: Using “a photo with a {CLS}” outperforms alternatives such as adding task type suffixes or using class names alone (Jain et al., 2022).
7. Qualitative Insights and Failure Modes
Reduced category confusions—such as “wall” vs “fence” and “vegetation” vs “terrain”—are observed, attributed to the query–text contrastive loss. The model produces task-dynamic outputs: instance segmentation yields “thing” masks, semantic segmentation produces single masks per class, and panoptic segmentation recovers both instance and “stuff” regions. Noted failure modes include missed small objects (notably in COCO), imperfect mask boundaries in cluttered areas, and GT annotation mismatches between panoptic and instance modes (COCO). OneFormer predictions align more closely with panoptic GT, reflecting its primary training source (Jain et al., 2022).