OMG-Seg: Unified Transformer for Segmentation

Updated 7 January 2026
  • OMG-Seg is a unified transformer-based architecture that efficiently handles over 10 segmentation tasks, including semantic, instance, panoptic, video, and interactive segmentation.
  • It leverages a frozen CLIP backbone with a shared pixel decoder and transformer mask decoder to drastically reduce parameter overhead while maintaining performance.
  • Quantitative evaluations show competitive metrics (PQ, mAP, VPQ) across diverse datasets, demonstrating robust multi-task learning and effective parameter sharing.

OMG-Seg (One Model is Good Enough for All Segmentation) is a unified transformer-based architecture designed to efficiently address over ten diverse segmentation tasks within a single model. It integrates image and video semantic, instance, and panoptic segmentation, including their open-vocabulary, prompt-driven, and interactive variants, such as those inspired by Segment Anything Model (SAM) and video object segmentation, all with a shared parameterization and minimal per-task customization. The model leverages a frozen large-scale visual backbone and shares almost all components across tasks, providing significantly reduced computational and parameter overhead without sacrificing competitive performance (Li et al., 2024).

1. Transformer Encoder–Decoder Architecture

OMG-Seg employs a three-stage pipeline: a frozen CLIP visual backbone, a pixel decoder, and a unified transformer-based mask decoder. The backbone is ConvNeXt-based, extracting multi-scale features $F^{\text{frozen}}_1, F^{\text{frozen}}_2, F^{\text{frozen}}_3$ from images or videos. These features are then projected by a lightweight deformable-attention pixel decoder (mirroring Mask2Former) into a unified dimensional space, producing fused features $F^{\text{fuse}}_1, F^{\text{fuse}}_2, F^{\text{fuse}}_3$ for the mask decoder.

The mask decoder is a multi-stage DETR-style transformer decoder that processes $K$ learnable queries across three levels, combining multi-head self-attention among queries with cross-attention to the fused multi-scale features. Each query outputs a classification score (using cosine similarity to CLIP text embeddings) and a mask prediction generated by dot-product with the high-resolution features $F_3^{\text{fuse}}$.

The end-to-end architecture performs the sequence: Image/video → frozen CLIP backbone → pixel decoder → shared mask decoder → task-specific mask/class outputs.
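
This sequence can be made concrete with a short PyTorch-style sketch. All module and argument names (`backbone`, `pixel_decoder`, `mask_decoder`, `text_embeds`) are hypothetical stand-ins for the components described in this section, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OMGSegSketch(nn.Module):
    """Minimal sketch of the OMG-Seg pipeline; module names are placeholders."""

    def __init__(self, backbone, pixel_decoder, mask_decoder,
                 num_queries=300, dim=256):
        super().__init__()
        self.backbone = backbone              # frozen CLIP visual encoder (ConvNeXt)
        for p in self.backbone.parameters():  # keep the backbone frozen
            p.requires_grad_(False)
        self.pixel_decoder = pixel_decoder    # Mask2Former-style deformable adapter
        self.mask_decoder = mask_decoder      # shared DETR-style transformer decoder
        self.queries = nn.Embedding(num_queries, dim)  # K learnable semantic queries

    def forward(self, images, text_embeds):
        feats = self.backbone(images)                # F^frozen_1..3 (multi-scale)
        f_fuse = self.pixel_decoder(feats)           # F^fuse_1..3 (shared dimension)
        q = self.mask_decoder(self.queries.weight, f_fuse)    # (B, K, dim)
        # Masks: dot-product between queries and the high-resolution map F^fuse_3
        masks = torch.einsum("bkd,bdhw->bkhw", q, f_fuse[-1])
        # Classes: cosine similarity against CLIP text embeddings (open-vocabulary)
        logits = F.normalize(q, dim=-1) @ F.normalize(text_embeds, dim=-1).T
        return masks, logits
```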

2. Unified Query-Based Task Representation

Central to OMG-Seg is the unification of all segmentation tasks by representing outputs as queries. There are two principal query types:

  • Semantic Queries ($Q^s$): Each represents an object, stuff region, or video tube mask for tasks including image/video semantic, instance, panoptic, video-panoptic, video-instance, video-semantic segmentation (VSS), and video object segmentation (VOS). Each query carries a mask (of shape $H \times W$ or $T \times H \times W$), a class label $c_i$ (using CLIP text embeddings in the open-vocabulary setting), and optionally an instance ID $d_i$ for video applications.
  • Location Queries ($Q^l$): Each is generated from user prompts (points/boxes) for interactive segmentation (SAM-style) via a dedicated prompt encoder. Location queries attend only to their prompted regions and omit self-attention, ensuring no cross-interactions.
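
A rough sketch of how the two query types can share one decoder layer is given below: location queries are excluded from query self-attention via an attention mask. The helper name and the exact masking pattern are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

def build_query_attn_mask(num_semantic, num_location):
    """Boolean attention mask (True = blocked). Semantic queries attend to each
    other; location queries from prompts skip query-to-query interaction."""
    k = num_semantic + num_location
    mask = torch.zeros(k, k, dtype=torch.bool)
    mask[num_semantic:, :] = True   # location queries attend to no other query
    mask[:, num_semantic:] = True   # and no query attends to them
    # keep each location query attending to itself so softmax stays defined
    idx = torch.arange(num_semantic, k)
    mask[idx, idx] = False
    return mask

# Usage with a standard multi-head attention layer:
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
q_sem = torch.randn(1, 300, 256)   # learnable semantic queries Q^s
q_loc = torch.randn(1, 16, 256)    # prompt-encoded location queries Q^l
queries = torch.cat([q_sem, q_loc], dim=1)
out, _ = attn(queries, queries, queries,
              attn_mask=build_query_attn_mask(300, 16))
```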

The model accommodates, with the same decoder, standard image and video tasks, open-vocabulary cases (with text embedding-based classification), and prompt-driven interactive scenarios. This design supports:

  • Semantic Segmentation (SS)
  • Instance Segmentation (IS)
  • Panoptic Segmentation (PS)
  • Video Semantic (VSS), Video Instance (VIS), and Video Panoptic (VPS)
  • Video Object Segmentation (VOS, class-agnostic tracking)
  • Open-vocabulary and interactive segmentation

3. Training Objective and Loss Functions

OMG-Seg conducts Hungarian matching per sample between ground-truth entities and the $K$ predicted queries. For each matched query $i$, the following losses are computed:

  • Classification Loss (Cross-Entropy):

L_{\text{cls}} = -\sum_k \mathbf{1}[c_i = k] \cdot \log \hat{p}_i(k)

  • Mask Losses:

    • Binary Cross-Entropy:

    L_{\text{ce}} = -[m_i \log \hat{m}_i + (1 - m_i) \log (1 - \hat{m}_i)]

    • Dice Loss:

    L_{\text{dice}} = 1 - \frac{2\,|\hat{m}_i \wedge m_i|}{|\hat{m}_i| + |m_i| + \epsilon}

The global objective sums these over all matched queries and all tasks, leading to a joint loss:

L = \lambda_{\text{cls}} L_{\text{cls}} + \lambda_{\text{ce}} L_{\text{ce}} + \lambda_{\text{dice}} L_{\text{dice}}

where the default configuration sets all $\lambda$ weights to 1. This facilitates parameter sharing and multi-task learning across images, videos, and interactive inputs (Li et al., 2024).
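
A minimal sketch of this objective is shown below, assuming per-image tensors `mask_logits` of shape (K, H, W), `cls_logits` of shape (K, C), and N ground-truth masks and labels. The matching uses `scipy.optimize.linear_sum_assignment` with a simplified cost, and the no-object term for unmatched queries is omitted; this is illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened masks (pred already in [0, 1])."""
    inter = (pred * target).sum(-1)
    return 1.0 - (2.0 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)

def omg_seg_loss(mask_logits, cls_logits, gt_masks, gt_labels,
                 w_cls=1.0, w_ce=1.0, w_dice=1.0):
    """Match K predicted queries to ground-truth entities, then sum the
    classification, binary cross-entropy, and Dice terms over matched pairs."""
    probs = mask_logits.sigmoid().flatten(1)            # (K, H*W)
    gts = gt_masks.float().flatten(1)                   # (N, H*W)
    # Matching cost: mask dissimilarity plus classification cost
    cost = (dice_loss(probs[:, None], gts[None, :])     # (K, N)
            - cls_logits.log_softmax(-1)[:, gt_labels])
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    q_idx, g_idx = torch.as_tensor(q_idx), torch.as_tensor(g_idx)

    l_cls = F.cross_entropy(cls_logits[q_idx], gt_labels[g_idx])
    l_ce = F.binary_cross_entropy_with_logits(mask_logits.flatten(1)[q_idx],
                                              gts[g_idx])
    l_dice = dice_loss(probs[q_idx], gts[g_idx]).mean()
    return w_cls * l_cls + w_ce * l_ce + w_dice * l_dice
```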

4. Co-Training and Dataset Integration

OMG-Seg is trained via joint multi-task learning with balanced sampling across diverse datasets:

  • Image tasks: COCO Panoptic, COCO SAM (synthetic prompts), ADE-20k (open-vocab)
  • Video tasks: VIPSeg (VPS), YouTube-VIS 2019 (VIS), DAVIS-17 (VOS)
  • Open-vocabulary evaluation: YouTube-VIS 2021, ADE-20k, DAVIS-17

Batch construction uses COCO as an anchor and up/down-samples other datasets for balanced task exposure, with pseudo-video creation for static images (duplicated as 2-frame clips). The frozen backbone mitigates catastrophic forgetting, and unified queries require decoders to integrate both spatial and temporal information. Empirical analysis (Tab. 4) shows positive transfer: adding VIPSeg notably boosts video performance metrics with minor negative impact on COCO PS; integrating YouTube-VIS further improves VIS and open-vocabulary VIS.
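
The batch-construction strategy can be illustrated with the following sketch; the dataset structure, the sampling ratios, and both helper functions are assumptions made for exposition rather than the authors' training code:

```python
import random

def image_to_pseudo_clip(sample):
    """Duplicate a static image (and its masks) into a 2-frame clip so the
    decoder always sees the video layout (T, H, W)."""
    return {
        "frames": [sample["image"], sample["image"]],
        "masks": [sample["masks"], sample["masks"]],
        "instance_ids": sample.get("instance_ids"),  # identical across both frames
    }

def balanced_batch_iterator(datasets, ratios, steps):
    """Sample datasets with fixed ratios, using COCO as the anchor.
    `datasets` maps name -> list of samples; `ratios` maps name -> weight."""
    names = list(datasets)
    weights = [ratios[n] for n in names]
    for _ in range(steps):
        name = random.choices(names, weights=weights, k=1)[0]
        sample = random.choice(datasets[name])
        if "image" in sample:                # static-image dataset -> pseudo-video
            sample = image_to_pseudo_clip(sample)
        yield name, sample
```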

5. Parameterization and Model Sharing

OMG-Seg is characterized by almost complete parameter sharing:

  • All tasks use the same frozen visual backbone (CLIP), pixel decoder, and transformer mask decoder.
  • No per-task output heads are used—only the query embeddings and classifier (semantic/location, class/prompt) differentiate the task.
  • This yields a drastic parameter reduction: 221M parameters vs. ~1326M for a naïve multi-head baseline (6× smaller), with GFLOPs ≈ 868.

Previous “unified” segmentation models generally maintain separate heads per task, whereas OMG-Seg’s design forgoes this, demonstrating that a single decoder suffices for competitive results across >10 tasks.
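
As a quick sanity check of the sharing scheme, one could compare total and trainable parameter counts after freezing the backbone, roughly as follows (usage shown with the hypothetical `OMGSegSketch` from Section 1):

```python
def count_parameters(model):
    """Total vs. trainable parameters; with the CLIP backbone frozen, only the
    pixel decoder, mask decoder, and query embeddings contribute gradients."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# model = OMGSegSketch(backbone, pixel_decoder, mask_decoder)
# total, trainable = count_parameters(model)
# print(f"total={total / 1e6:.0f}M  trainable={trainable / 1e6:.0f}M")
```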

6. Quantitative Performance and Ablation Analysis

The main single model, with a frozen ConvNeXt-Large backbone, achieves:

| Task | Metric |
|------|--------|
| COCO Panoptic | PQ = 53.8 |
| Cityscapes Panoptic | PQ = 65.7 |
| COCO Instance | mAP = 44.5 |
| VIPSeg VPS | VPQ = 49.8 |
| YouTube-VIS-19 VIS | mAP = 56.4 |
| YouTube-VIS-21 OV VIS | mAP = 50.5 |
| ADE-20k OV Panoptic | PQ = 27.9 |
| DAVIS-17 OV VOS | J&F = 74.3 |
| COCO-SAM (interactive) | mIoU = 58.0 |

Scaling up the backbone to ConvNeXt-XX-Large brings +1–4 points to all tasks. Joint co-training across 5 datasets retains performance on video tasks and only yields a minor drop in ADE-20k panoptic scores due to class frequency imbalance (Tab. 3).

Ablation studies reveal:

  • Decoder and adapter sharing saves 22M parameters with negligible performance cost.
  • Deeper pixel decoders slightly improve PQ and VPQ for short schedules, but differences vanish at longer training.
  • Model performance scales positively with backbone size.
  • Adding more video data (VIPSeg, YouTube-VIS) consistently improves video and open-vocabulary segmentation, with minimal effect on core image tasks.
  • Masking interactive queries in self-attention significantly improves COCO-SAM mIoU (from 40.7 to 52.2).

7. Architectural Illustrations and Summary

Architectural schematics (Fig. 2) demonstrate the overall system flow:

  • (a) Composite structure: CLIP backbone, pixel adapter, shared mask decoder, prompt encoder.
  • (b) Transformer decoder layers explicitly distinguish between masked and standard self-attention (for $Q^l$ vs. $Q^s$).
  • (c) Training and inference pathways, including open-vocabulary classification via CLIP text embeddings.

Associated tables (Tabs. 2–9) provide exhaustive metrics across tasks and detailed analyses of co-training impact, architectural component variations, and query attention mechanisms.

OMG-Seg establishes that a straightforward DETR-style transformer architecture with comprehensively shared parameters and carefully constructed queries, when co-trained on a diverse task suite, can deliver robust, unified performance in semantic, instance, panoptic, video, open-vocabulary, and interactive segmentation (Li et al., 2024).
