OMG-Seg: Unified Transformer for Segmentation
- OMG-Seg is a unified transformer-based architecture that efficiently handles over 10 segmentation tasks, including semantic, instance, panoptic, video, and interactive segmentation.
- It leverages a frozen CLIP backbone with a shared pixel decoder and transformer mask decoder to drastically reduce parameter overhead while maintaining performance.
- Quantitative evaluations show competitive metrics (PQ, mAP, VPQ) across diverse datasets, demonstrating robust multi-task learning and effective parameter sharing.
OMG-Seg (One Model is Good Enough for All Segmentation) is a unified transformer-based architecture designed to efficiently address more than ten diverse segmentation tasks within a single model. It covers image and video semantic, instance, and panoptic segmentation, together with their open-vocabulary, prompt-driven, and interactive variants (such as Segment Anything Model (SAM)-style interactive segmentation and video object segmentation), all with shared parameterization and minimal per-task customization. The model leverages a frozen large-scale visual backbone and shares almost all components across tasks, providing significantly reduced computational and parameter overhead without sacrificing competitive performance (Li et al., 2024).
1. Transformer Encoder–Decoder Architecture
OMG-Seg employs a three-stage pipeline: a frozen CLIP visual backbone, a pixel decoder, and a unified transformer-based mask decoder. The backbone is ConvNeXt-based and extracts multi-scale features from images or videos. These features are then projected by a lightweight deformable-attention pixel decoder (mirroring Mask2Former) into a unified feature dimension, producing fused multi-scale features and high-resolution per-pixel embeddings for the mask decoder.
The mask decoder is a multi-stage DETR-style transformer decoder that processes learnable queries across three levels, combining multi-head self-attention among queries with cross-attention to the fused multi-scale features. Each query outputs a classification score (computed as cosine similarity to CLIP text embeddings) and a mask prediction obtained by a dot product between the query embedding and the high-resolution pixel features.
The end-to-end architecture performs the sequence: Image/video → frozen CLIP backbone → pixel decoder → shared mask decoder → task-specific mask/class outputs.
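The PyTorch sketch below illustrates this flow end to end; the backbone stub, feature dimensions, query count, and class count are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OMGSegSketch(nn.Module):
    """Illustrative skeleton: frozen backbone -> pixel decoder -> shared mask decoder."""

    def __init__(self, embed_dim=256, num_queries=100, num_classes=133):
        super().__init__()
        # Stand-in for the frozen CLIP ConvNeXt backbone (kept frozen during training).
        self.backbone = nn.Sequential(nn.Conv2d(3, embed_dim, 4, stride=4), nn.GELU())
        for p in self.backbone.parameters():
            p.requires_grad = False

        # Lightweight pixel decoder: projects backbone features to a unified dimension
        # and yields the high-resolution per-pixel embeddings used for mask prediction.
        self.pixel_decoder = nn.Conv2d(embed_dim, embed_dim, 1)

        # Shared DETR-style mask decoder over learnable queries.
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.mask_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.queries = nn.Embedding(num_queries, embed_dim)

        # Classification is done by cosine similarity to frozen CLIP text embeddings;
        # a random matrix stands in for those text embeddings here.
        self.register_buffer("text_embeds", torch.randn(num_classes, embed_dim))

    def forward(self, images):
        feats = self.backbone(images)                      # (B, C, H/4, W/4)
        pixel_feats = self.pixel_decoder(feats)            # (B, C, H/4, W/4)
        B = pixel_feats.shape[0]

        memory = pixel_feats.flatten(2).transpose(1, 2)    # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.mask_decoder(q, memory)                   # (B, N, C)

        # Mask prediction: dot product between queries and per-pixel embeddings.
        masks = torch.einsum("bnc,bchw->bnhw", q, pixel_feats)
        # Open-vocabulary classification: cosine similarity to text embeddings.
        logits = F.normalize(q, dim=-1) @ F.normalize(self.text_embeds, dim=-1).T
        return masks, logits

model = OMGSegSketch()
masks, logits = model(torch.randn(2, 3, 64, 64))
print(masks.shape, logits.shape)  # (2, 100, 16, 16) and (2, 100, 133)
```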
2. Unified Query-Based Task Representation
Central to OMG-Seg is the unification of all segmentation tasks by representing outputs as queries. There are two principal query types:
- Semantic Queries: each represents an object, a stuff region, or a video tube mask for tasks including image/video semantic, instance, panoptic, video-panoptic, video-instance, video-semantic segmentation (VSS), and video object segmentation (VOS). Each query carries a mask (of shape H × W for images or T × H × W for video clips), a class label (computed against CLIP text embeddings for open-vocabulary settings), and optionally an instance ID for video applications.
- Location Queries: each is generated from user prompts (points or boxes) for SAM-style interactive segmentation via a dedicated prompt encoder. Location queries attend only to their prompted regions and are excluded from self-attention, ensuring no cross-interactions between prompts.
The model accommodates, with the same decoder, standard image and video tasks, open-vocabulary cases (with text-embedding-based classification), and prompt-driven interactive scenarios; a query-construction sketch follows the task list below. This design supports:
- Semantic Segmentation (SS)
- Instance Segmentation (IS)
- Panoptic Segmentation (PS)
- Video Semantic (VSS), Video Instance (VIS), and Video Panoptic (VPS)
- Video Object Segmentation (VOS, class-agnostic tracking)
- Open-vocabulary and interactive segmentation
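The following is a hedged sketch of how the two query types could be constructed and fed to the same decoder; the sine-encoded point prompts and the names `UnifiedQueries` and `sine_point_encoding` are illustrative assumptions, not the paper's actual prompt encoder.

```python
import torch
import torch.nn as nn

def sine_point_encoding(points, dim=256):
    """Encode normalized (x, y) prompt points into dim-d embeddings.
    Illustrative stand-in for a SAM-style prompt encoder."""
    # points: (B, P, 2) with coordinates in [0, 1]
    freqs = torch.pow(10000.0, -torch.arange(0, dim // 4) / (dim // 4))
    xy = points.unsqueeze(-1) * freqs              # (B, P, 2, dim/4)
    enc = torch.cat([xy.sin(), xy.cos()], dim=-1)  # (B, P, 2, dim/2)
    return enc.flatten(-2)                         # (B, P, dim)

class UnifiedQueries(nn.Module):
    """Semantic queries are learned embeddings; location queries come from prompts.
    Both are concatenated and decoded by the same shared mask decoder."""

    def __init__(self, num_semantic=100, dim=256):
        super().__init__()
        self.semantic = nn.Embedding(num_semantic, dim)
        self.prompt_proj = nn.Linear(dim, dim)

    def forward(self, batch_size, prompt_points=None):
        q_sem = self.semantic.weight.unsqueeze(0).expand(batch_size, -1, -1)
        if prompt_points is None:
            return q_sem                           # plain image/video segmentation
        q_loc = self.prompt_proj(
            sine_point_encoding(prompt_points, self.semantic.embedding_dim))
        return torch.cat([q_sem, q_loc], dim=1)    # interactive / SAM-style mode

queries = UnifiedQueries()
print(queries(2).shape)                        # torch.Size([2, 100, 256])
print(queries(2, torch.rand(2, 5, 2)).shape)   # torch.Size([2, 105, 256])
```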
3. Training Objective and Loss Functions
OMG-Seg conducts Hungarian matching per sample between ground-truth entities and predicted queries. For each matched query, the following losses are computed:
- Classification Loss (Cross-Entropy): $\mathcal{L}_{\text{cls}} = \mathrm{CE}(p, c)$, where $p$ is the predicted class distribution (cosine similarity to CLIP text embeddings) and $c$ is the matched ground-truth label.
- Mask Losses:
  - Binary Cross-Entropy: $\mathcal{L}_{\text{bce}} = \mathrm{BCE}(\hat{m}, m)$ between the predicted mask $\hat{m}$ and the ground-truth mask $m$.
  - Dice Loss: $\mathcal{L}_{\text{dice}} = 1 - \dfrac{2\,|\hat{m} \cap m|}{|\hat{m}| + |m|}$.
The global objective sums these over all matched queries and all tasks, leading to a joint loss

$$\mathcal{L} = \sum_{\text{matched queries}} \left( \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}} + \lambda_{\text{bce}}\,\mathcal{L}_{\text{bce}} + \lambda_{\text{dice}}\,\mathcal{L}_{\text{dice}} \right),$$

where the default experiments set all $\lambda$ weights to 1. This facilitates parameter sharing and multi-task learning across images, videos, and interactive input (Li et al., 2024).
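A minimal sketch of this matching-and-loss step for a single sample is shown below, using SciPy's Hungarian solver; the matching-cost terms and tensor shapes are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def dice_loss(pred_mask, gt_mask, eps=1.0):
    """Soft Dice loss between a predicted mask (logits) and a binary GT mask."""
    p = pred_mask.sigmoid().flatten()
    g = gt_mask.flatten()
    return 1 - (2 * (p * g).sum() + eps) / (p.sum() + g.sum() + eps)

def omg_seg_loss(cls_logits, mask_logits, gt_labels, gt_masks,
                 w_cls=1.0, w_bce=1.0, w_dice=1.0):
    """Hungarian-match N predicted queries to K ground-truth entities, then sum
    cross-entropy + BCE + Dice over the matched pairs (all weights default to 1)."""
    with torch.no_grad():
        # Matching cost: negative class probability plus a simple mask L1 term.
        cost_cls = -cls_logits.softmax(-1)[:, gt_labels]                  # (N, K)
        cost_mask = torch.cdist(mask_logits.sigmoid().flatten(1),
                                gt_masks.flatten(1).float(), p=1)         # (N, K)
        rows, cols = linear_sum_assignment((cost_cls + cost_mask).cpu().numpy())

    loss = 0.0
    for q, t in zip(rows, cols):
        loss = loss + w_cls * F.cross_entropy(cls_logits[q:q+1], gt_labels[t:t+1])
        loss = loss + w_bce * F.binary_cross_entropy_with_logits(
            mask_logits[q], gt_masks[t].float())
        loss = loss + w_dice * dice_loss(mask_logits[q], gt_masks[t])
    return loss

# Toy example: 10 queries, 3 ground-truth entities, 16x16 masks, 20 classes.
loss = omg_seg_loss(torch.randn(10, 20), torch.randn(10, 16, 16),
                    torch.tensor([1, 4, 7]), torch.randint(0, 2, (3, 16, 16)))
print(loss)
```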
4. Co-Training and Dataset Integration
OMG-Seg is trained via joint multi-task learning with balanced sampling across diverse datasets:
- Image tasks: COCO Panoptic, COCO SAM (synthetic prompts), ADE-20k (open-vocab)
- Video tasks: VIPSeg (VPS), YouTube-VIS 2019 (VIS), DAVIS-17 (VOS)
- Open-vocabulary evaluation: YouTube-VIS 2021, ADE-20k, DAVIS-17
Batch construction uses COCO as an anchor and up/down-samples other datasets for balanced task exposure, with pseudo-video creation for static images (duplicated as 2-frame clips). The frozen backbone mitigates catastrophic forgetting, and unified queries require decoders to integrate both spatial and temporal information. Empirical analysis (Tab. 4) shows positive transfer: adding VIPSeg notably boosts video performance metrics with minor negative impact on COCO PS; integrating YouTube-VIS further improves VIS and open-vocabulary VIS.
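A rough sketch of the pseudo-video construction and COCO-anchored balancing could look like the following; the dataset sizes and the repeat-to-anchor heuristic are illustrative assumptions, not the paper's exact sampling scheme.

```python
import random

def make_pseudo_clip(image, num_frames=2):
    """Turn a static image into a short clip by duplicating it, so image datasets
    can be trained through the same video (tube-mask) interface."""
    return [image] * num_frames

def balanced_indices(dataset_sizes, anchor="coco"):
    """Return per-dataset sample indices, repeating or truncating each dataset so
    its effective size matches the anchor dataset (COCO in the paper's setup)."""
    target = dataset_sizes[anchor]
    plan = {}
    for name, size in dataset_sizes.items():
        idx = list(range(size))
        repeated = (idx * (target // size + 1))[:target]  # up-sample small datasets
        random.shuffle(repeated)
        plan[name] = repeated
    return plan

# Illustrative dataset sizes (not exact counts).
sizes = {"coco": 118000, "vipseg": 59000, "ytvis19": 62000, "davis17": 6000}
plan = balanced_indices(sizes)
print({k: len(v) for k, v in plan.items()})  # each dataset contributes ~118k samples per epoch
```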
5. Parameterization and Model Sharing
OMG-Seg is characterized by almost complete parameter sharing:
- All tasks use the same frozen visual backbone (CLIP), pixel decoder, and transformer mask decoder.
- No per-task output heads are used—only the query embeddings and classifier (semantic/location, class/prompt) differentiate the task.
- This yields a drastic parameter reduction: 221M parameters versus ~1326M for a naïve multi-head baseline (roughly 6× smaller), at approximately 868 GFLOPs.
Previous “unified” segmentation models generally maintain separate heads per task, whereas OMG-Seg’s design forgoes this, demonstrating that a single decoder suffices for competitive results across >10 tasks.
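The scale of this saving can be illustrated with a toy parameter count: a shared decoder is counted once, while a naive multi-head baseline replicates it per task. The decoder depth and width below are placeholders, not the paper's exact numbers.

```python
import torch.nn as nn

def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

def make_decoder(dim=256, layers=9):
    layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=layers)

shared = count_params(make_decoder())                             # one decoder for all tasks
per_task = sum(count_params(make_decoder()) for _ in range(10))   # naive: one per task
print(f"shared decoder: {shared/1e6:.1f}M, ten task-specific decoders: {per_task/1e6:.1f}M")
```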
6. Quantitative Performance and Ablation Analysis
The main single model, built on a frozen ConvNeXt-Large CLIP backbone, achieves:
| Task | Metric | Score |
|---|---|---|
| COCO Panoptic | PQ | 53.8 |
| Cityscapes Panoptic | PQ | 65.7 |
| COCO Instance | mAP | 44.5 |
| VIPSeg VPS | VPQ | 49.8 |
| YouTube-VIS-19 VIS | mAP | 56.4 |
| YouTube-VIS-21 OV VIS | mAP | 50.5 |
| ADE-20k OV Panoptic | PQ | 27.9 |
| DAVIS-17 OV VOS | J&F | 74.3 |
| COCO-SAM (interactive) | mIoU | 58.0 |
Scaling up the backbone to ConvNeXt-XX-Large brings +1–4 points to all tasks. Joint co-training across 5 datasets retains performance on video tasks and only yields a minor drop in ADE-20k panoptic scores due to class frequency imbalance (Tab. 3).
Ablation studies reveal:
- Decoder and adapter sharing saves 22M parameters with negligible performance cost.
- Deeper pixel decoders slightly improve PQ and VPQ for short schedules, but differences vanish at longer training.
- Model performance scales positively with backbone size.
- Adding more video data (VIPSeg, YouTube-VIS) consistently improves video and open-vocabulary segmentation, with minimal effect on core image tasks.
- Masking interactive queries in self-attention significantly improves COCO-SAM mIoU (from 40.7 to 52.2).
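The last ablation above concerns how interactive queries are isolated during self-attention. A minimal sketch of such masking, assuming a boolean attention mask passed to PyTorch's `nn.MultiheadAttention` (True = blocked) and an arbitrary 100 + 5 query layout, is shown below.

```python
import torch
import torch.nn as nn

def build_self_attn_mask(num_semantic, num_location):
    """Boolean mask (True = no attention). Semantic queries attend to each other;
    location (prompt) queries are isolated so prompts do not interact."""
    n = num_semantic + num_location
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[num_semantic:, :] = True          # location queries attend to nothing else...
    mask[:, num_semantic:] = True          # ...and nothing attends to them
    idx = torch.arange(n)
    mask[idx, idx] = False                 # keep the diagonal so softmax stays defined
    return mask

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
queries = torch.randn(2, 100 + 5, 256)     # 100 semantic + 5 location queries
out, _ = attn(queries, queries, queries, attn_mask=build_self_attn_mask(100, 5))
print(out.shape)                           # torch.Size([2, 105, 256])
```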
7. Architectural Illustrations and Summary
Architectural schematics (Fig. 2) demonstrate the overall system flow:
- (a) Composite structure: CLIP backbone, pixel adapter, shared mask decoder, prompt encoder.
- (b) Transformer decoder layers, explicitly distinguishing masked self-attention (for location queries) from standard self-attention (for semantic queries).
- (c) Training and inference pathways, including open-vocabulary classification via CLIP text embeddings.
Associated tables (Tabs. 2–9) provide exhaustive metrics across tasks and detailed analyses of co-training impact, architectural component variations, and query attention mechanisms.
OMG-Seg establishes that a straightforward DETR-style transformer architecture with comprehensively shared parameters and carefully constructed queries, when co-trained on a diverse task suite, can deliver robust, unified performance in semantic, instance, panoptic, video, open-vocabulary, and interactive segmentation (Li et al., 2024).