OMG-Seg: Unified Transformer for Segmentation
- OMG-Seg is a unified transformer-based architecture that efficiently handles over 10 segmentation tasks, including semantic, instance, panoptic, video, and interactive segmentation.
- It leverages a frozen CLIP backbone with a shared pixel decoder and transformer mask decoder to drastically reduce parameter overhead while maintaining performance.
- Quantitative evaluations show competitive metrics (PQ, mAP, VPQ) across diverse datasets, demonstrating robust multi-task learning and effective parameter sharing.
OMG-Seg (One Model is Good Enough for All Segmentation) is a unified transformer-based architecture designed to efficiently address more than ten diverse segmentation tasks within a single model. It covers image and video semantic, instance, and panoptic segmentation, together with their open-vocabulary, prompt-driven, and interactive variants (such as Segment Anything Model (SAM)-style interactive segmentation and video object segmentation), all with shared parameterization and minimal per-task customization. The model leverages a frozen large-scale visual backbone and shares almost all components across tasks, providing significantly reduced computational and parameter overhead without sacrificing competitive performance (Li et al., 2024).
1. Transformer Encoder–Decoder Architecture
OMG-Seg employs a three-stage pipeline: a frozen CLIP visual backbone, a pixel decoder, and a unified transformer-based mask decoder. The backbone is ConvNeXt-based and extracts multi-scale features from images or videos. These features are then projected by a lightweight deformable-attention pixel decoder (mirroring Mask2Former) into a unified feature dimension, producing fused multi-scale features and high-resolution per-pixel embeddings for the mask decoder.
The mask decoder is a multi-stage DETR-style transformer decoder that processes learnable queries across three levels, combining multi-head self-attention among queries with cross-attention to the fused multi-scale features. Each query outputs a classification score (computed as cosine similarity to CLIP text embeddings) and a mask prediction obtained by a dot product between the query embedding and the high-resolution pixel features.
The end-to-end architecture performs the sequence: Image/video → frozen CLIP backbone → pixel decoder → shared mask decoder → task-specific mask/class outputs.
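The PyTorch sketch below illustrates this flow end to end; the backbone stub, feature dimensions, query count, and class count are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OMGSegSketch(nn.Module):
    """Illustrative skeleton: frozen backbone -> pixel decoder -> shared mask decoder."""

    def __init__(self, embed_dim=256, num_queries=100, num_classes=133):
        super().__init__()
        # Stand-in for the frozen CLIP ConvNeXt backbone (kept frozen during training).
        self.backbone = nn.Sequential(nn.Conv2d(3, embed_dim, 4, stride=4), nn.GELU())
        for p in self.backbone.parameters():
            p.requires_grad = False

        # Lightweight pixel decoder: projects backbone features to a unified dimension
        # and yields the high-resolution per-pixel embeddings used for mask prediction.
        self.pixel_decoder = nn.Conv2d(embed_dim, embed_dim, 1)

        # Shared DETR-style mask decoder over learnable queries.
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.mask_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.queries = nn.Embedding(num_queries, embed_dim)

        # Classification is done by cosine similarity to frozen CLIP text embeddings;
        # a random matrix stands in for those text embeddings here.
        self.register_buffer("text_embeds", torch.randn(num_classes, embed_dim))

    def forward(self, images):
        feats = self.backbone(images)                      # (B, C, H/4, W/4)
        pixel_feats = self.pixel_decoder(feats)            # (B, C, H/4, W/4)
        B = pixel_feats.shape[0]

        memory = pixel_feats.flatten(2).transpose(1, 2)    # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.mask_decoder(q, memory)                   # (B, N, C)

        # Mask prediction: dot product between queries and per-pixel embeddings.
        masks = torch.einsum("bnc,bchw->bnhw", q, pixel_feats)
        # Open-vocabulary classification: cosine similarity to text embeddings.
        logits = F.normalize(q, dim=-1) @ F.normalize(self.text_embeds, dim=-1).T
        return masks, logits

model = OMGSegSketch()
masks, logits = model(torch.randn(2, 3, 64, 64))
print(masks.shape, logits.shape)  # (2, 100, 16, 16) and (2, 100, 133)
```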
2. Unified Query-Based Task Representation
Central to OMG-Seg is the unification of all segmentation tasks by representing outputs as queries. There are two principal query types:
- Semantic Queries: each represents an object, a stuff region, or a video tube mask for tasks including image/video semantic, instance, panoptic, video-panoptic, video-instance, video-semantic segmentation (VSS), and video object segmentation (VOS). Each query carries a mask (of shape H × W for images or T × H × W for video clips), a class label (computed against CLIP text embeddings for open-vocabulary settings), and optionally an instance ID for video applications.
- Location Queries: each is generated from user prompts (points or boxes) for SAM-style interactive segmentation via a dedicated prompt encoder. Location queries attend only to their prompted regions and are excluded from self-attention, ensuring no cross-interactions between prompts.
The model accommodates, with the same decoder, standard image and video tasks, open-vocabulary cases (with text-embedding-based classification), and prompt-driven interactive scenarios; a query-construction sketch follows the task list below. This design supports:
- Semantic Segmentation (SS)
- Instance Segmentation (IS)
- Panoptic Segmentation (PS)
- Video Semantic (VSS), Video Instance (VIS), and Video Panoptic (VPS)
- Video Object Segmentation (VOS, class-agnostic tracking)
- Open-vocabulary and interactive segmentation
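The following is a hedged sketch of how the two query types could be constructed and fed to the same decoder; the sine-encoded point prompts and the names `UnifiedQueries` and `sine_point_encoding` are illustrative assumptions, not the paper's actual prompt encoder.

```python
import torch
import torch.nn as nn

def sine_point_encoding(points, dim=256):
    """Encode normalized (x, y) prompt points into dim-d embeddings.
    Illustrative stand-in for a SAM-style prompt encoder."""
    # points: (B, P, 2) with coordinates in [0, 1]
    freqs = torch.pow(10000.0, -torch.arange(0, dim // 4) / (dim // 4))
    xy = points.unsqueeze(-1) * freqs              # (B, P, 2, dim/4)
    enc = torch.cat([xy.sin(), xy.cos()], dim=-1)  # (B, P, 2, dim/2)
    return enc.flatten(-2)                         # (B, P, dim)

class UnifiedQueries(nn.Module):
    """Semantic queries are learned embeddings; location queries come from prompts.
    Both are concatenated and decoded by the same shared mask decoder."""

    def __init__(self, num_semantic=100, dim=256):
        super().__init__()
        self.semantic = nn.Embedding(num_semantic, dim)
        self.prompt_proj = nn.Linear(dim, dim)

    def forward(self, batch_size, prompt_points=None):
        q_sem = self.semantic.weight.unsqueeze(0).expand(batch_size, -1, -1)
        if prompt_points is None:
            return q_sem                           # plain image/video segmentation
        q_loc = self.prompt_proj(
            sine_point_encoding(prompt_points, self.semantic.embedding_dim))
        return torch.cat([q_sem, q_loc], dim=1)    # interactive / SAM-style mode

queries = UnifiedQueries()
print(queries(2).shape)                        # torch.Size([2, 100, 256])
print(queries(2, torch.rand(2, 5, 2)).shape)   # torch.Size([2, 105, 256])
```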
3. Training Objective and Loss Functions
OMG-Seg conducts Hungarian matching per sample between ground-truth entities and predicted queries. For each matched query, the following losses are computed:
- Classification Loss (Cross-Entropy): $\mathcal{L}_{\text{cls}} = \mathrm{CE}(p, c)$, where $p$ is the predicted class distribution (cosine similarity to CLIP text embeddings) and $c$ is the matched ground-truth label.
- Mask Losses:
  - Binary Cross-Entropy: $\mathcal{L}_{\text{bce}} = \mathrm{BCE}(\hat{m}, m)$ between the predicted mask $\hat{m}$ and the ground-truth mask $m$.
  - Dice Loss: $\mathcal{L}_{\text{dice}} = 1 - \dfrac{2\,|\hat{m} \cap m|}{|\hat{m}| + |m|}$.
The global objective sums these over all matched queries and all tasks, leading to a joint loss

$$\mathcal{L} = \sum_{\text{matched queries}} \left( \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}} + \lambda_{\text{bce}}\,\mathcal{L}_{\text{bce}} + \lambda_{\text{dice}}\,\mathcal{L}_{\text{dice}} \right),$$

where the default experiments set all $\lambda$ weights to 1. This facilitates parameter sharing and multi-task learning across images, videos, and interactive input (Li et al., 2024).
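A minimal sketch of this matching-and-loss step for a single sample is shown below, using SciPy's Hungarian solver; the matching-cost terms and tensor shapes are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def dice_loss(pred_mask, gt_mask, eps=1.0):
    """Soft Dice loss between a predicted mask (logits) and a binary GT mask."""
    p = pred_mask.sigmoid().flatten()
    g = gt_mask.flatten()
    return 1 - (2 * (p * g).sum() + eps) / (p.sum() + g.sum() + eps)

def omg_seg_loss(cls_logits, mask_logits, gt_labels, gt_masks,
                 w_cls=1.0, w_bce=1.0, w_dice=1.0):
    """Hungarian-match N predicted queries to K ground-truth entities, then sum
    cross-entropy + BCE + Dice over the matched pairs (all weights default to 1)."""
    with torch.no_grad():
        # Matching cost: negative class probability plus a simple mask L1 term.
        cost_cls = -cls_logits.softmax(-1)[:, gt_labels]                  # (N, K)
        cost_mask = torch.cdist(mask_logits.sigmoid().flatten(1),
                                gt_masks.flatten(1).float(), p=1)         # (N, K)
        rows, cols = linear_sum_assignment((cost_cls + cost_mask).cpu().numpy())

    loss = 0.0
    for q, t in zip(rows, cols):
        loss = loss + w_cls * F.cross_entropy(cls_logits[q:q+1], gt_labels[t:t+1])
        loss = loss + w_bce * F.binary_cross_entropy_with_logits(
            mask_logits[q], gt_masks[t].float())
        loss = loss + w_dice * dice_loss(mask_logits[q], gt_masks[t])
    return loss

# Toy example: 10 queries, 3 ground-truth entities, 16x16 masks, 20 classes.
loss = omg_seg_loss(torch.randn(10, 20), torch.randn(10, 16, 16),
                    torch.tensor([1, 4, 7]), torch.randint(0, 2, (3, 16, 16)))
print(loss)
```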
4. Co-Training and Dataset Integration
OMG-Seg is trained via joint multi-task learning with balanced sampling across diverse datasets:
- Image tasks: COCO Panoptic, COCO SAM (synthetic prompts), ADE-20k (open-vocab)
- Video tasks: VIPSeg (VPS), YouTube-VIS 2019 (VIS), DAVIS-17 (VOS)
- Open-vocabulary evaluation: YouTube-VIS 2021, ADE-20k, DAVIS-17
Batch construction uses COCO as an anchor and up/down-samples other datasets for balanced task exposure, with pseudo-video creation for static images (duplicated as 2-frame clips). The frozen backbone mitigates catastrophic forgetting, and unified queries require decoders to integrate both spatial and temporal information. Empirical analysis (Tab. 4) shows positive transfer: adding VIPSeg notably boosts video performance metrics with minor negative impact on COCO PS; integrating YouTube-VIS further improves VIS and open-vocabulary VIS.
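A rough sketch of the pseudo-video construction and COCO-anchored balancing could look like the following; the dataset sizes and the repeat-to-anchor heuristic are illustrative assumptions, not the paper's exact sampling scheme.

```python
import random

def make_pseudo_clip(image, num_frames=2):
    """Turn a static image into a short clip by duplicating it, so image datasets
    can be trained through the same video (tube-mask) interface."""
    return [image] * num_frames

def balanced_indices(dataset_sizes, anchor="coco"):
    """Return per-dataset sample indices, repeating or truncating each dataset so
    its effective size matches the anchor dataset (COCO in the paper's setup)."""
    target = dataset_sizes[anchor]
    plan = {}
    for name, size in dataset_sizes.items():
        idx = list(range(size))
        repeated = (idx * (target // size + 1))[:target]  # up-sample small datasets
        random.shuffle(repeated)
        plan[name] = repeated
    return plan

# Illustrative dataset sizes (not exact counts).
sizes = {"coco": 118000, "vipseg": 59000, "ytvis19": 62000, "davis17": 6000}
plan = balanced_indices(sizes)
print({k: len(v) for k, v in plan.items()})  # each dataset contributes ~118k samples per epoch
```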
5. Parameterization and Model Sharing
OMG-Seg is characterized by almost complete parameter sharing:
- All tasks use the same frozen visual backbone (CLIP), pixel decoder, and transformer mask decoder.
- No per-task output heads are used—only the query embeddings and classifier (semantic/location, class/prompt) differentiate the task.
- This yields a drastic parameter reduction: 221M parameters versus ~1326M for a naïve multi-head baseline (roughly 6× smaller), at approximately 868 GFLOPs.
Previous “unified” segmentation models generally maintain separate heads per task, whereas OMG-Seg’s design forgoes this, demonstrating that a single decoder suffices for competitive results across >10 tasks.
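The scale of this saving can be illustrated with a toy parameter count: a shared decoder is counted once, while a naive multi-head baseline replicates it per task. The decoder depth and width below are placeholders, not the paper's exact numbers.

```python
import torch.nn as nn

def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

def make_decoder(dim=256, layers=9):
    layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=layers)

shared = count_params(make_decoder())                             # one decoder for all tasks
per_task = sum(count_params(make_decoder()) for _ in range(10))   # naive: one per task
print(f"shared decoder: {shared/1e6:.1f}M, ten task-specific decoders: {per_task/1e6:.1f}M")
```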
6. Quantitative Performance and Ablation Analysis
The main single model, built on a frozen ConvNeXt-Large CLIP backbone, achieves:
| Task | Metric | Score |
|---|---|---|
| COCO Panoptic | PQ | 53.8 |
| Cityscapes Panoptic | PQ | 65.7 |
| COCO Instance | mAP | 44.5 |
| VIPSeg VPS | VPQ | 49.8 |
| YouTube-VIS-19 VIS | mAP | 56.4 |
| YouTube-VIS-21 OV VIS | mAP | 50.5 |
| ADE-20k OV Panoptic | PQ | 27.9 |
| DAVIS-17 OV VOS | J&F | 74.3 |
| COCO-SAM (interactive) | mIoU | 58.0 |
Scaling up the backbone to ConvNeXt-XX-Large brings +1–4 points to all tasks. Joint co-training across 5 datasets retains performance on video tasks and only yields a minor drop in ADE-20k panoptic scores due to class frequency imbalance (Tab. 3).
Ablation studies reveal:
- Decoder and adapter sharing saves 22M parameters with negligible performance cost.
- Deeper pixel decoders slightly improve PQ and VPQ for short schedules, but differences vanish at longer training.
- Model performance scales positively with backbone size.
- Adding more video data (VIPSeg, YouTube-VIS) consistently improves video and open-vocabulary segmentation, with minimal effect on core image tasks.
- Masking interactive queries in self-attention significantly improves COCO-SAM mIoU (from 40.7 to 52.2).
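The last ablation above concerns how interactive queries are isolated during self-attention. A minimal sketch of such masking, assuming a boolean attention mask passed to PyTorch's `nn.MultiheadAttention` (True = blocked) and an arbitrary 100 + 5 query layout, is shown below.

```python
import torch
import torch.nn as nn

def build_self_attn_mask(num_semantic, num_location):
    """Boolean mask (True = no attention). Semantic queries attend to each other;
    location (prompt) queries are isolated so prompts do not interact."""
    n = num_semantic + num_location
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[num_semantic:, :] = True          # location queries attend to nothing else...
    mask[:, num_semantic:] = True          # ...and nothing attends to them
    idx = torch.arange(n)
    mask[idx, idx] = False                 # keep the diagonal so softmax stays defined
    return mask

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
queries = torch.randn(2, 100 + 5, 256)     # 100 semantic + 5 location queries
out, _ = attn(queries, queries, queries, attn_mask=build_self_attn_mask(100, 5))
print(out.shape)                           # torch.Size([2, 105, 256])
```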
7. Architectural Illustrations and Summary
Architectural schematics (Fig. 2) demonstrate the overall system flow:
- (a) Composite structure: CLIP backbone, pixel adapter, shared mask decoder, prompt encoder.
- (b) Transformer decoder layers, explicitly distinguishing masked self-attention (for location queries) from standard self-attention (for semantic queries).
- (c) Training and inference pathways, including open-vocabulary classification via CLIP text embeddings.
Associated tables (Tabs. 2–9) provide exhaustive metrics across tasks and detailed analyses of co-training impact, architectural component variations, and query attention mechanisms.
OMG-Seg establishes that a straightforward DETR-style transformer architecture with comprehensively shared parameters and carefully constructed queries, when co-trained on a diverse task suite, can deliver robust, unified performance in semantic, instance, panoptic, video, open-vocabulary, and interactive segmentation (Li et al., 2024).