
OMG-Seg: One Model for All Segmentation

Updated 17 March 2026
  • The paper demonstrates a unified transformer architecture that achieves competitive segmentation performance across varied tasks while reducing parameters.
  • It employs a frozen CLIP backbone with a shared query-based decoder and joint vision–language interface to handle image and video tasks without task-specific fine-tuning.
  • Empirical results show competitive performance with only a 3–6 point drop versus specialized models, highlighting effective multi-task co-training and efficiency gains.

OMG-Seg (One Model is Good Enough for All Segmentation) is a unified transformer-based architecture for visual segmentation. It supports a broad spectrum of tasks: image semantic segmentation (SS), instance segmentation (IS), and panoptic segmentation (PS); their video counterparts; and open-vocabulary, prompt-driven, and interactive segmentation. The model achieves competitive performance across more than ten segmentation tasks while significantly reducing parameter count and computational requirements compared to independently trained or partially unified task-specific models. OMG-Seg provides a joint vision–language interface through a frozen CLIP backbone and a shared query-based decoder, enabling cross-task generalization and open-vocabulary recognition without architectural modifications or fine-tuning of the visual encoder (Li et al., 2024).

1. Unified Transformer-Based Architecture

OMG-Seg employs a transformer-based encoder–decoder framework with the following components:

  • Backbone: A frozen CLIP visual encoder (specifically, ConvNeXt from OpenCLIP) extracts multi-scale features from input images or video frames. The backbone remains unchanged during downstream training, preserving its open-vocabulary capabilities.
  • Pixel Decoder: The pixel decoder adopts Mask2Former’s multi-scale deformable attention “neck” to synthesize a unified multi-level feature pyramid, denoted $\{F_j^{\text{fuse}}\}$.
  • Query System: All tasks are managed via two query types:
    • Semantic queries ($Q^{s}_{obj}$): Capture object and stuff masks for images, or spatiotemporal tubes for videos.
    • Location queries ($Q^{l}_{obj}$): Encode spatial prompts (e.g., points or boxes) for interactive use cases.
  • Mask Decoder: A stack of transformer decoder layers processes the unified query set. For most tasks, masked cross-attention accesses the pixel feature pyramid, and queries interact through multi-head self-attention; for prompt-driven interactive tasks, self-attention is disabled to ensure queries attend only to local prompt regions.
  • Open-Vocabulary Head: Class predictions are produced by computing the cosine similarity between mask-centered image features and corresponding class name embeddings extracted from the frozen CLIP text encoder, rather than via learned classifier weights.

This configuration allows uniform handling of disparate segmentation tasks, with all outputs parameterized as sets of entities (masks and optional class labels or IDs), and yields a highly extensible framework for segmentation applications.
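The open-vocabulary head described above can be illustrated with a minimal sketch. It assumes features are already extracted as plain Python lists; the helper names (`mask_pooled_feature`, `classify_open_vocab`) and the toy 2-D embeddings are hypothetical and stand in for real CLIP image and text features.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def mask_pooled_feature(features, mask):
    """Average the per-pixel feature vectors selected by a binary mask.

    features: list of per-pixel vectors (flattened H*W), each of dimension D.
    mask:     list of 0/1 values of the same length.
    """
    dim = len(features[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(features, mask):
        if m:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    return [t / count for t in total] if count else total

def classify_open_vocab(pooled, text_embeddings):
    """Score every class name by cosine similarity to the pooled feature."""
    q = l2_normalize(pooled)
    scores = {
        name: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for name, emb in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy example: 4 pixels with 2-D features; the mask selects the first two pixels.
features = [[0.9, 0.1], [1.1, -0.1], [0.0, 1.0], [0.0, 1.0]]
mask = [1, 1, 0, 0]
# Hypothetical text embeddings; in practice these come from the CLIP text encoder.
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

best, scores = classify_open_vocab(mask_pooled_feature(features, mask), text_embeddings)
```

Because the class list is just a dictionary of text embeddings, adding a new category at test time only requires embedding its name, which is the property the frozen backbone is meant to preserve.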

2. Formal Mathematical Task Unification

OMG-Seg recasts all segmentation problems into a unified set-prediction paradigm using object queries:

Let $I\in\mathbb{R}^{H\times W\times 3}$ and $V\in\mathbb{R}^{T\times H\times W\times 3}$ denote image and video input, respectively, and define target sets as:

  • Image segmentation (SS/IS/PS): $\{y_i\}_{i=1}^{G}$ with $y_i = (m_i, c_i)$, $m_i\in\{0,1\}^{H\times W}$, $c_i\in C_{\text{image}}$
  • Video segmentation (VIS/VPS/VOS): $\{y_i\}_{i=1}^{N}$ with $y_i = (m_i, c_i, d_i)$, $m_i\in\{0,1\}^{T\times H\times W}$, $c_i\in C_{\text{video}}$, $d_i\in\mathbb{N}$
  • Interactive segmentation: Prompts $P\in\mathbb{R}^{N\times p}$, with one class-agnostic binary mask $m_i\in\{0,1\}^{H\times W}$ predicted per prompt.
  • Open-vocabulary segmentation: $\{y_i\}$ as above, with $c_i$ not restricted to a closed set, enabling recognition of novel classes via text embeddings.

Each decoder query $q_i$ produces:

  • Mask logits: $\hat{m}_i\in\mathbb{R}^{H\times W}$ (or $\hat{m}_i\in\mathbb{R}^{T\times H\times W}$ for videos)
  • Class scores (text-based): $s_{i,c} = \cos(f_i, t_c)$, where $f_i$ is the mask-pooled image feature for query $q_i$ and $t_c$ is the CLIP text embedding for class $c$.

This formalism supports all tasks and modes, including hybrid formulations (e.g., prompt-driven video segmentation or open-vocabulary interactive segmentation) through simple mixing of query types.
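As a rough illustration of this set-of-entities view, the sketch below stores every output as a mask plus an optional class label and an optional track ID. The `Entity` type and helper functions are hypothetical names introduced for this example only; they are not taken from the OMG-Seg code base.

```python
from dataclasses import dataclass
from typing import List, Optional

# T x H x W binary mask; T == 1 for a single image.
Mask = List[List[List[int]]]

@dataclass
class Entity:
    mask: Mask
    label: Optional[str] = None     # None for class-agnostic outputs (e.g. interactive)
    track_id: Optional[int] = None  # None for image tasks

def image_entity(mask_2d, label=None):
    """Wrap an H x W mask as a one-frame entity."""
    return Entity(mask=[mask_2d], label=label)

def video_entity(mask_3d, label=None, track_id=None):
    """Wrap a T x H x W tube together with its identity across frames."""
    return Entity(mask=mask_3d, label=label, track_id=track_id)

def describe(entity):
    """Summarize which task family an entity looks like."""
    frames = len(entity.mask)
    kind = "video tube" if frames > 1 else "image mask"
    cls = entity.label if entity.label is not None else "class-agnostic"
    return f"{kind}, {frames} frame(s), {cls}"

panoptic = image_entity([[0, 1], [1, 1]], label="sky")
tube = video_entity([[[1, 0]], [[1, 1]]], label="person", track_id=3)
click = image_entity([[1, 0], [0, 0]])  # interactive result, no class label
```

Under this representation, a hybrid mode such as prompt-driven video segmentation is simply an entity with a multi-frame mask and no label.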

3. Multi-Task Co-Training and Loss Design

OMG-Seg's multi-task regime samples training data from all included image and video datasets. Query outputs are matched to ground truth via the Hungarian matching algorithm, supporting set-based predictions. For each matched query–target pair $(q_i, y_j)$, the following losses are computed:

  • Classification loss: $\mathcal{L}_{cls}$ (cross-entropy on the text-based classifier)
  • Mask probability (CE) loss: $\mathcal{L}_{ce}$
  • Dice loss: $\mathcal{L}_{dice}$

The overall loss is a weighted sum: $\mathcal{L} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{ce}\mathcal{L}_{ce} + \lambda_{dice}\mathcal{L}_{dice}$.
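The matching step and the weighted sum of the three loss terms can be sketched as follows. This is a toy version under stated assumptions: masks are flattened lists of probabilities, the matching cost simply reuses the training loss, the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm itself, and the weights in `WEIGHTS` are placeholder values rather than the paper's settings.

```python
import math
from itertools import permutations

# Placeholder loss weights (assumed for illustration, not from the source).
WEIGHTS = {"cls": 2.0, "ce": 5.0, "dice": 5.0}

def dice_loss(probs, target, eps=1.0):
    """1 - Dice coefficient between predicted probabilities and a binary mask."""
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def mask_ce_loss(probs, target, tiny=1e-7):
    """Mean per-pixel binary cross-entropy."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, tiny), 1.0 - tiny)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def cls_loss(class_probs, label):
    """Cross-entropy of the predicted class distribution against the true label."""
    return -math.log(max(class_probs[label], 1e-7))

def pair_loss(pred, tgt, w=WEIGHTS):
    """Weighted sum of classification, mask CE, and Dice terms for one pair."""
    return (w["cls"] * cls_loss(pred["class_probs"], tgt["label"])
            + w["ce"] * mask_ce_loss(pred["mask"], tgt["mask"])
            + w["dice"] * dice_loss(pred["mask"], tgt["mask"]))

def match(preds, targets, w=WEIGHTS):
    """Brute-force optimal assignment (stand-in for Hungarian matching).

    Returns (assignment, total_loss) where assignment[j] is the index of the
    query matched to target j. Requires len(preds) >= len(targets).
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(pair_loss(preds[q], targets[j], w) for j, q in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost

preds = [
    {"mask": [0.9, 0.8, 0.1, 0.1], "class_probs": {"cat": 0.7, "dog": 0.3}},
    {"mask": [0.1, 0.2, 0.9, 0.9], "class_probs": {"cat": 0.2, "dog": 0.8}},
]
targets = [{"mask": [0, 0, 1, 1], "label": "dog"}]
assignment, total = match(preds, targets)
```

Brute force is only viable for a handful of queries; a real implementation would use a polynomial-time assignment solver, but the cost structure per pair is the same.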

This multi-task design exploits knowledge transfer across tasks and domains. For example, video tasks benefit from the large volume of static images via pseudo-video sampling in the joint regime.

4. Supported Task Spectrum and Modality Generalization

OMG-Seg supports twelve segmentation task classes, including:

  • Image tasks: Semantic, instance, and panoptic segmentation.
  • Video tasks: Video instance segmentation (VIS), video panoptic segmentation (VPS), video semantic segmentation (VSS), and video object segmentation (VOS). VOS is addressed by class-agnostic tube matching to first-frame ground truth.
  • Interactive/prompt-driven segmentation: Location queries represent user-provided points or boxes, enabling SAM-style interaction while reusing the shared decoder.
  • Open-vocabulary segmentation: CLIP-based semantic head enables recognition of unseen categories at test time.
  • Mixed modes: Combinations such as prompt-driven video segmentation and open-vocabulary interactive segmentation are directly supported via unified query input.

A plausible implication is that the query-based abstraction enables the emergence of new segmentation modes "for free" through compositional mixing of input query types.
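One way to picture mixing the two query types in a single decoder pass is a boolean self-attention mask. In the sketch below, semantic queries may attend to each other, while each location query attends only to itself, which removes interaction between prompts. This is an assumed mechanism for illustration: the source only states that self-attention is disabled for interactive queries, and the function name is hypothetical.

```python
def build_self_attention_mask(num_semantic, num_location):
    """Return an n x n matrix where allowed[i][j] is True if query i may attend to query j.

    Queries are ordered as [semantic..., location...].
    """
    n = num_semantic + num_location
    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both_semantic = i < num_semantic and j < num_semantic
            # Location queries are isolated: they only see themselves.
            allowed[i][j] = both_semantic or i == j
    return allowed

# Two semantic queries followed by two location (prompt) queries.
attn = build_self_attention_mask(num_semantic=2, num_location=2)
```

With zero location queries the mask is fully dense, and with zero semantic queries it reduces to the identity, so the same routine covers pure and mixed modes.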

5. Empirical Performance and Ablations

Comparative quantitative results show that OMG-Seg's unified approach stays within about 3–6 PQ/mAP points of the strongest dedicated models on COCO and VIPSeg, and exceeds the listed baseline on YT-VIS-19, while operating over a much broader task set:

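| Task | Metric | Mask2Former | Tube-Link | TarViS | OMG-Seg (ConvNeXt-L) |
|---|---|---|---|---|---|
| COCO-PS | PQ | 57.8 | - | - | 53.8 |
| COCO-IS | mAP | 50.1 | - | - | 44.5 |
| VIPSeg-VPS | VPQ | - | 54.5 | 48.0 | 49.8 |
| YT-VIS-19 | mAP | - | 52.8 | - | 56.4 |
| YT-VIS-21-OV | mAP | - | - | - | 50.5 |
| ADE-OV | PQ | - | - | - | 27.9 |
| DAVIS-17 VOS | J&F | - | - | - | 74.3 |
| COCO-SAM | mIoU | - | - | - | 58.0 |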

Ablation studies reveal:

  • Joint multi-dataset co-training boosts video task performance significantly while maintaining image segmentation accuracy. For instance, training on COCO alone yields a VIPSeg VPQ of 32.2, while joint training with video datasets yields 48.5.
  • Parameter sharing via a unified decoder outperforms decoupled task-specific heads on VIPSeg and reduces parameter overhead (221 M vs. 243 M).

Performance on large-scale panoptic benchmarks remains competitive but exhibits a drop, likely due to the increased task balancing challenges and dataset taxonomy shifts.
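The size of the gaps can be checked directly from the reported numbers. The short script below is only arithmetic over values quoted in this section; the dictionary layout and names are illustrative.

```python
# Reported scores: dedicated baselines vs. OMG-Seg (values quoted in this section).
RESULTS = {
    "COCO-PS (PQ)": {"baselines": [57.8], "omg_seg": 53.8},
    "COCO-IS (mAP)": {"baselines": [50.1], "omg_seg": 44.5},
    "VIPSeg-VPS (VPQ)": {"baselines": [54.5, 48.0], "omg_seg": 49.8},
    "YT-VIS-19 (mAP)": {"baselines": [52.8], "omg_seg": 56.4},
}

def gap_to_best(row):
    """OMG-Seg score minus the best listed baseline (negative means a drop)."""
    return round(row["omg_seg"] - max(row["baselines"]), 1)

gaps = {task: gap_to_best(row) for task, row in RESULTS.items()}

# Ablation: VIPSeg VPQ with COCO-only training vs. joint co-training.
cotraining_gain = round(48.5 - 32.2, 1)

for task, gap in gaps.items():
    print(f"{task}: {gap:+.1f}")
print(f"Co-training gain on VIPSeg: {cotraining_gain:+.1f}")
```

The drops on COCO and VIPSeg fall between 4 and 6 points, the YT-VIS-19 result is above its listed baseline, and joint training adds roughly 16 VPQ points over COCO-only training.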

6. Efficiency: Parameter, Computational, and Scaling Properties

OMG-Seg achieves considerable efficiency gains:

  • A single 221 M-parameter model replaces three independent Mask2Former-style models for COCO-PS, VIPSeg-VPS, and YT-VIS (totaling ≈1.3 B parameters), a reduction by a factor of approximately 6.
  • Inference cost for high-resolution images (1200×800) is ≈868 GFLOPs, matching that of Mask2Former and incurring no additional penalty for supporting multiple tasks.
  • By leveraging the frozen CLIP backbone and shared query system, no architectural branching or task-specific parameter replication is required.

These properties support scalable deployment and adaptation to broader segmentation scenarios.
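The parameter savings follow from simple arithmetic on the figures quoted in this article. The variable names below are illustrative, and the inputs are the approximate counts stated above.

```python
# Approximate parameter counts quoted in this article.
separate_models_params = 1.3e9   # three independent Mask2Former-style models
unified_params = 221e6           # single OMG-Seg model
decoupled_heads_params = 243e6   # ablation variant with task-specific heads

reduction_factor = separate_models_params / unified_params
fraction_saved = 1.0 - unified_params / separate_models_params
head_sharing_saving = (decoupled_heads_params - unified_params) / decoupled_heads_params

print(f"Reduction factor: {reduction_factor:.2f}x")
print(f"Parameters saved vs. separate models: {fraction_saved:.0%}")
print(f"Saved by sharing the decoder: {head_sharing_saving:.1%}")
```

The ratio comes out just under 6, consistent with the "approximately 6" figure, and sharing the decoder instead of using decoupled heads trims roughly a further 9% relative to the 243 M variant.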

7. Methodological Insights, Limitations, and Future Directions

The results of OMG-Seg demonstrate the practical viability of extensive parameter sharing for segmentation:

  • Effectiveness of "one model": A single model can deliver strong, balanced segmentation performance across tasks that traditionally require extensive specialization (Li et al., 2024).
  • Emergent open-vocabulary ability: Freezing the CLIP visual encoder and performing classification with cosine similarity to CLIP text embeddings enables open-vocabulary recognition with no additional cost or tuning.
  • Knowledge transfer: Unified queries foster transfer across tasks. For example, video segmentation benefits from knowledge learned on image tasks, and interactive and open-vocabulary settings co-train without negative mutual interference.
  • Limitations: Moderate performance degradation is observed on large/balanced semantic and panoptic benchmarks. Dataset imbalance and label taxonomy shifts may require further algorithmic or architectural innovation.
  • Research directions: Opportunities for extension include scaling up the transformer decoder, integrating more adapters, adding explicit language input for referring segmentation, or coupling with LLMs for advanced vision–language reasoning.

In sum, OMG-Seg establishes a robust, extensible paradigm for unified segmentation, setting a precedent for architectures that can "do it all" in both image and video domains, and suggesting a trajectory toward truly general-purpose segmentation models (Li et al., 2024).
