
OMG-Seg: Is One Model Good Enough For All Segmentation? (2401.10229v2)

Published 18 Jan 2024 in cs.CV

Abstract: In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.

Authors (9)
  1. Xiangtai Li (128 papers)
  2. Haobo Yuan (22 papers)
  3. Wei Li (1122 papers)
  4. Henghui Ding (87 papers)
  5. Size Wu (12 papers)
  6. Wenwei Zhang (77 papers)
  7. Yining Li (29 papers)
  8. Kai Chen (512 papers)
  9. Chen Change Loy (288 papers)
Citations (34)

Summary

  • The paper introduces OMG-Seg, a unified model that efficiently handles diverse segmentation tasks using a shared encoder-decoder architecture.
  • It utilizes a transformer-based design with task-specific queries to streamline training and inference across semantic, instance, and panoptic segmentation.
  • The model demonstrates competitive accuracy across over ten segmentation tasks while reducing computational footprint and parameter complexity.

Introduction to OMG-Seg

The landscape of visual segmentation in computer vision is complex, with a variety of tasks that have traditionally required distinct models or architectures. The newly introduced OMG-Seg, short for "One Model that is Good enough," aims to unify this area. Unlike previous models that focus on particular segmentation tasks, OMG-Seg addresses multiple segmentation challenges with a single encoder-decoder architecture. Specifically, it covers image and video semantic, instance, and panoptic segmentation, as well as interactive (SAM-like) and open-vocabulary segmentation.

Achievements and Evaluation

OMG-Seg uses a transformer-based encoder-decoder architecture that integrates task-specific queries and outputs. Because one set of shared weights serves many tasks, the model reduces the footprint of traditional per-task pipelines in both computation and parameter count. Its performance has been evaluated across more than ten distinct segmentation tasks and multiple datasets, showing that OMG-Seg maintains a satisfactory level of accuracy while covering a broad scope of applications.

Technology Behind OMG-Seg

The core of OMG-Seg's innovation lies in its shared encoder-decoder structure coupled with a unified representation for different segmentation outputs. Queries within the model can represent masks, unique instance IDs, or visual prompts, so a single shared decoder can process this diverse array of queries. This design allows significant parameter sharing across tasks and simplifies both training and inference. By co-training on combined image and video datasets, OMG-Seg handles segmentation targets ranging from individual frames to entire video sequences.
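
To make the query mechanism concrete, below is a minimal sketch of a task-agnostic shared decoder that accepts both learned object queries and prompt-derived queries. This is an illustrative assumption of how such a design can look, not the authors' actual implementation (see the linked repository for that); all module names, shapes, and hyperparameters here are hypothetical.

    # Minimal sketch of a shared decoder with unified queries, in the
    # spirit of OMG-Seg. Names and shapes are illustrative assumptions,
    # not the paper's actual code.
    import torch
    import torch.nn as nn

    class SharedSegDecoder(nn.Module):
        def __init__(self, dim=256, num_object_queries=100, num_layers=6):
            super().__init__()
            # Learned object queries shared by semantic/instance/panoptic tasks.
            self.object_queries = nn.Embedding(num_object_queries, dim)
            # A standard transformer decoder processes any set of queries
            # (learned object queries or encoded visual prompts) identically.
            layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
            self.mask_head = nn.Linear(dim, dim)  # queries -> mask embeddings
            self.cls_head = nn.Linear(dim, dim)   # queries -> a CLIP-like embedding space

        def forward(self, feats, pixel_embed, prompt_queries=None):
            # feats: (B, HW, dim) flattened encoder features
            # pixel_embed: (B, dim, H, W) per-pixel embeddings for mask prediction
            B = feats.size(0)
            queries = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)
            if prompt_queries is not None:
                # Interactive mode: encoded visual prompts become extra queries
                # and flow through the exact same decoder.
                queries = torch.cat([queries, prompt_queries], dim=1)
            q = self.decoder(queries, feats)
            # Each query yields a mask via dot product with per-pixel embeddings.
            masks = torch.einsum("bqc,bchw->bqhw", self.mask_head(q), pixel_embed)
            # For open-vocabulary classification, these embeddings would be
            # compared against frozen text embeddings outside this module.
            cls_embed = self.cls_head(q)
            return masks, cls_embed

    # Illustrative usage with dummy tensors:
    dec = SharedSegDecoder()
    feats = torch.randn(2, 64 * 64, 256)   # flattened encoder features
    pixels = torch.randn(2, 256, 64, 64)   # per-pixel embeddings
    prompts = torch.randn(2, 5, 256)       # 5 encoded point/box prompts
    masks, cls_embed = dec(feats, pixels, prompt_queries=prompts)
    # masks: (2, 105, 64, 64); cls_embed: (2, 105, 256)

The design point mirrored here is that the decoder itself is task-agnostic: whether a query came from a learned embedding or an interactive prompt, it is decoded the same way, and the task distinction lives entirely in which queries are fed in and how their outputs are interpreted.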

Unified Approach Over Specialized Methods

Comparisons with other methods shed light on the competitiveness and potential of OMG-Seg. While specialized models may retain an edge on certain individual tasks, none matches the universality of OMG-Seg's framework. A comparative evaluation against various models shows that OMG-Seg holds its ground against task-specific architectures. Its ability to operate in a wide range of scenarios, including complex video segmentation and interactive settings, underscores the model's flexibility.

In essence, OMG-Seg is not just another segmentation tool; it is a step toward a universal model, a Swiss Army knife of visual segmentation. By successfully training one model for numerous segmentation tasks, this approach paves the way for more efficient and simplified pipelines in image and video analysis. As visual segmentation continues to play a crucial role in applications such as autonomous driving and augmented reality, the impact of a model like OMG-Seg could be far-reaching.
