OMG-Seg: Is One Model Good Enough For All Segmentation? (2401.10229v2)
Abstract: In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.
Authors:
- Xiangtai Li
- Haobo Yuan
- Wei Li
- Henghui Ding
- Size Wu
- Wenwei Zhang
- Yining Li
- Kai Chen
- Chen Change Loy