Segment Everything Everywhere All at Once (2304.06718v4)

Published 13 Apr 2023 in cs.CV

Abstract: In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig.1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like LLMs. More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of two prompt types required for various segmentation tasks; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from decoder to image features; and iv) Semantic-awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. Notably, our single SEEM model achieves competitive performance across interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with minimum 1/100 supervision. Furthermore, SEEM showcases a remarkable capacity for generalization to novel prompts or their combinations, rendering it a readily universal image segmentation interface.

Analyzing SEEM: A Comprehensive Approach to Image Segmentation

The paper "Segment Everything Everywhere All at Once" introduces SEEM, a model for image segmentation tasks that provides a unified and interactive interface. Focused on the segmentation challenge, SEEM addresses various segmentation needs—semantic, instance, and panoptic segmentation—within an open-set framework. The authors emphasize SEEM’s versatility, compositionality, interactivity, and semantic-awareness, drawing analogies between its universal interface potential to the capabilities exhibited by LLMs.

Technical Methodology and Key Design Elements

SEEM is built around a novel decoder mechanism that supports diverse prompting, analogous to how LLMs handle prompts in text. A new visual prompt unifies different spatial queries, including points, boxes, scribbles, and masks, by mapping them into a joint visual-semantic space shared with text prompts. This shared space lets visual and textual prompts be composed dynamically, which underpins SEEM's compositionality. Learnable memory prompts in the decoder retain segmentation history through mask-guided cross-attention from the decoder to image features, enabling multi-round interactivity. Finally, a text encoder maps text queries and mask labels into the same semantic space, which supports open-vocabulary segmentation. Together, these components move SEEM toward a universal segmentation interface, as the sketch below illustrates.
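To make the decoding mechanism concrete, here is a minimal PyTorch-style sketch of the idea, not the authors' implementation: all names (`PromptUnifiedDecoder`, `encode_spatial_prompt`) are hypothetical, and mask-pooling of image features stands in for the paper's prompt encoding. It shows spatial, text, and memory prompts projected into one space and attended to by a shared set of object queries.

```python
import torch
import torch.nn as nn

class PromptUnifiedDecoder(nn.Module):
    """Hypothetical sketch of SEEM-style unified prompting (not the paper's code).

    Every prompt type -- spatial (point/box/scribble/mask), text, and learnable
    memory -- is projected into one joint visual-semantic space, so a single
    set of object queries can attend to any composition of prompts.
    """

    def __init__(self, dim=256, num_queries=100, num_memory=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # object queries
        self.memory = nn.Parameter(torch.randn(num_memory, dim))    # memory prompts (interactivity)
        self.visual_proj = nn.Linear(dim, dim)  # pooled image features -> visual prompt
        self.text_proj = nn.Linear(dim, dim)    # text-encoder outputs -> text prompt
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def encode_spatial_prompt(self, feat_map, mask):
        """Unify a point/box/scribble/mask (all given as a binary mask) into one
        visual prompt by mask-pooling image features -- a simplification of the
        paper's visual prompt encoding."""
        m = mask.flatten(1).float()                       # (B, H*W)
        f = feat_map.flatten(2).transpose(1, 2)           # (B, H*W, C)
        pooled = (m.unsqueeze(-1) * f).sum(1) / m.sum(1, keepdim=True).clamp(min=1)
        return self.visual_proj(pooled).unsqueeze(1)      # (B, 1, C)

    def forward(self, feat_map, spatial_mask=None, text_emb=None):
        b = feat_map.shape[0]
        prompts = [self.memory.expand(b, -1, -1)]         # memory prompts always present
        if spatial_mask is not None:
            prompts.append(self.encode_spatial_prompt(feat_map, spatial_mask))
        if text_emb is not None:                          # (B, T, C) from a text encoder
            prompts.append(self.text_proj(text_emb))
        prompts = torch.cat(prompts, dim=1)               # any composition of prompt types
        q = self.queries.expand(b, -1, -1)
        out, _ = self.attn(q, prompts, prompts)           # queries attend to all prompts
        # dot-product with pixel features yields per-query mask logits; the
        # semantic (open-vocabulary) classification head is omitted here
        return torch.einsum("bqc,bchw->bqhw", out, feat_map)
```

Because every prompt lands in the same space, supporting a new prompt type only requires a new projection into that space; the decoder itself is unchanged, which is what makes the interface composable.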

Empirical Validation

Empirical studies validate SEEM's performance across nine diverse datasets. A single SEEM model achieves competitive results on interactive segmentation, generic segmentation, referring segmentation, and video object segmentation, using as little as 1/100 of the supervision of comparable task-specific models. This efficiency underscores the model's ability to generalize to novel prompts and prompt combinations, such as segmenting with an exemplar region taken from a referring image.

Implications for Image Segmentation

The introduction of SEEM marks a potential shift toward universal models for image segmentation. Its joint visual-semantic space advances the automatic alignment between visual input and text, which can significantly reduce the manual labor needed for dataset labeling. Furthermore, SEEM's compositionality lets practitioners interactively refine and adapt segmentation tasks without retraining or significant modification of the model, as the usage sketch below illustrates.
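As an illustration of that interactive refinement, here is a hypothetical usage loop reusing the `PromptUnifiedDecoder` sketch above; the backbone features, text embedding, and click are random stand-ins, and no weights are updated between rounds.

```python
import torch

# Hypothetical interaction loop: prompts are composed and refined across
# rounds with no retraining (decoder weights never change).
decoder = PromptUnifiedDecoder(dim=256)
feat = torch.randn(1, 256, 64, 64)    # stand-in for backbone image features
text = torch.randn(1, 1, 256)         # stand-in for an encoded text query

click = torch.zeros(1, 64, 64)
click[0, 30, 30] = 1                  # a user click, encoded as a one-pixel mask
for round_idx in range(3):            # three interaction rounds
    logits = decoder(feat, spatial_mask=click, text_emb=text)
    pred = logits[:, 0] > 0           # first query's mask, thresholded at 0
    print(f"round {round_idx}: predicted mask area = {pred.float().mean():.3f}")
    # a real loop would add a corrective click where `pred` disagrees with
    # the user's intent; the same click is reused here only for brevity
```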

Future Prospects in AI

Looking forward, SEEM's approach could inspire more generalized models that accommodate multi-modal inputs. Given its promising results, future research could extend SEEM's framework beyond image segmentation to other domains where interactive, multi-modal data handling is essential. Continued growth in computational power and richer training datasets will likely support this trajectory, leading to more capable and robust AI models.

In conclusion, SEEM contributes a model designed for adaptability and universality, a testament to the potential of prompt-based architectures in image processing. SEEM sets a high standard for future segmentation interfaces, and its success underscores the broader movement toward interactive, universal AI models capable of handling complex tasks.

Authors (9)
  1. Xueyan Zou (21 papers)
  2. Jianwei Yang (93 papers)
  3. Hao Zhang (947 papers)
  4. Feng Li (286 papers)
  5. Linjie Li (89 papers)
  6. Jianfeng Wang (149 papers)
  7. Lijuan Wang (133 papers)
  8. Jianfeng Gao (344 papers)
  9. Yong Jae Lee (88 papers)
Citations (347)