Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively (2401.02955v2)

Published 5 Jan 2024 in cs.CV

Abstract: The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into the CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naïve baselines of simply combining SAM and CLIP. Furthermore, aided with image classification data training, our method can segment and recognize approximately 22,000 classes.

Introduction

Vision foundation models (VFMs) have proliferated rapidly, most notably through CLIP and the Segment Anything Model (SAM). These models have driven great strides in computer vision, with SAM becoming a pivotal tool for segmentation tasks and CLIP prized for its striking zero-shot recognition capabilities. However, each model has limitations when operating in isolation: SAM lacks recognition ability, while CLIP struggles with dense predictions. Addressing these shortcomings, this paper introduces Open-Vocabulary SAM, a framework that fuses the strengths of SAM and CLIP to deliver both segmentation and recognition across a vast range of classes.

Knowledge Transfer Modules

The paper details two novel knowledge transfer modules central to this integration: SAM2CLIP and CLIP2SAM. The SAM2CLIP module transfers knowledge from SAM to CLIP through distillation and learnable transformer adapters, aligning the two representations without modifying the robust CLIP encoder. CLIP2SAM performs the reverse transfer, injecting CLIP's semantic knowledge into SAM to add recognition capability while preserving effective segmentation. The two modules operate synergistically within a unified encoder-decoder framework and substantially outperform baselines that naively combine SAM and CLIP without accounting for their architectural differences and knowledge compatibility.
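
To make the two transfer directions concrete, here is a minimal PyTorch-style sketch of how such modules could be wired up. The module names, feature sizes, MSE distillation objective, and cosine-similarity classification head are illustrative assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch of the two transfer modules in PyTorch.  The encoders are
# replaced by random stand-in features; in the real system they would be the
# pretrained SAM and CLIP image encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_SAM, D_CLIP, NUM_TOKENS = 256, 768, 196  # illustrative feature sizes


class SAM2CLIPAdapter(nn.Module):
    """Learnable transformer adapter that maps CLIP tokens toward SAM's
    feature space, so SAM's knowledge can be distilled into the CLIP branch
    without touching the CLIP encoder weights (assumed design)."""

    def __init__(self, d_clip=D_CLIP, d_sam=D_SAM, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_clip, nhead=8,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_clip, d_sam)

    def forward(self, clip_tokens):                  # (B, N, d_clip)
        return self.proj(self.blocks(clip_tokens))   # (B, N, d_sam)


class CLIP2SAMHead(nn.Module):
    """Lightweight head that fuses CLIP semantics with a SAM mask query and
    scores it against CLIP text embeddings of class names."""

    def __init__(self, d_sam=D_SAM, d_clip=D_CLIP):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(d_sam + d_clip, d_clip),
                                  nn.GELU(),
                                  nn.Linear(d_clip, d_clip))

    def forward(self, mask_query, clip_global, text_embeds):
        # mask_query: (B, d_sam), clip_global: (B, d_clip),
        # text_embeds: (C, d_clip) -- CLIP text embeddings of C class names.
        fused = self.fuse(torch.cat([mask_query, clip_global], dim=-1))
        fused = F.normalize(fused, dim=-1)
        text = F.normalize(text_embeds, dim=-1)
        return fused @ text.t()                      # (B, C) class logits


# SAM2CLIP step: align adapted CLIP tokens with (frozen) SAM tokens.
adapter = SAM2CLIPAdapter()
clip_tokens = torch.randn(2, NUM_TOKENS, D_CLIP)     # stand-in CLIP features
sam_tokens = torch.randn(2, NUM_TOKENS, D_SAM)       # stand-in SAM features
distill_loss = F.mse_loss(adapter(clip_tokens), sam_tokens)

# CLIP2SAM step: classify a mask query against 20 candidate class names.
head = CLIP2SAMHead()
class_logits = head(torch.randn(2, D_SAM), torch.randn(2, D_CLIP),
                    torch.randn(20, D_CLIP))
print(distill_loss.item(), class_logits.shape)       # scalar loss, (2, 20)
```

In this sketch, the adapter is trained so that adapted CLIP tokens match SAM tokens (distillation), while the recognition head scores each mask query against CLIP text embeddings of class names; the latter is what makes the recognition open-vocabulary.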

Experiments and Results

Extensive experiments across a spectrum of datasets, including COCO and LVIS, demonstrate the superior performance of Open-Vocabulary SAM. The method achieves over a 20% improvement in recognizing previously unseen objects on the LVIS dataset, along with stronger segmentation and classification performance on COCO. The key is joint training with both segmentation masks and label annotations, which creates synergy between SAM's and CLIP's complementary strengths. This combination allows Open-Vocabulary SAM to interactively segment and recognize approximately 22,000 classes, a substantial increase over its predecessors.
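
As a rough illustration of the joint training described above, the sketch below combines a SAM-style mask loss (BCE plus Dice) with a classification loss over the CLIP-derived class logits; the specific loss terms and weights are assumptions rather than the paper's exact recipe.

```python
# Illustrative joint objective: SAM-style mask losses (BCE + Dice) plus a
# cross-entropy term over class logits.  Weights are assumptions, not the
# paper's tuned values.
import torch
import torch.nn.functional as F


def joint_loss(pred_masks, gt_masks, class_logits, gt_labels,
               w_mask=1.0, w_cls=1.0):
    # pred_masks, gt_masks: (B, H, W); class_logits: (B, C); gt_labels: (B,)
    bce = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
    probs = pred_masks.sigmoid().flatten(1)
    target = gt_masks.flatten(1)
    dice = 1 - (2 * (probs * target).sum(1) + 1) / (
        probs.sum(1) + target.sum(1) + 1)
    cls = F.cross_entropy(class_logits, gt_labels)
    return w_mask * (bce + dice.mean()) + w_cls * cls


# Random tensors stand in for model outputs and annotations.
loss = joint_loss(torch.randn(4, 64, 64),
                  torch.randint(0, 2, (4, 64, 64)).float(),
                  torch.randn(4, 20),
                  torch.randint(0, 20, (4,)))
print(loss.item())
```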

Implications and Future Directions

Open-Vocabulary SAM presents a robust architecture with practical applications in image analysis, including interactive segmentation in settings such as autonomous driving and medical imaging. By effectively segmenting and recognizing a wide range of objects, the model sets the stage for more accurate and efficient image annotation tools. Its open-vocabulary capability also enables use in fields requiring domain-specific recognition, from wildlife conservation to smart city surveillance, by learning from vast and varied data. While this paper marks a leap forward, it also opens avenues for further research into fine-tuning for specific domains and expanding interactive capabilities.

Authors (6)
  1. Haobo Yuan (22 papers)
  2. Xiangtai Li (128 papers)
  3. Chong Zhou (12 papers)
  4. Yining Li (29 papers)
  5. Kai Chen (512 papers)
  6. Chen Change Loy (288 papers)