EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything (2312.00863v1)

Published 1 Dec 2023 in cs.CV

Abstract: Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, the huge computation cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.

The paper introduces EfficientSAM (Efficient Segment Anything Model), a methodology that aims to lower the computational cost of the large Transformer image encoder behind the Segment Anything Model (SAM). SAM is highly effective across a wide range of image segmentation tasks, but its size often restricts deployment in real-world applications due to high computational demands.

To make SAM more accessible and practical to deploy, the researchers use lightweight Vision Transformer (ViT) encoders that retain respectable performance while significantly reducing complexity. The key innovation is a masked image pretraining scheme, referred to as "SAMI", which teaches the smaller encoder to reconstruct the features produced by SAM's large image encoder. After SAMI pretraining, the lightweight encoder is combined with SAM's mask decoder and fine-tuned on the SA-1B dataset to carry out the segment anything task; a minimal sketch of the reconstruction objective appears below.
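
The core of this recipe is a simple feature-regression objective. Below is a minimal PyTorch-style sketch of such a masked feature-reconstruction loss, written from the description above; the `ToyPatchEncoder`, the zero-valued mask tokens, and the single-layer decoder are illustrative assumptions, not the authors' implementation (in the paper the teacher is the frozen SAM ViT-H encoder and the student is a lightweight ViT).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a patch-feature encoder (hypothetical, for illustration only).
class ToyPatchEncoder(nn.Module):
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                        # (B, 3, H, W) -> (B, N, dim)
        return self.proj(x).flatten(2).transpose(1, 2)

def sami_style_loss(images, teacher, student, decoder, mask_ratio=0.75):
    """Masked feature reconstruction: student features at masked positions are
    dropped and replaced by mask tokens, a light decoder predicts a feature for
    every patch, and the prediction is regressed onto the frozen teacher's
    (SAM-like) patch features. For brevity the toy student encodes the full
    image and masking is applied to its output; the loss placement and masking
    details are simplifications of the paper's recipe."""
    with torch.no_grad():
        target = teacher(images)                 # (B, N, D) frozen teacher features

    B, N, D = target.shape
    num_keep = int(N * (1 - mask_ratio))
    keep = torch.rand(B, N, device=images.device).argsort(1)[:, :num_keep]
    keep_idx = keep.unsqueeze(-1).repeat(1, 1, D)

    tokens = student(images)                     # (B, N, D) student features
    visible = torch.gather(tokens, 1, keep_idx)  # keep only the visible patches

    # Zeros stand in for a learned mask token; visible tokens are scattered back.
    full = torch.zeros_like(target).scatter(1, keep_idx, visible)
    pred = decoder(full)                         # (B, N, D) reconstructed features

    return F.mse_loss(pred, target)

# Minimal usage with toy modules (dimensions chosen arbitrarily).
teacher = ToyPatchEncoder().eval()
student = ToyPatchEncoder()
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=1)
loss = sami_style_loss(torch.randn(2, 3, 224, 224), teacher, student, decoder)
loss.backward()
```

The design point to note is that the regression target is the SAM encoder's features rather than raw pixels, which is what transfers SAM's representation into the small encoder.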

The research team conducted extensive evaluations across multiple vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The results show that SAMI pretraining surpasses other masked image pretraining approaches, and that the EfficientSAM models, with their lightweight encoders, achieve significant gains over existing models in terms of the accuracy-complexity trade-off.

EfficientSAM models are especially noteworthy because they can segment objects they were never specifically trained on, a setting referred to as zero-shot instance segmentation. For instance, on the COCO and LVIS datasets, EfficientSAM outperforms other fast, lightweight SAM variants by roughly 4 AP; the sketch below illustrates how such prompted, zero-shot segmentation is typically run.
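
To make the zero-shot setup concrete: boxes from an off-the-shelf detector are used as prompts, and the promptable model predicts a mask per box even though it was never trained on that dataset's mask annotations. The sketch below is schematic; the `encode_image` and `predict_mask` methods are hypothetical placeholders, not the released EfficientSAM API.

```python
def zero_shot_instance_masks(model, image, detector_boxes, score_thresh=0.5):
    """Schematic zero-shot instance segmentation with a promptable SAM-style model.

    `model` is assumed (hypothetically) to expose:
      encode_image(image)          -> image embedding
      predict_mask(embedding, box) -> (mask_logits, predicted_iou)
    The boxes come from a separate detector; the segmenter itself has never
    been trained on this dataset's masks, hence "zero-shot".
    """
    embedding = model.encode_image(image)            # run the heavy image encoder once
    instances = []
    for box in detector_boxes:                       # one box prompt per detected object
        mask_logits, predicted_iou = model.predict_mask(embedding, box)
        if predicted_iou >= score_thresh:            # keep only confident masks
            instances.append({
                "box": box,
                "mask": mask_logits > 0,             # binarize the mask logits
                "score": predicted_iou,
            })
    return instances
```

Because the image embedding is computed once and reused for every prompt, the image encoder dominates the cost, which is exactly the component EfficientSAM shrinks.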

In summary, EfficientSAM offers a more computationally efficient alternative to SAM while maintaining strong accuracy across vision tasks, including zero-shot instance segmentation. The researchers plan to release the code and models to support further development and application of efficient SAM models.

Authors (12)
  1. Yunyang Xiong
  2. Bala Varadarajan
  3. Lemeng Wu
  4. Xiaoyu Xiang
  5. Fanyi Xiao
  6. Chenchen Zhu
  7. Xiaoliang Dai
  8. Dilin Wang
  9. Fei Sun
  10. Forrest Iandola
  11. Raghuraman Krishnamoorthi
  12. Vikas Chandra