TinySAM: Pushing the Envelope for Efficient Segment Anything Model (2312.13789v2)
Abstract: Recently, the segment anything model (SAM) has shown powerful segmentation capability and has drawn great attention in the computer vision field. Numerous follow-up works have developed various applications based on the pretrained SAM and achieved impressive performance on downstream vision tasks. However, SAM consists of heavy architectures and requires massive computational capacity, which hinders its further application on computation-constrained edge devices. To this end, in this paper we propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining strong zero-shot performance. We first propose a full-stage knowledge distillation method with hard prompt sampling and hard mask weighting strategies to distill a lightweight student model. We also adapt post-training quantization to the promptable segmentation task to further reduce the computational cost. Moreover, a hierarchical segmenting-everything strategy is proposed that accelerates everything-mode inference by $2\times$ with almost no performance degradation. With all these proposed methods, our TinySAM reduces computation by orders of magnitude and pushes the envelope for the efficient segment anything task. Extensive experiments on various zero-shot transfer tasks demonstrate the significant advantage of TinySAM over counterpart methods. Pre-trained models and code are available at https://github.com/xinghaochen/TinySAM and https://gitee.com/mindspore/models/tree/master/research/cv/TinySAM.
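To make the hard mask weighting idea concrete, here is a minimal sketch of a mask-level weighted distillation loss. The weighting rule below (up-weighting masks the student currently imitates worst) is an illustrative assumption, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def hard_mask_weighted_distill_loss(student_logits, teacher_logits):
    """Sketch of hard-mask-weighted distillation.

    Both tensors hold raw mask logits of shape (num_masks, H, W) for the
    same prompts. The per-mask weighting scheme here is an assumption
    for illustration, not TinySAM's exact strategy.
    """
    # Per-pixel distillation error between student and teacher logits.
    per_pixel = F.mse_loss(student_logits, teacher_logits, reduction="none")
    per_mask = per_pixel.flatten(1).mean(dim=1)  # (num_masks,)

    # "Hard" masks (largest current error) receive larger weights, so
    # the loss concentrates on the predictions the student gets wrong.
    weights = (per_mask / (per_mask.mean() + 1e-8)).detach()
    return (weights * per_mask).mean()
```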
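The post-training quantization component can likewise be pictured with a toy fake-quantization routine. The max-based scale calibration below is a deliberate simplification; the paper adapts PTQ to promptable segmentation with more careful scale optimization, which is not reproduced here:

```python
import torch

def fake_quantize(x, num_bits=8):
    """Symmetric uniform post-training quantization sketch.

    Uses plain max calibration to pick the scale, then rounds and
    clamps to the signed integer range, returning the dequantized
    ("fake-quant") tensor for simulation purposes.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax                      # calibration step
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                                  # dequantized values
```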
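Finally, the hierarchical segmenting-everything strategy replaces one dense point grid with a coarse-to-fine pass. The sketch below shows second-round prompt filtering, under the assumption that confident first-round masks are unioned into a boolean coverage map; the grid size is an illustrative choice, not the paper's setting:

```python
import numpy as np

def second_round_points(covered, grid_side=32):
    """Return dense-grid point prompts that fall on uncovered pixels.

    `covered` is an (H, W) boolean union of confident first-round masks.
    Skipping already-covered grid points avoids redundant prompts and is
    what yields the speedup in everything-mode inference.
    """
    h, w = covered.shape
    ys = np.linspace(0, h - 1, grid_side).astype(int)
    xs = np.linspace(0, w - 1, grid_side).astype(int)
    return [(int(x), int(y)) for y in ys for x in xs if not covered[y, x]]
```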
Authors: Han Shu, Wenshuo Li, Yehui Tang, Yiman Zhang, Yihao Chen, Houqiang Li, Yunhe Wang, Xinghao Chen