VRP-SAM: SAM with Visual Reference Prompt (2402.17726v3)
Abstract: In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to use annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM uses an annotated reference image to understand a specific object and then segments that object in the target image. Notably, the VRP encoder supports a variety of annotation formats for reference images, including point, box, scribble, and mask. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To improve the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieves state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization, segmenting unseen objects and enabling cross-domain segmentation. The source code and models will be available at https://github.com/syp2ysy/VRP-SAM
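The abstract describes an architecture in which a small, trainable VRP encoder converts an annotated reference image into prompt embeddings that a frozen SAM mask decoder can consume in place of its usual point or box prompts. The PyTorch sketch below illustrates that idea only at a schematic level under stated assumptions: all class, argument, and dimension names (VRPEncoder, num_queries, feat_dim, and so on) are hypothetical placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VRPEncoder(nn.Module):
    """Minimal sketch of a visual-reference-prompt encoder.

    Encodes an annotated reference image into a small set of prompt
    embeddings for a frozen SAM mask decoder. All names and dimensions
    here are illustrative assumptions, not the paper's actual API.
    """

    def __init__(self, feat_dim: int = 256, num_queries: int = 50):
        super().__init__()
        # Learnable queries that will attend to reference, then target, features
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.ref_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.tgt_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, ref_feats, ref_annot_mask, tgt_feats):
        # ref_feats / tgt_feats: (B, N, C) patch features from a frozen backbone
        # ref_annot_mask: (B, N) binary mask derived from the point / box /
        #                 scribble / mask annotation on the reference image
        b = ref_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Attend only to annotated reference patches to capture the object
        q, _ = self.ref_attn(q, ref_feats, ref_feats,
                             key_padding_mask=~ref_annot_mask.bool())
        # Refine the queries against the target image features
        q, _ = self.tgt_attn(q, tgt_feats, tgt_feats)
        return q  # prompt embeddings passed to SAM's frozen mask decoder
```

In a full pipeline along these lines, the returned embeddings would stand in for SAM's sparse prompt embeddings at the mask decoder's input, with the image encoder and decoder kept frozen, which is consistent with the abstract's claim of preserving SAM's strengths while adding few learnable parameters.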