HRSAM: Efficient Interactive Segmentation in High-Resolution Images
Abstract: The Segment Anything Model (SAM) has advanced interactive segmentation but is limited by its high computational cost on high-resolution images. Inputs must therefore be downsampled to fit GPU constraints, sacrificing the fine-grained details needed for high-precision interactive segmentation. To address SAM's limitations, we focus on visual length extrapolation and propose a lightweight model named HRSAM. Extrapolation enables HRSAM, trained on low resolutions, to generalize to high resolutions. We begin by identifying the link between extrapolation and attention scores, which leads us to base HRSAM on Swin attention. We then introduce the Flexible Local Attention (FLA) framework, which uses CUDA-optimized Efficient Memory Attention to accelerate HRSAM. Within FLA, we implement Flash Swin attention, achieving a speedup of over 35% compared to traditional Swin attention, and propose a KV-only padding mechanism to enhance extrapolation. We also develop the Cycle-scan module, which uses State Space models to efficiently expand HRSAM's receptive field. We further develop HRSAM++ within FLA by adding an anchor map, which provides multi-scale data augmentation for extrapolation and a larger receptive field at a slight computational cost. Experiments show that, under standard training, HRSAMs surpass the previous SOTA with only 38% of the latency. With SAM-distillation, extrapolation enables HRSAMs to outperform the teacher model at lower latency. Further finetuning achieves performance significantly exceeding the previous SOTA.
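To make the cost argument concrete, the sketch below counts the query-key pairs scored by a single attention layer under global attention versus non-overlapping window (local) attention. It is a rough illustration, not HRSAM's implementation: the function name `attention_pairs`, the 16-pixel patch size, and the 16-token window are assumptions chosen for the example, and shifted windows, multiple heads, and KV-only padding are ignored.

```python
def attention_pairs(height, width, patch=16, window=None):
    """Count query-key pairs scored by one attention layer.

    window=None models global attention over all tokens. Otherwise each
    token attends only inside its own non-overlapping window x window
    block of tokens. All sizes here are illustrative assumptions.
    """
    h, w = height // patch, width // patch  # token grid after patching
    n = h * w
    if window is None:
        return n * n  # every token scores every token
    # Ceil-divide so partially filled border windows count as full ones.
    windows_h = -(-h // window)
    windows_w = -(-w // window)
    tokens_per_window = window * window
    return windows_h * windows_w * tokens_per_window ** 2


if __name__ == "__main__":
    for side in (1024, 2048, 4096):
        global_pairs = attention_pairs(side, side)
        local_pairs = attention_pairs(side, side, window=16)
        print(side, global_pairs, local_pairs, global_pairs // local_pairs)
```

Under these assumptions, doubling the image side multiplies the global count by 16 but the windowed count by only 4, which is the scaling gap that motivates building on local attention for high-resolution inputs.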