BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model (2401.02317v4)
Abstract: In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits performance degradation when faced with datasets of varying image sizes. Previous approaches tend to resize images to a fixed size or to modify the architecture, which hinders the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning requires completely retraining the model, which is costly and impractical for deployment in downstream tasks. We reformulate this issue as a length extrapolation problem, in which the token sequence length varies while the patch size remains consistent across images of different sizes. To this end, we propose the Scalable Bias-Mode Attention Mask (BA-SAM), which enhances SAM's adaptability to varying image resolutions without requiring structural modifications. First, we introduce a new scaling factor that keeps the magnitude of the attention layer's dot-product values consistent as the token sequence length changes. Second, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, shows that it significantly mitigates performance degradation in the zero-shot setting and achieves state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously. Code is available at https://github.com/zongzi13545329/BA-SAM
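The two ingredients described above (a length-aware scaling factor and a distance-based bias mask) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact scaling formula, the reference length of 512 tokens, and the bias `slope` value are assumptions for the example.

```python
import numpy as np

def ba_sam_attention(q, k, v, slope=0.05, ref_len=512):
    """Hypothetical sketch of attention with a length-aware scale
    and a bias-mode mask.

    q, k, v: (n, d) arrays of queries, keys, and values for n tokens.
    slope: assumed strength of the distance penalty (illustrative).
    ref_len: assumed training-time token count (illustrative).
    """
    n, d = q.shape
    # Length-aware scaling: adjust the usual 1/sqrt(d) factor with log(n)
    # so that dot-product magnitudes stay consistent as the token count
    # departs from the reference length (exact factor is an assumption).
    scale = np.log(n) / (np.sqrt(d) * np.log(ref_len))
    scores = (q @ k.T) * scale
    # Bias-mode mask: subtract a penalty that grows with token distance,
    # so each token prioritizes neighboring information over untrained
    # distant positions (an ALiBi-style linear bias).
    idx = np.arange(n)
    scores = scores - slope * np.abs(idx[:, None] - idx[None, :])
    # Standard softmax over keys, then weighted sum of values.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because the bias depends only on token distance and not on learned weights, the mask extends to any sequence length at inference time, which is what makes the length-extrapolation framing work without retraining.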