
BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model (2401.02317v4)

Published 4 Jan 2024 in cs.CV

Abstract: In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits performance degradation when faced with datasets of varying image sizes. Previous approaches tend to resize images to a fixed size or adopt structural modifications, hindering the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning requires completely retraining the model, which is costly and impractical for deployment in downstream tasks. In this paper, we reformulate this issue as a length extrapolation problem, where the token sequence length varies while the patch size remains consistent across images of different sizes. To this end, we propose the Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structural modifications. Firstly, we introduce a new scaling factor that keeps the magnitude of the attention layer's dot-product values consistent as the token sequence length changes. Secondly, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. Our BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to significantly mitigate performance degradation in the zero-shot setting and achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously. Code is available at https://github.com/zongzi13545329/BA-SAM
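The two mechanisms named in the abstract — a length-aware scaling factor for the attention dot product and a distance-based bias mask favoring nearby tokens — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific scaling formula and the `slope` parameter are assumptions chosen to show the idea (the bias term here follows the ALiBi-style linear-distance penalty the mask is reminiscent of).

```python
import numpy as np

def ba_sam_style_attention(q, k, v, slope=0.1):
    """Sketch of single-head attention with (a) a scaling factor that
    accounts for sequence length n, so dot-product magnitudes stay
    comparable as n grows, and (b) a bias-mode mask that linearly
    penalizes attention to distant tokens. Both choices are
    illustrative, not the paper's exact formulation."""
    n, d = q.shape
    # (a) Length-aware scaling: divide by sqrt(d) as usual, plus a
    # log-in-n correction (assumed form) so magnitudes do not drift with n.
    scale = 1.0 / (np.sqrt(d) * np.log2(max(n, 2)))
    scores = (q @ k.T) * scale                      # (n, n)
    # (b) Bias-mode mask: penalty proportional to token distance,
    # so each token prioritizes its neighbors.
    idx = np.arange(n)
    bias = -slope * np.abs(idx[:, None] - idx[None, :])
    scores = scores + bias
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # (n, d_v)
```

Because the bias depends only on token distance, the mask needs no retraining when the sequence length changes — which is what makes this formulation attractive for handling varying image resolutions without structural changes.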

