ACC-ViT: Atrous Convolution's Comeback in Vision Transformers (2403.04200v1)

Published 7 Mar 2024 in cs.CV

Abstract: Transformers have risen to become the state-of-the-art vision architectures through innovations in attention mechanisms inspired by visual perception. At present, two classes of attention prevail in vision transformers: regional and sparse attention. The former bounds pixel interactions within a region; the latter spreads them across sparse grids. Their opposing natures have resulted in a dilemma between preserving hierarchical relations and attaining a global context. In this work, taking inspiration from atrous convolution, we introduce Atrous Attention, a fusion of regional and sparse attention that can adaptively consolidate both local and global information while maintaining hierarchical relations. As a further tribute to atrous convolution, we redesign the ubiquitous inverted residual convolution blocks with atrous convolution. Finally, we propose a generalized, hybrid vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks. Our tiny version achieves $\sim 84\%$ accuracy on ImageNet-1K with fewer than $28.5$ million parameters, a $0.42\%$ improvement over the state-of-the-art MaxViT while having $8.4\%$ fewer parameters. In addition, we have investigated the efficacy of the ACC-ViT backbone under different evaluation settings, such as finetuning, linear probing, and zero-shot learning, on tasks involving medical image analysis, object detection, and language-image contrastive learning. ACC-ViT is therefore a strong vision backbone that is also competitive in mobile-scale versions, making it ideal for niche applications with small datasets.
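
The atrous-attention idea lends itself to a compact illustration. The sketch below (PyTorch) is a minimal, assumption-laden rendering of the mechanism the abstract describes: the feature map is sampled at several dilation rates, self-attention runs over each dilated grid, and the branches are fused with learned per-branch gates. The class name, the gating scheme, and the nearest-neighbor upsampling are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtrousAttentionSketch(nn.Module):
    """Hypothetical sketch of atrous (dilated) attention; not the paper's code."""
    def __init__(self, dim, num_heads=4, dilations=(1, 2, 3)):
        super().__init__()
        self.dilations = dilations
        # one attention module shared across branches, for brevity
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # learned gates that adaptively weight local vs. global branches
        self.gate = nn.Linear(dim, len(dilations))

    def forward(self, x):                             # x: (B, C, H, W)
        B, C, H, W = x.shape
        branches = []
        for d in self.dilations:
            xs = x[:, :, ::d, ::d]                    # atrous sampling, stride d
            h, w = xs.shape[2], xs.shape[3]
            tokens = xs.flatten(2).transpose(1, 2)    # (B, h*w, C)
            out, _ = self.attn(tokens, tokens, tokens)
            out = out.transpose(1, 2).reshape(B, C, h, w)
            # bring every branch back to full resolution before fusion
            branches.append(F.interpolate(out, size=(H, W), mode="nearest"))
        # adaptive fusion: softmax gates computed from global average pooling
        g = self.gate(x.mean(dim=(2, 3))).softmax(dim=-1)   # (B, num_branches)
        return sum(g[:, i, None, None, None] * b for i, b in enumerate(branches))

# quick shape check
x = torch.randn(2, 64, 28, 28)
y = AtrousAttentionSketch(dim=64)(x)                  # -> torch.Size([2, 64, 28, 28])
```

Under this reading, a dilation of 1 recovers dense regional attention, while larger dilations reproduce sparse-grid attention over progressively wider receptive fields, which is how the two attention classes can be consolidated in a single module.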

References (77)
  1. Dataset of breast ultrasound images. Data in Brief, 28:104863, 2020.
  2. Flamingo: a Visual Language Model for Few-Shot Learning. ArXiv, abs/2204.14198, 2022.
  3. Layer Normalization. 2016.
  4. RegionViT: Regional-to-Local Attention for Vision Transformers. 2021a.
  5. MMDetection: Open MMLab Detection Toolbox and Benchmark. 2019.
  6. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. 2016.
  7. Rethinking Atrous Convolution for Semantic Image Segmentation. 2017.
  8. ProxIQA: A Proxy Approach to Perceptual Optimization of Learned Image Compression. IEEE Transactions on Image Processing, 30:360–373, 2021b.
  9. AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. 2022a.
  10. A Generalist Framework for Panoptic Segmentation of Images and Videos. 2022b.
  11. Mobile-Former: Bridging MobileNet and Transformer. 2021c.
  12. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. 2021.
  13. EyePACS: An Adaptable Telemedicine System for Diabetic Retinopathy Screening. Journal of Diabetes Science and Technology, 3(3):509–516, 2009.
  14. CoAtNet: Marrying Convolution and Attention for All Data Sizes. 2021.
  15. Scaling Vision Transformers to 22 Billion Parameters. 2023.
  16. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  17. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 4171–4186, Stroudsburg, PA, USA, 2019. Association for Computational Linguistics.
  18. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, pages 8780–8794. Curran Associates, Inc., 2021.
  19. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.
  20. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021.
  21. LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference. 2021.
  22. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
  23. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.
  24. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022.
  25. A Real-Time Algorithm for Signal Analysis with the Help of the Wavelet Transform. pages 286–297. 1990.
  26. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. 2017.
  27. Squeeze-and-Excitation Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7132–7141. IEEE, 2018.
  28. Atrous Pyramid Transformer with Spectral Convolution for Image Inpainting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4674–4683, New York, NY, USA, 2022. ACM.
  29. ACC-UNet: A Completely Convolutional UNet Model for the 2020s. pages 692–702. 2023.
  30. Visual Prompt Tuning. In European Conference on Computer Vision (ECCV), 2022.
  31. Vision Transformer-Based Feature Extraction for Generalized Zero-Shot Learning. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  32. Segment Anything. 2023.
  33. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2012.
  34. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, pages 3744–3753. PMLR, 2019.
  35. ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. 2022.
  36. COMISR: Compression-Informed Video Super-Resolution. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2523–2532. IEEE, 2021a.
  37. LocalViT: Bringing Locality to Vision Transformers. 2021b.
  38. Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944. IEEE, 2017.
  39. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. 2023.
  40. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021.
  41. A ConvNet for the 2020s. 2022.
  42. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. 2017.
  43. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. 2021.
  44. Separable Self-attention for Mobile Vision Transformers. 2022.
  45. Action-Conditional Video Prediction using Deep Networks in Atari Games. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2015.
  46. AI in Medical Imaging Informatics: Current Challenges and Future Directions. IEEE Journal of Biomedical and Health Informatics, 24(7):1837–1857, 2020.
  47. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  48. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. 2020.
  49. Learning Transferable Visual Models From Natural Language Supervision. 2021.
  50. Alex Rogozhnikov. Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation. In International Conference on Learning Representations, 2022.
  51. U-Net: Convolutional Networks for Biomedical Image Segmentation. pages 234–241. 2015.
  52. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. 2016.
  53. Self-Attention with Relative Position Representations. 2018.
  54. Generative Modeling by Estimating Gradients of the Data Distribution. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
  55. Going Deeper With Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  56. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.
  57. TorchVision maintainers and contributors. TorchVision: PyTorch’s Computer Vision library. https://github.com/pytorch/vision, 2016.
  58. Training data-efficient image transformers & distillation through attention. 2020.
  59. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1):180161, 2018.
  60. MaxViT: Multi-Axis Vision Transformer. 2022.
  61. Attention is All you Need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  62. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 2021a.
  63. CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention. 2021b.
  64. CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention. 2023.
  65. Rich features for perceptual quality assessment of UGC videos. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13430–13439. IEEE, 2021c.
  66. Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers. 2023.
  67. ResNet strikes back: An improved training procedure in timm. 2021.
  68. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. 2023.
  69. Vision Transformer with Deformable Attention. 2022.
  70. Early Convolutions Help Transformers See Better. 2021.
  71. MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models. 2022a.
  72. Focal Self-attention for Local-Global Interactions in Vision Transformers. 2021.
  73. Focal Modulation Networks. 2022b.
  74. MetaFormer Is Actually What You Need for Vision. 2021.
  75. Graph Transformer Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
  76. The Sound of Pixels. 2018.
  77. ConvNets vs. Transformers: Whose Visual Representations are More Transferable? In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2230–2238. IEEE, 2021.
Authors (4)
  1. Nabil Ibtehaz (18 papers)
  2. Ning Yan (20 papers)
  3. Masood Mortazavi (11 papers)
  4. Daisuke Kihara (16 papers)
Citations (2)