Advancing Vision Transformers with Group-Mix Attention (2311.15157v1)

Published 26 Nov 2023 in cs.CV

Abstract: Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as Query-Key-Value computation. However, the attention map generated from the Query and Key captures only token-to-token correlations at one single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity. Thereby, we propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes. To this end, GMA splits the Query, Key, and Value into segments uniformly and performs different group aggregations to generate group proxies. The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value. Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which achieves state-of-the-art performance in image classification, object detection, and semantic segmentation with fewer parameters than existing models. For instance, GroupMixFormer-L (with 70.3M parameters and 384² input) attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.
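The abstract's description of GMA suggests a simple overall structure: split Query, Key, and Value into uniform channel segments, aggregate some segments into group proxies, and run standard attention over the resulting token/proxy mixture. Below is a minimal PyTorch sketch of that idea. The choice of depthwise convolutions with kernel sizes 3/5/7 as the group aggregators, and all class and argument names (GroupMixAttention, group_kernels), are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GroupMixAttention(nn.Module):
    """Minimal sketch of Group-Mix Attention (GMA), under assumptions.

    Q, K, and V are each split uniformly into channel segments. One
    segment keeps individual tokens; the others are passed through
    group aggregators (depthwise convolutions of different kernel
    sizes here, an assumption for illustration) to form group proxies.
    Standard multi-head attention then runs over the mixture.
    """
    def __init__(self, dim, num_heads=8, group_kernels=(3, 5, 7)):
        super().__init__()
        assert dim % (len(group_kernels) + 1) == 0
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.seg_dim = dim // (len(group_kernels) + 1)
        self.qkv = nn.Linear(dim, dim * 3)
        # One depthwise-conv aggregator per group size; the remaining
        # segment is left untouched to keep token-level granularity.
        self.aggregators = nn.ModuleList([
            nn.Conv2d(self.seg_dim, self.seg_dim, k, padding=k // 2,
                      groups=self.seg_dim)
            for k in group_kernels
        ])
        self.proj = nn.Linear(dim, dim)

    def _mix(self, x, H, W):
        # x: (B, N, C). Split into segments; aggregate all but the first.
        segs = x.split(self.seg_dim, dim=-1)
        mixed = [segs[0]]  # token-level segment, kept as-is
        for seg, agg in zip(segs[1:], self.aggregators):
            B, N, C = seg.shape
            seg = seg.transpose(1, 2).reshape(B, C, H, W)
            seg = agg(seg)  # group proxies via local aggregation
            mixed.append(seg.reshape(B, C, N).transpose(1, 2))
        return torch.cat(mixed, dim=-1)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Mix tokens and group proxies in Q, K, and V alike.
        q, k, v = (self._mix(t, H, W) for t in (q, k, v))
        # Standard scaled-dot-product attention over the mixtures.
        def heads(t):
            return t.reshape(B, N, self.num_heads,
                             C // self.num_heads).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Example: a 14x14 token grid with 64 channels.
# x = torch.randn(2, 196, 64)
# out = GroupMixAttention(dim=64)(x, H=14, W=14)  # -> (2, 196, 64)
```

Keeping one segment unaggregated preserves token-to-token correlations, while the differently sized aggregator branches contribute the token-to-group and group-to-group correlations the abstract describes.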

Authors (7)
  1. Chongjian Ge (23 papers)
  2. Xiaohan Ding (41 papers)
  3. Zhan Tong (16 papers)
  4. Li Yuan (141 papers)
  5. Jiangliu Wang (14 papers)
  6. Yibing Song (65 papers)
  7. Ping Luo (340 papers)
Citations (12)