
Interactive Multi-Head Self-Attention with Linear Complexity (2402.17507v1)

Published 27 Feb 2024 in cs.CV

Abstract: We propose an efficient interactive method for multi-head self-attention via decomposition. For existing methods using multi-head self-attention, the attention operation of each head is computed independently. However, we show that the interactions between cross-heads of the attention matrix enhance the information flow of the attention operation. Considering that the attention matrix of each head can be seen as a feature of networks, it is beneficial to establish connectivity between them to capture interactions better. However, a straightforward approach to capture the interactions between the cross-heads is computationally prohibitive as the complexity grows substantially with the high dimension of an attention matrix. In this work, we propose an effective method to decompose the attention operation into query- and key-less components. This results in a more manageable size for the attention matrix, specifically for the cross-head interactions. Extensive experimental results show that the proposed cross-head interaction approach performs favorably against existing efficient attention methods and state-of-the-art backbone models.
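The abstract describes the idea only at a high level. The snippet below is a minimal illustrative sketch, not the authors' implementation: it grafts a simple cross-head interaction onto a kernel-based linear attention (in the spirit of "Efficient attention: Attention with linear complexities"), where each head's key-value summary is a small d x d matrix, so mixing those summaries across heads is cheap compared with mixing full N x N attention maps. The class name CrossHeadLinearAttention, the softmax feature maps, and the learned head_mix matrix are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical sketch: kernel-based linear attention with a learned mixing of
# the per-head key-value summaries, so cross-head interaction operates on an
# (H, d, d) tensor instead of on H full (N, N) attention matrices.
import torch
import torch.nn as nn


class CrossHeadLinearAttention(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        self.d = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned cross-head mixing, initialized to identity (no interaction).
        self.head_mix = nn.Parameter(torch.eye(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (b, heads, n, head_dim).
        q, k, v = (t.view(b, n, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        # Non-negative feature maps (one common linear-attention choice).
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        # Per-head key-value summary: O(n * d^2) instead of O(n^2 * d).
        kv = torch.einsum("bhnd,bhne->bhde", k, v)              # (b, h, d, d)
        # Cross-head interaction on the compact summaries.
        kv = torch.einsum("gh,bhde->bgde", self.head_mix, kv)   # mix across heads
        out = torch.einsum("bhnd,bhde->bhne", q, kv)            # (b, h, n, d)
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)          # e.g. a 14x14 token grid, dim 64
    attn = CrossHeadLinearAttention(dim=64, num_heads=8)
    print(attn(x).shape)                 # torch.Size([2, 196, 64])
```

The design point this sketch illustrates is the one the abstract argues for: once attention is decomposed so that each head is summarized by a small matrix, connecting heads becomes affordable, whereas coupling the full per-head attention maps directly would scale with the (high-dimensional) attention matrix itself.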
