Agent Attention: On the Integration of Softmax and Linear Attention (2312.08874v3)

Published 14 Dec 2023 in cs.CV

Abstract: The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, the Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as the agent for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast the information back to $Q$. Given that the number of agent tokens can be designed to be much smaller than the number of query tokens, the agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention.

Understanding Agent Attention in Transformers

Transformers are a class of deep learning models that have revolutionized natural language processing and made significant inroads in computer vision. Their power comes primarily from the attention mechanism, which lets the model focus on different parts of the input when making predictions. However, the standard global attention mechanism is computationally expensive: its cost grows quadratically with the number of input tokens, which becomes prohibitive for long token sequences such as those produced by high-resolution images.

Towards Efficient Attention Mechanisms

This paper introduces a novel attention paradigm, "Agent Attention", to address the computational cost of global Softmax-based attention in Transformers while retaining its representation power. Softmax attention computes the similarity between every query-key pair, which results in quadratic complexity in the number of tokens. Agent Attention instead introduces a small additional set of tokens, termed "agent tokens", that serve as intermediaries: they aggregate information from the keys and values and then broadcast it back to the queries.
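To make the efficiency argument concrete, the sketch below counts the multiply-adds spent on the attention matrices in each scheme. It is a rough back-of-the-envelope comparison, not an analysis taken from the paper, and the token count N, agent count n, and head dimension d are illustrative values.

```python
# Rough multiply-add counts for the attention matrices only (projections and
# the softmax itself are ignored). N input tokens, n agent tokens (n << N),
# head dimension d. The values below are illustrative, not from the paper.

def softmax_attention_ops(N: int, d: int) -> int:
    # Q @ K^T and scores @ V each cost roughly N * N * d multiply-adds.
    return 2 * N * N * d

def agent_attention_ops(N: int, n: int, d: int) -> int:
    # A @ K^T, scores @ V, Q @ A^T, scores @ agent_values: four N * n * d products.
    return 4 * N * n * d

N, n, d = 4096, 49, 64  # e.g. a 64x64 token grid with a 7x7 grid of agent tokens
print(softmax_attention_ops(N, d))                                  # 2_147_483_648
print(agent_attention_ops(N, n, d))                                 # 51_380_224
print(softmax_attention_ops(N, d) / agent_attention_ops(N, n, d))   # ~41.8x fewer ops
```

The ratio works out to N / (2n), so the savings grow with sequence length, which is consistent with the pronounced gains reported in high-resolution settings.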

Agent Attention Mechanics

Agent Attention is structured as a quadruple (Q, A, K, V), adding a set of agent tokens A to the conventional attention module. The architecture performs two sequential Softmax attention computations: first, the agent tokens act as queries in an attention between A and K, aggregating information from the values V; second, the original queries Q attend to the agent tokens and retrieve the aggregated features. Because the number of agent tokens can be much smaller than the number of queries, this yields significant computational savings while preserving global context modeling capability.
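The two-step computation can be sketched in a few lines of PyTorch. This is a minimal single-head illustration written directly from the description above, not the official implementation from the linked repository; the tensor shapes, the shared scaling factor, and the choice of forming agent tokens by subsampling the queries are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, agent, scale=None):
    """Minimal single-head agent attention sketch (illustrative, not the official code).

    q:     (B, N, d) query tokens
    k, v:  (B, N, d) key / value tokens
    agent: (B, n, d) agent tokens, with n << N
    """
    scale = scale or q.shape[-1] ** -0.5

    # Step 1: agent aggregation -- agent tokens attend to the keys and pool the values.
    agent_scores = F.softmax(agent @ k.transpose(-2, -1) * scale, dim=-1)  # (B, n, N)
    agent_values = agent_scores @ v                                        # (B, n, d)

    # Step 2: agent broadcast -- queries attend to the agent tokens to read the pooled features.
    query_scores = F.softmax(q @ agent.transpose(-2, -1) * scale, dim=-1)  # (B, N, n)
    return query_scores @ agent_values                                     # (B, N, d)

# Usage: 4096 tokens but only 49 agents, so no N x N attention matrix is ever formed.
B, N, n, d = 2, 4096, 49, 64
q, k, v = (torch.randn(B, N, d) for _ in range(3))
agent = q[:, torch.linspace(0, N - 1, n).long()]  # agents subsampled from the queries here
out = agent_attention(q, k, v, agent)             # (B, N, d)
```

One natural choice is to pool or subsample the query tokens to obtain the agents, as done here; any small set of pooled or learned tokens fits the same two-step structure.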

Integration with Linear Attention

Interestingly, Agent Attention is shown to be equivalent to a generalized form of linear attention, which has historically been more efficient but less expressive than Softmax attention. Through this equivalence, the new module inherits the expressiveness of Softmax attention and the efficiency of linear attention, a combination that is empirically demonstrated across a variety of vision tasks.
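Written out for a single head and ignoring scaling factors, the two steps described above compose into a single expression that makes the connection explicit. This is a sketch based on the description in this summary, with $\sigma$ denoting the row-wise Softmax:

$O = \sigma(QA^{T})\,\bigl(\sigma(AK^{T})\,V\bigr)$

This has the same two-matrix-product structure as linear attention, $O = \phi(Q)\,\bigl(\psi(K)^{T}V\bigr)$, with the agent-mediated maps $\phi(Q) = \sigma(QA^{T})$ and $\psi(K)^{T} = \sigma(AK^{T})$ playing the role of the kernel feature maps, which is why the overall cost stays linear in the number of tokens.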

Empirical Verification

The effectiveness of Agent Attention has been evaluated across a range of vision tasks, including image classification, object detection, semantic segmentation, and image generation. In each case, the mechanism reduced computational cost and, in some settings, improved accuracy over traditional attention mechanisms. Notably, when incorporated into large diffusion models such as Stable Diffusion, it accelerated image generation and enhanced image quality without any additional training.

Implications for Future Applications

Because Agent Attention combines linear complexity in the number of tokens with strong representational capacity, it is well suited to tasks involving long sequences of data, such as video processing and multimodal learning. In that sense, it aligns with the broader trajectory of making Transformer models more scalable and applicable to increasingly complex, data-intensive domains.

Authors (8)
  1. Dongchen Han (12 papers)
  2. Tianzhu Ye (9 papers)
  3. Yizeng Han (33 papers)
  4. Zhuofan Xia (12 papers)
  5. Shiji Song (103 papers)
  6. Gao Huang (178 papers)
  7. Siyuan Pan (7 papers)
  8. Pengfei Wan (86 papers)
Citations (32)