MoH: Multi-Head Attention as Mixture-of-Head Attention (2410.11842v2)

Published 15 Oct 2024 in cs.CV, cs.AI, and cs.LG

Abstract: In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
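To make the abstract's summation view concrete, the following is a hedged reconstruction (the notation, e.g. H_i for the i-th head output and W^O_i for the i-th row-block of the output projection, is assumed rather than quoted from the paper): standard multi-head attention concatenates head outputs and applies an output projection, which is equivalent to summing per-head projections, and MoH replaces that plain sum with a routed, weighted sum.

```latex
% Standard multi-head attention rewritten as a sum over heads,
% with W^O_i denoting the i-th row-block of the output projection W^O:
\mathrm{MultiHead}(X) = \mathrm{Concat}(H_1, \dots, H_h)\, W^{O}
                      = \sum_{i=1}^{h} H_i\, W^{O}_{i}

% MoH replaces the plain sum with a weighted sum, where g_i is the router
% score assigned to head i and g_i = 0 for heads the token does not select:
\mathrm{MoH}(X) = \sum_{i=1}^{h} g_i\, H_i\, W^{O}_{i}
```

Because unselected heads receive a gate of zero, their contribution (and, in an efficient implementation, their computation) can be dropped entirely, which is where the inference-time savings come from.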

Summary

  • The paper introduces dynamic attention-head routing that selects the most relevant heads, boosting inference efficiency without compromising accuracy.
  • It employs shared heads with a two-stage routing mechanism to balance common and specialized knowledge across different modalities.
  • The method integrates a load balance loss to evenly distribute activations, yielding improved performance in Vision Transformers, Diffusion models, and Large Language Models.

Multi-Head Attention as Mixture-of-Head Attention: A Comprehensive Overview

The paper "MoH: Multi-Head Attention as Mixture-of-Head Attention" presents a novel enhancement to the multi-head attention mechanism, a foundational element of the Transformer architecture. The authors propose the Mixture-of-Head attention (MoH), introducing several key innovations aimed at enhancing both the efficiency and performance of multi-head attention.

Core Contributions and Methodology

The central thesis of the paper is that not all attention heads in the traditional multi-head attention design contribute equally to the model's output. To address this, the authors adopt a framework reminiscent of Mixture-of-Experts (MoE) models, which leverages sparse activation to optimize computational resources.

  1. Dynamic Attention-Head Routing: Each token selects only the most relevant attention heads, which improves inference efficiency without sacrificing accuracy. Replacing the standard direct summation with a weighted summation also makes the attention mechanism more flexible and unlocks additional performance.
  2. Shared Heads and Two-Stage Routing: A subset of heads is always activated (shared heads) to capture common knowledge across contexts, freeing the remaining heads to specialize. A two-stage routing mechanism dynamically balances the weight assigned to shared and routed heads.
  3. Load Balance Loss: An auxiliary loss keeps activations evenly distributed across attention heads, preventing any subset of heads from becoming over-utilized or under-trained, a common failure mode in MoE models. A minimal sketch combining these three components follows this list.
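To make these components concrete, below is a minimal, self-contained PyTorch sketch written under stated assumptions rather than as the authors' implementation: the module and hyperparameter names (num_shared, top_k), the exact score normalization, and the MoE-style load-balance formula are illustrative choices that follow the summary above, not equations taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoHAttentionSketch(nn.Module):
    """Hedged sketch of Mixture-of-Head (MoH) attention.

    Assumptions (illustrative, not the paper's exact formulation): heads are computed
    as in standard multi-head attention; `num_shared` heads are always active; a router
    picks `top_k` of the remaining routed heads per token; the output is a weighted sum
    of per-head projections; the load-balance loss follows a common MoE-style recipe.
    """

    def __init__(self, dim: int, num_heads: int = 8, num_shared: int = 2, top_k: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.hd = num_heads, dim // num_heads
        self.num_shared, self.top_k = num_shared, top_k
        self.qkv = nn.Linear(dim, 3 * dim)
        # Per-head output projections: the usual W^O sliced into h row-blocks.
        self.w_o = nn.Parameter(torch.randn(num_heads, self.hd, dim) * 0.02)
        self.shared_gate = nn.Linear(dim, num_shared)            # scores for shared heads
        self.router = nn.Linear(dim, num_heads - num_shared)     # scores for routed heads
        self.stage_gate = nn.Linear(dim, 2)                      # stage 1: shared vs. routed weight

    def forward(self, x: torch.Tensor):
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.hd).transpose(1, 2)
        k = k.view(B, N, self.h, self.hd).transpose(1, 2)
        v = v.view(B, N, self.h, self.hd).transpose(1, 2)
        heads = F.scaled_dot_product_attention(q, k, v)          # (B, h, N, head_dim)
        heads = heads.transpose(1, 2)                            # (B, N, h, head_dim)

        # Two-stage routing: stage 1 splits each token's weight between the shared and
        # routed groups, stage 2 distributes it within each group.
        alpha = F.softmax(self.stage_gate(x), dim=-1)            # (B, N, 2)
        shared_w = alpha[..., :1] * F.softmax(self.shared_gate(x), dim=-1)
        routed_probs = F.softmax(self.router(x), dim=-1)
        topk_val, topk_idx = routed_probs.topk(self.top_k, dim=-1)
        routed_w = torch.zeros_like(routed_probs).scatter(-1, topk_idx, topk_val)
        routed_w = alpha[..., 1:] * routed_w
        gates = torch.cat([shared_w, routed_w], dim=-1)          # (B, N, h); zero for unselected heads

        # Weighted summation of per-head projections replaces plain concatenation + W^O.
        # (All heads are computed here for clarity; an efficient implementation would
        # skip the heads whose gate is zero.)
        out = torch.einsum('bnhd,hdk,bnh->bnk', heads, self.w_o, gates)

        # Load-balance loss (assumed MoE-style form): penalize correlation between how
        # often a routed head is selected and its average routing probability.
        frac_selected = (routed_w > 0).float().mean(dim=(0, 1))
        mean_prob = routed_probs.mean(dim=(0, 1))
        lb_loss = (frac_selected * mean_prob).sum() * (self.h - self.num_shared)
        return out, lb_loss
```

During training, the returned auxiliary loss would typically be added to the main objective with a small coefficient, as is common for MoE-style load balancing; the efficiency gains described in the paper come from actually skipping the computation of unselected heads rather than zeroing them out.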

Experimental Validation

The authors evaluate MoH across several well-known model families, including Vision Transformers (ViT), Diffusion Transformers (DiT), and LLMs. The experimental results consistently show that MoH matches or exceeds standard multi-head attention while activating only 50%-90% of the attention heads:

  • Vision Transformers: MoH-ViT achieves high accuracy in image classification tasks, surpassing traditional attention models despite activating fewer attention heads.
  • Diffusion Models: The results on class-conditional image generation tasks confirm that MoH can handle dense prediction tasks efficiently, although a higher percentage of heads might be necessary compared to classification tasks.
  • LLMs: Whether trained from scratch or continue-tuned from pre-trained models, MoH LLMs show notable gains across language benchmarks; MoH-LLaMA3-8B reaches an average accuracy of 64.0% over 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads.

Implications and Future Perspectives

The introduction of MoH could represent a step forward in the design of attention-based models, aligning with ongoing efforts to improve computational efficiency in deep learning. MoH not only refines how resources are allocated during inference but also offers a path for continue-tuning pre-trained multi-head attention models, such as LLaMA3-8B, into MoH models, which broadens its scope of application.

In future work, further exploration into heterogeneous head sizes and the extension of MoH into multimodal or more complex sequential tasks could unlock additional benefits. Given its adaptability, MoH holds promise for both research and industry, aiding in the construction of models that are not only more efficient but potentially more interpretable and adaptive to specific task requirements.

In summary, this paper presents a substantive advancement in the architecture of attention mechanisms. MoH offers a promising alternative to conventional designs by emphasizing adaptability and efficiency without increasing model complexity. This could significantly influence both theoretical and practical advancements in AI, particularly in resource-constrained environments.
