MoH: Multi-Head Attention as Mixture-of-Head Attention (2410.11842v2)

Published 15 Oct 2024 in cs.CV, cs.AI, and cs.LG

Abstract: In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
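To make the abstract's summation view concrete, the following is a hedged reconstruction (the notation, e.g. H_i for the i-th head output and W^O_i for the i-th row-block of the output projection, is assumed rather than quoted from the paper): standard multi-head attention concatenates head outputs and applies an output projection, which is equivalent to summing per-head projections, and MoH replaces that plain sum with a routed, weighted sum.

```latex
% Standard multi-head attention rewritten as a sum over heads,
% with W^O_i denoting the i-th row-block of the output projection W^O:
\mathrm{MultiHead}(X) = \mathrm{Concat}(H_1, \dots, H_h)\, W^{O}
                      = \sum_{i=1}^{h} H_i\, W^{O}_{i}

% MoH replaces the plain sum with a weighted sum, where g_i is the router
% score assigned to head i and g_i = 0 for heads the token does not select:
\mathrm{MoH}(X) = \sum_{i=1}^{h} g_i\, H_i\, W^{O}_{i}
```

Because unselected heads receive a gate of zero, their contribution (and, in an efficient implementation, their computation) can be dropped entirely, which is where the inference-time savings come from.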

Summary

  • The paper introduces dynamic attention-head routing that selects the most relevant heads, boosting inference efficiency without compromising accuracy.
  • It employs shared heads with a two-stage routing mechanism to balance common and specialized knowledge across different modalities.
  • The method integrates a load balance loss to evenly distribute activations, yielding improved performance in Vision Transformers, Diffusion models, and Large Language Models.

Multi-Head Attention as Mixture-of-Head Attention: A Comprehensive Overview

The paper "MoH: Multi-Head Attention as Mixture-of-Head Attention" presents a novel enhancement to the multi-head attention mechanism, a foundational element of the Transformer architecture. The authors propose the Mixture-of-Head attention (MoH), introducing several key innovations aimed at enhancing both the efficiency and performance of multi-head attention.

Core Contributions and Methodology

The central thesis of the paper is that not all attention heads in the traditional multi-head attention design contribute equally to the model's output. To address this, the authors adopt a framework reminiscent of Mixture-of-Experts (MoE) models, which leverages sparse activation to optimize computational resources.

  1. Dynamic Attention-Head Routing: Each token selects only the most relevant attention heads, which improves inference efficiency without sacrificing accuracy. Replacing the standard direct summation with a weighted summation also makes the attention mechanism more flexible and unlocks additional performance.
  2. Shared Heads and Two-Stage Routing: A subset of heads is always activated (shared heads) to capture common knowledge across contexts, freeing the remaining heads to specialize. A two-stage routing mechanism dynamically balances the weight assigned to shared and routed heads.
  3. Load Balance Loss: An auxiliary loss keeps activations evenly distributed across attention heads, preventing any subset of heads from becoming over-utilized or under-trained, a common failure mode in MoE models. A minimal sketch combining these three components follows this list.
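To make these components concrete, below is a minimal, self-contained PyTorch sketch written under stated assumptions rather than as the authors' implementation: the module and hyperparameter names (num_shared, top_k), the exact score normalization, and the MoE-style load-balance formula are illustrative choices that follow the summary above, not equations taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoHAttentionSketch(nn.Module):
    """Hedged sketch of Mixture-of-Head (MoH) attention.

    Assumptions (illustrative, not the paper's exact formulation): heads are computed
    as in standard multi-head attention; `num_shared` heads are always active; a router
    picks `top_k` of the remaining routed heads per token; the output is a weighted sum
    of per-head projections; the load-balance loss follows a common MoE-style recipe.
    """

    def __init__(self, dim: int, num_heads: int = 8, num_shared: int = 2, top_k: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.hd = num_heads, dim // num_heads
        self.num_shared, self.top_k = num_shared, top_k
        self.qkv = nn.Linear(dim, 3 * dim)
        # Per-head output projections: the usual W^O sliced into h row-blocks.
        self.w_o = nn.Parameter(torch.randn(num_heads, self.hd, dim) * 0.02)
        self.shared_gate = nn.Linear(dim, num_shared)            # scores for shared heads
        self.router = nn.Linear(dim, num_heads - num_shared)     # scores for routed heads
        self.stage_gate = nn.Linear(dim, 2)                      # stage 1: shared vs. routed weight

    def forward(self, x: torch.Tensor):
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.hd).transpose(1, 2)
        k = k.view(B, N, self.h, self.hd).transpose(1, 2)
        v = v.view(B, N, self.h, self.hd).transpose(1, 2)
        heads = F.scaled_dot_product_attention(q, k, v)          # (B, h, N, head_dim)
        heads = heads.transpose(1, 2)                            # (B, N, h, head_dim)

        # Two-stage routing: stage 1 splits each token's weight between the shared and
        # routed groups, stage 2 distributes it within each group.
        alpha = F.softmax(self.stage_gate(x), dim=-1)            # (B, N, 2)
        shared_w = alpha[..., :1] * F.softmax(self.shared_gate(x), dim=-1)
        routed_probs = F.softmax(self.router(x), dim=-1)
        topk_val, topk_idx = routed_probs.topk(self.top_k, dim=-1)
        routed_w = torch.zeros_like(routed_probs).scatter(-1, topk_idx, topk_val)
        routed_w = alpha[..., 1:] * routed_w
        gates = torch.cat([shared_w, routed_w], dim=-1)          # (B, N, h); zero for unselected heads

        # Weighted summation of per-head projections replaces plain concatenation + W^O.
        # (All heads are computed here for clarity; an efficient implementation would
        # skip the heads whose gate is zero.)
        out = torch.einsum('bnhd,hdk,bnh->bnk', heads, self.w_o, gates)

        # Load-balance loss (assumed MoE-style form): penalize correlation between how
        # often a routed head is selected and its average routing probability.
        frac_selected = (routed_w > 0).float().mean(dim=(0, 1))
        mean_prob = routed_probs.mean(dim=(0, 1))
        lb_loss = (frac_selected * mean_prob).sum() * (self.h - self.num_shared)
        return out, lb_loss
```

During training, the returned auxiliary loss would typically be added to the main objective with a small coefficient, as is common for MoE-style load balancing; the efficiency gains described in the paper come from actually skipping the computation of unselected heads rather than zeroing them out.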

Experimental Validation

The authors evaluate MoH across several well-known model families, including Vision Transformers (ViT), Diffusion Transformers (DiT), and LLMs. The experimental results consistently show that MoH matches or exceeds standard multi-head attention while activating only 50%-90% of the attention heads:

  • Vision Transformers: MoH-ViT achieves high accuracy in image classification tasks, surpassing traditional attention models despite activating fewer attention heads.
  • Diffusion Models: The results on class-conditional image generation tasks confirm that MoH can handle dense prediction tasks efficiently, although a higher percentage of heads might be necessary compared to classification tasks.
  • LLMs: Whether trained from scratch or continue-tuned from pre-trained models, MoH LLMs show notable gains across language benchmarks; MoH-LLaMA3-8B reaches an average accuracy of 64.0% over 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads.

Implications and Future Perspectives

The introduction of MoH could represent a step forward in the design of attention-based models, aligning with ongoing efforts to improve computational efficiency in deep learning. MoH not only refines how resources are allocated during inference but also offers a path for continue-tuning pre-trained multi-head attention models, such as LLaMA3-8B, into MoH models, which broadens its scope of application.

In future work, further exploration into heterogeneous head sizes and the extension of MoH into multimodal or more complex sequential tasks could unlock additional benefits. Given its adaptability, MoH holds promise for both research and industry, aiding in the construction of models that are not only more efficient but potentially more interpretable and adaptive to specific task requirements.

In summary, this paper presents a substantive advancement in the architecture of attention mechanisms. MoH offers a promising alternative to conventional designs by emphasizing adaptability and efficiency without increasing model complexity. This could significantly influence both theoretical and practical advancements in AI, particularly in resource-constrained environments.
