MoH: Multi-Head Attention as Mixture-of-Head Attention (2410.11842v2)
Abstract: In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while matching or surpassing the previous level of accuracy. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility into the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
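The abstract describes MoH in two steps: rewrite multi-head attention in summation form (each head contributes its own slice of the output projection), then let a per-token router select a subset of heads and weight their contributions, turning the plain summation into a weighted one. Below is a minimal PyTorch sketch of that idea; the module and parameter names (`MoHAttention`, `router`, `heads_per_token`) are illustrative assumptions rather than the authors' released implementation, and details such as attention masking and any routing regularization used during training are omitted.

```python
# Minimal sketch of Mixture-of-Head (MoH) attention as described in the abstract:
# per-token top-k routing over attention heads plus a weighted summation of head
# outputs. Names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoHAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, heads_per_token: int = 6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.heads_per_token = heads_per_token          # k active heads per token
        self.qkv = nn.Linear(dim, 3 * dim)              # fused Q/K/V projection
        self.out_proj = nn.Linear(dim, dim)             # W_O, split per head below
        self.router = nn.Linear(dim, num_heads)         # per-token head scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        H, Hd = self.num_heads, self.head_dim

        # Standard multi-head attention, kept separate per head: (B, H, T, Hd).
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, H, Hd).transpose(1, 2)
        k = k.view(B, T, H, Hd).transpose(1, 2)
        v = v.view(B, T, H, Hd).transpose(1, 2)
        heads = F.scaled_dot_product_attention(q, k, v)  # (B, H, T, Hd)

        # Summation form of MHA: project each head with its slice of W_O and sum.
        # Here the sum is weighted by the router instead of being uniform.
        w_o = self.out_proj.weight.t().view(H, Hd, D)                 # (H, Hd, D)
        per_head_out = torch.einsum("bhtd,hdo->bhto", heads, w_o)     # (B, H, T, D)

        # Router: each token keeps its top-k heads; the kept scores are
        # softmax-normalized and used as summation weights.
        scores = self.router(x)                                       # (B, T, H)
        topk_val, topk_idx = scores.topk(self.heads_per_token, dim=-1)
        weights = torch.zeros_like(scores).scatter_(
            -1, topk_idx, topk_val.softmax(dim=-1))                   # (B, T, H)
        out = (per_head_out * weights.permute(0, 2, 1).unsqueeze(-1)).sum(dim=1)
        return out + self.out_proj.bias                               # (B, T, D)


# Toy usage: activate 6 of 8 heads per token.
x = torch.randn(2, 16, 512)
attn = MoHAttention(dim=512, num_heads=8, heads_per_token=6)
print(attn(x).shape)  # torch.Size([2, 16, 512])
```

Note that this sketch still computes every head and merely zeroes out the unselected ones; an efficient implementation would skip the computation of inactive heads per token, which is where the inference savings described in the abstract would come from.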