SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention (2312.07987v3)

Published 13 Dec 2023 in cs.LG, cs.CL, and cs.NE

Abstract: Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% compute and 27% memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than 3.5% absolute improvements on BLiMP compared to the baseline with an equal compute resource.

Authors (4)
  1. Róbert Csordás (25 papers)
  2. Piotr Piękos (6 papers)
  3. Kazuki Irie (35 papers)
  4. Jürgen Schmidhuber (124 papers)
Citations (6)

Summary

Introduction

Transformers have significantly impacted the field of natural language processing, demonstrating impressive capabilities across various tasks. Despite their success, they have one major limitation: transformer models, especially large ones, require substantial computational power and memory, making them inaccessible to many researchers and institutions. Scaling these models efficiently is an important yet challenging problem. Mixture-of-Experts (MoE) techniques have been considered for improving parameter efficiency, but their use in attention mechanisms has been less explored. This paper introduces SwitchHead, an MoE-based attention method, to alleviate resource demands while maintaining comparable model performance.

Methodology and Results

SwitchHead reduces memory and compute requirements by minimizing the number of attention matrices without sacrificing expressiveness. It applies MoE to the value and output projections, computing fewer attention matrices than traditional Transformers. The method's effectiveness is demonstrated on both small-scale and large language modeling datasets, achieving similar or better performance than parameter-matched baselines while significantly lowering computational cost. Analyses of attention maps and expert selection indicate a substantial reduction in redundancy. A minimal code sketch of this routing idea follows.
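
To make the routing idea concrete, here is a minimal, illustrative PyTorch sketch, not the authors' released code: it assumes a single head whose value and output projections are mixtures over a small set of experts selected per token by a sigmoid router, while the query and key projections are shared, so only one attention matrix is computed per head. Class and parameter names (MoEAttentionSketch, n_experts, k) are hypothetical.

```python
import torch
import torch.nn as nn


class MoEAttentionSketch(nn.Module):
    """Illustrative sketch of MoE attention: expert-routed value/output
    projections with a single shared attention matrix per head."""

    def __init__(self, d_model: int, d_head: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # Query/key projections are shared by all experts.
        self.q_proj = nn.Linear(d_model, d_head, bias=False)
        self.k_proj = nn.Linear(d_model, d_head, bias=False)
        # One value and one output projection per expert (hypothetical layout).
        self.v_experts = nn.Parameter(torch.randn(n_experts, d_model, d_head) * 0.02)
        self.o_experts = nn.Parameter(torch.randn(n_experts, d_head, d_model) * 0.02)
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); causal masking omitted for brevity.
        q, kk = self.q_proj(x), self.k_proj(x)
        # Non-competitive routing: sigmoid scores, keep the top-k experts per token.
        scores = torch.sigmoid(self.router(x))                       # (B, T, E)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)          # (B, T, k)
        gates = torch.zeros_like(scores).scatter(-1, topk_idx, topk_scores)
        # Mixture of value projections: gate-weighted sum over selected experts.
        v = torch.einsum('bte,edh,btd->bth', gates, self.v_experts, x)
        # A single attention matrix, shared by all experts of this head.
        attn = torch.softmax(q @ kk.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        ctx = attn @ v                                                # (B, T, d_head)
        # Mixture of output projections, gated the same way.
        return torch.einsum('bte,ehd,bth->btd', gates, self.o_experts, ctx)
```

For example, `MoEAttentionSketch(d_model=256, d_head=64, n_experts=4)(torch.randn(2, 10, 256))` returns a tensor of shape (2, 10, 256) while materializing only one (10, 10) attention matrix, regardless of the number of experts.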

Discussion

The paper highlights the potential of reducing the computational burden of Transformer models with MoE techniques, specifically by using MoE for selected parts of the attention mechanism. Additionally, the research covers various datasets and model sizes, confirming the broad applicability of SwitchHead. Furthermore, the combined MoE approach for both the MLP and attention layers (termed "SwitchAll") yields a fully MoE-based Transformer model that is competitive with traditional dense models. These advancements offer a path toward more accessible and efficient Transformer models.
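
As a rough illustration of the feedforward half of such a "SwitchAll"-style block, the sketch below (hypothetical names, not the paper's implementation) routes each token to its top-k feedforward experts with a sigmoid gate; stacking it with an MoE attention layer like the one sketched above gives a block in which every dense sublayer has been replaced by an MoE counterpart.

```python
import torch
import torch.nn as nn


class MoEFeedforwardSketch(nn.Module):
    """Illustrative sketch of a top-k routed MoE feedforward layer."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = torch.sigmoid(self.router(x))               # (B, T, E)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # (B, T, k)
        gates = torch.zeros_like(scores).scatter(-1, topk_idx, topk_scores)
        # Dense-equivalent formulation for clarity; an efficient implementation
        # would dispatch each token only to its selected experts.
        hidden = torch.relu(torch.einsum('btd,edf->btef', x, self.w1))
        out = torch.einsum('btef,efd->bted', hidden, self.w2)
        return torch.einsum('bte,bted->btd', gates, out)
```

The gating pattern mirrors the attention sketch, so the same routing machinery can in principle be reused for both sublayers of the block.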

Conclusion

The paper proposes SwitchHead, a novel, resource-efficient attention mechanism built from MoE layers, and demonstrates comparable language modeling performance with reduced compute and memory requirements. This approach opens the door to training and inference of powerful LLMs on less resource-intensive infrastructure. The findings suggest potential for scaling up neural networks with MoE models, which could democratize access to advanced AI capabilities. The provided open-source code enables further exploration and adoption by the wider research community.
