SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention (2312.07987v3)
Abstract: Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for the feedforward layers. Previous attempts to extend MoE to the self-attention layer fail to match the performance of a parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that reduces both the compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M-parameter model trained on C4, SwitchHead matches the perplexity of the standard model with only 44% of the compute and 27% of the memory. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., it achieves a more than 3.5% absolute improvement on BLiMP over the baseline trained with equal compute.
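The abstract summarizes the mechanism only at a high level. The sketch below illustrates one way such an MoE attention layer can be organized: a small number of attention heads with dense query/key projections, plus per-token routing over small banks of expert value and output projections. This is a minimal illustration, not the authors' reference implementation; all names (`MoEAttentionSketch`, `n_experts`, `top_k`) are hypothetical, and for brevity it evaluates every expert and zeroes the non-selected gates, whereas a compute-saving implementation would gather and evaluate only the selected experts.

```python
# Illustrative sketch of MoE-style attention in the spirit of SwitchHead:
# few attention heads, each with dense Q/K projections, while the value and
# output projections are mixtures of experts selected per token by a router.
# NOT the authors' implementation; names and details are assumptions.
import torch
import torch.nn as nn


class MoEAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_head: int,
                 n_experts: int, top_k: int = 1):
        super().__init__()
        self.n_heads, self.d_head, self.top_k = n_heads, d_head, top_k
        # Dense query/key projections: one attention matrix per (reduced) head.
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Per-head expert banks for the value and output projections.
        self.v_experts = nn.Parameter(torch.randn(n_heads, n_experts, d_model, d_head) * 0.02)
        self.o_experts = nn.Parameter(torch.randn(n_heads, n_experts, d_head, d_model) * 0.02)
        # Router: per token and per head, a score for every expert.
        self.router = nn.Linear(d_model, n_heads * n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head)

        # Non-competitive (sigmoid) routing; keep only the top-k experts per head.
        scores = torch.sigmoid(self.router(x)).view(B, T, self.n_heads, -1)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)          # (B, T, H, K)
        gates = torch.zeros_like(scores).scatter(-1, top_idx, top_scores)

        # MoE value projection. For clarity all experts are evaluated and the
        # unused ones are gated to zero; a real implementation would only
        # compute the selected experts to actually save compute.
        v_all = torch.einsum('btd,hedk->bthek', x, self.v_experts)     # (B, T, H, E, Dh)
        v = (gates.unsqueeze(-1) * v_all).sum(dim=-2)                  # (B, T, H, Dh)

        # Standard causal attention, but with n_heads kept small.
        att = torch.einsum('bqhd,bkhd->bhqk', q, k) / self.d_head ** 0.5
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(mask, float('-inf')).softmax(dim=-1)
        ctx = torch.einsum('bhqk,bkhd->bqhd', att, v)                  # (B, T, H, Dh)

        # MoE output projection, gated the same way.
        out_all = torch.einsum('bqhd,hedk->bqhek', ctx, self.o_experts)
        return (gates.unsqueeze(-1) * out_all).sum(dim=(-3, -2))       # (B, T, D)
```

For example, `MoEAttentionSketch(d_model=512, n_heads=2, d_head=64, n_experts=4)(torch.randn(1, 16, 512))` returns a `(1, 16, 512)` tensor. In this reading, keeping `n_heads` small is what reduces the number of attention matrices, while the per-head expert banks preserve the overall parameter count.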
Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber