MoEUT: Mixture-of-Experts Universal Transformers (2405.16039v2)
Abstract: Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in MoEs for both feedforward and attention layers of standard Transformers together with novel layer-normalization and grouping schemes that are specific and crucial to UTs. The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.
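As a rough illustration of the architecture the abstract describes, the sketch below shows a single shared Transformer layer, reused at every depth step, whose feedforward block is a token-routed mixture of experts, so the model regains parameter count without giving up layer sharing. This is a minimal sketch under assumed names and hyperparameters (`MoEFeedForward`, `SharedLayer`, `n_experts`, `k`); it uses a plain top-k softmax router, dense attention, and standard pre-layernorm, not MoEUT's actual MoE attention, layer-normalization, or expert-grouping schemes.

```python
# Hypothetical sketch (PyTorch), not the authors' code: one shared Transformer
# layer applied repeatedly in depth, with an MoE feedforward so that layer
# sharing does not collapse the parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Token-routed top-k mixture of small feedforward experts (illustrative router)."""

    def __init__(self, d_model: int, d_expert: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * d_model ** -0.5)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * d_expert ** -0.5)
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); each token picks its top-k experts.
        scores = F.softmax(self.router(x), dim=-1)          # (B, S, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)          # (B, S, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            e = idx[..., slot]                              # (B, S) expert index per token
            h = torch.relu(torch.einsum("bsd,bsde->bse", x, self.w_in[e]))
            out = out + weights[..., slot:slot + 1] * torch.einsum("bse,bsed->bsd", h, self.w_out[e])
        return out


class SharedLayer(nn.Module):
    """One layer reused at every depth step (Universal-Transformer-style recurrence)."""

    def __init__(self, d_model: int, n_heads: int, d_expert: int, n_experts: int, k: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = MoEFeedForward(d_model, d_expert, n_experts, k)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x


if __name__ == "__main__":
    layer = SharedLayer(d_model=64, n_heads=4, d_expert=128, n_experts=8, k=2)
    tokens = torch.randn(2, 16, 64)
    for _ in range(6):   # recurrence in depth: the same parameters at every step
        tokens = layer(tokens)
    print(tokens.shape)  # torch.Size([2, 16, 64])
```

With many small experts but only k of them active per token, total parameters grow with the number of experts while per-token compute stays close to that of a single dense feedforward, which is the parameter-compute trade-off the abstract refers to.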
- Attention is all you need. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 5998–6008, Long Beach, CA, USA, December 2017.
- Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131–139, 1992.
- Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.
- Tom B Brown et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Virtual only, December 2020.
- OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2022.
- OpenAI. GPT-4 technical report. Preprint arXiv:2303.08774, 2023.
- LLaMA: Open and efficient foundation language models. Preprint arXiv:2302.13971, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. on Learning Representations (ICLR), Virtual only, May 2021.
- Decision transformer: Reinforcement learning via sequence modeling. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 15084–15097, Virtual only, December 2021.
- Universal Transformers. In Int. Conf. on Learning Representations (ICLR), New Orleans, LA, USA, May 2019.
- Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
- Michael I Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proc. Conf. of the Cognitive Science Society, pages 531–546. Amherst, MA, USA, August 1986.
- Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Making transformers solve compositional tasks. In Proc. Association for Computational Linguistics (ACL), pages 3591–3607, Dublin, Ireland, May 2022.
- The devil is in the detail: Simple tricks improve systematic generalization of Transformers. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, November 2021.
- The neural data router: Adaptive control flow in transformers improves systematic generalization. In Int. Conf. on Learning Representations (ICLR), Virtual only, April 2022.
- Jürgen Schmidhuber. Self-delimiting neural networks. Preprint arXiv:1210.0118, 2012.
- Alex Graves. Adaptive computation time for recurrent neural networks. In Int. Conf. on Learning Representations (ICLR) Workshop Track, Vancouver, Canada, April 2016.
- Scaling laws for neural language models. Preprint arXiv:2001.08361, 2020.
- Scaling laws vs model architectures: How does inductive bias influence scaling? In Findings of the Association for Computational Linguistics: EMNLP, Singapore, December 2023.
- Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
- John B. Hampshire II and Alexander H. Waibel. The meta-pi network: connectionist rapid adaptation for high-performance multi-speaker phoneme recognition. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 165–168, Albuquerque, New Mexico, USA, April 1990.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Int. Conf. on Learning Representations (ICLR), Toulon, France, April 2017.
- GShard: Scaling giant models with conditional computation and automatic sharding. In Int. Conf. on Learning Representations (ICLR), Virtual only, May 2021.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Preprint arXiv:2101.03961, 2021.
- Unified scaling laws for routed language models. Preprint arXiv:2202.01169, 2022.
- Mixture of attention heads: Selecting attention heads per token. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pages 4150–4162, Abu Dhabi, United Arab Emirates, December 2022.
- Approximating two-layer feedforward networks for efficient transformers. In Findings of the Association for Computational Linguistics: EMNLP, Singapore, December 2023.
- Scaling laws for fine-grained mixture of experts. Preprint arXiv:2402.07871, 2024.
- DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. Preprint arXiv:2401.06066, 2024.
- SwitchHead: Accelerating transformers with mixture-of-experts attention. Preprint arXiv:2312.07987, December 2023.
- In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- On layer normalization in the transformer architecture. In Proc. Int. Conf. on Machine Learning (ICML), volume 119, pages 10524–10533, Virtual Only, July 2020.
- Identity mappings in deep residual networks. In Proc. European Conf. on Computer Vision (ECCV), pages 630–645, Amsterdam, Netherlands, October 2016.
- Layer normalization. Preprint arXiv:1607.06450, 2016.
- Simon J. Thorpe. Local vs. distributed coding. Intellectica, 8:3–40, 1989.
- Toy models of superposition. Transformer Circuits Thread, 2022.
- Sparse universal transformer. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pages 169–179, Singapore, December 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21:140:1–140:67, 2020.
- SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, June 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
- peS2o (Pretraining Efficiently on S2ORC) Dataset. Technical report, Allen Institute for AI, 2023. https://github.com/allenai/pes2o.
- The Stack: 3 TB of permissively licensed source code. Preprint arXiv:2211.15533, 2022.
- RoFormer: Enhanced transformer with rotary position embedding. Preprint arXiv:2104.09864, 2021.
- The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proc. Association for Computational Linguistics (ACL), Berlin, Germany, August 2016.
- BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics (TACL), 8:377–392, 2020.
- The Goldilocks principle: Reading children’s books with explicit memory representations. In Int. Conf. on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016.
- HellaSwag: Can a machine really finish your sentence? In Proc. Association for Computational Linguistics (ACL), pages 4791–4800, Florence, Italy, August 2019.
- PIQA: Reasoning about physical commonsense in natural language. In Proc. AAAI Conf. on Artificial Intelligence, pages 7432–7439, New York, NY, USA, February 2020.
- Think you have solved question answering? try ARC, the AI2 reasoning challenge. Preprint arXiv:1803.05457, 2018.
- ALBERT: A lite BERT for self-supervised learning of language representations. In Int. Conf. on Learning Representations (ICLR), Virtual only, April 2020.
- BERT: pre-training of deep bidirectional Transformers for language understanding. In Proc. North American Chapter of the Association for Computational Linguistics on Human Language Technologies (NAACL-HLT), pages 4171–4186, Minneapolis, MN, USA, June 2019.
- Systematic generalization with edge transformers. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1390–1402, Virtual only, December 2021.
- ResiDual: Transformer with dual residual connections. Preprint arXiv:2304.14802, 2023.
- Lessons on parameter sharing across layers in transformers. In SustaiNLP Workshop, pages 78–90, Toronto, Canada, July 2023.
- SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling. Preprint arXiv:2312.15166, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Proc. Advances in Neural Information Processing Systems (NeurIPS), New Orleans, Louisiana, USA, December 2022.
- PyTorch: An imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 8024–8035, Vancouver, Canada, December 2019.
- Decoupled weight decay regularization. In Int. Conf. on Learning Representations (ICLR), New Orleans, LA, USA, May 2019.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pages 66–71, Brussels, Belgium, October 2018.
Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning