MoEUT: Mixture-of-Experts Universal Transformers (2405.16039v2)

Published 25 May 2024 in cs.LG, cs.AI, and cs.NE

Abstract: Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in MoEs for both feedforward and attention layers of standard Transformers together with novel layer-normalization and grouping schemes that are specific and crucial to UTs. The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.

Authors (5)
  1. Róbert Csordás (25 papers)
  2. Kazuki Irie (35 papers)
  3. Jürgen Schmidhuber (124 papers)
  4. Christopher Potts (113 papers)
  5. Christopher D. Manning (169 papers)

Summary

A Novel Approach to Parameter-Efficient Universal Transformers: MoEUT

This paper introduces MoEUT, a mixture-of-experts (MoE)-based shared-layer Transformer architecture designed to address the longstanding limitations of Universal Transformers (UTs) in parameter-dominated tasks such as language modeling. The key innovation lies in combining recent advances in MoEs with layer-grouping and layer-normalization techniques tailored specifically to UTs. MoEUT slightly outperforms standard Transformers of comparable size while using less compute and memory.

Universal Transformers share parameters across layers, which gives them the depth-recurrent character of Recurrent Neural Networks (RNNs). Despite their theoretical advantages in compositional generalization, UTs face a practical obstacle: sharing parameters across layers drastically reduces the parameter count relative to a non-shared model of the same width, yielding an unfavorable parameter-compute ratio. Compensating by widening the shared layer makes the computational cost prohibitive, which is why shared-layer designs have struggled to stay competitive on parameter-dominated language modeling tasks.

MoEUT addresses these challenges by leveraging MoEs in both feedforward and attention layers of standard Transformers, coupled with two key architectural innovations:

  1. Layer Grouping: Instead of repeating a single shared layer, MoEUT recurrently stacks a small group of distinct layers. Grouping distributes the experts over the layers of the group, giving more flexibility in the number of experts per layer without excessive computational demands (a minimal sketch of this depth recurrence follows this list).
  2. Peri-Layernorm Scheme: A novel layer normalization method designed to optimize signal propagation in shared-layer models. This scheme applies layer normalization only before linear layers followed by sigmoid or softmax activations, effectively resolving the residual growth issue seen in conventional pre-layernorm setups.
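
To make the layer-grouping and depth recurrence concrete, here is a minimal PyTorch sketch. It is illustrative only, not the authors' implementation: plain dense Transformer layers stand in for MoEUT's MoE blocks, and all module names and sizes are assumptions.

```python
# Minimal sketch of MoEUT-style layer grouping and depth recurrence.
# Dense nn.TransformerEncoderLayer modules stand in for the MoE blocks;
# names and sizes are illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class SharedLayerGroup(nn.Module):
    """A small group of distinct layers whose parameters are reused
    every time the group is applied."""

    def __init__(self, d_model: int, group_size: int = 2):
        super().__init__()
        # Each layer inside the group has its own parameters; sharing
        # happens across repetitions of the whole group, not within it.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(group_size)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


class RecurrentStack(nn.Module):
    """Applies the same layer group n_repeats times (recurrence in depth)."""

    def __init__(self, d_model: int, group_size: int = 2, n_repeats: int = 9):
        super().__init__()
        self.group = SharedLayerGroup(d_model, group_size)
        self.n_repeats = n_repeats  # effective depth = group_size * n_repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_repeats):
            x = self.group(x)  # identical parameters at every repetition
        return x


# Example: an 18-layer-deep model built from only 2 sets of layer parameters.
model = RecurrentStack(d_model=512, group_size=2, n_repeats=9)
out = model(torch.randn(1, 16, 512))  # (batch, sequence, d_model)
```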

Experimental Results

The efficacy of MoEUT is demonstrated through comprehensive experiments on various language modeling datasets, including C4, SlimPajama, and peS2o, as well as on The Stack for code generation. Key findings include:

  • Performance Scaling: MoEUT consistently outperforms dense Transformer baselines with the same number of parameters across different parameter scales, and the performance gap generally widens with model scale, highlighting MoEUT's efficiency in large-scale settings.
  • Compute Efficiency: Measured in multiply-accumulate (MAC) operations required for training, MoEUT is significantly more efficient than the dense Transformer baselines (a toy per-token comparison follows this list).
  • Zero-shot Performance: MoEUT maintains competitive zero-shot performance on downstream tasks such as BLiMP, CBT, LAMBADA, HellaSwag, PIQA, and ARC-E, often outperforming the baseline Transformer models.
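
To illustrate why sparse expert activation reduces the MAC count per token, the toy comparison below counts the multiply-accumulates of one dense feedforward block versus one top-k MoE block; all sizes are assumed for illustration and are not the paper's configurations.

```python
# Back-of-the-envelope MAC counting for one feedforward block, per token.
# All sizes below are purely illustrative, not taken from the paper.
d_model, d_ff = 1024, 4096            # dense Transformer FF sizes (assumed)
n_experts, d_expert, k = 32, 256, 4   # MoE sizes and active experts (assumed)

dense_macs = 2 * d_model * d_ff                               # up- and down-projection
moe_macs = d_model * n_experts + k * 2 * d_model * d_expert   # router + k active experts

print(f"dense FF : {dense_macs:,} MACs/token")  # 8,388,608
print(f"MoE FF   : {moe_macs:,} MACs/token")    # 32,768 + 2,097,152 = 2,129,920
```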

Detailed Architectural Insights

Feedforward and Attention MoE Blocks: MoEUT employs σ-MoE for its feedforward blocks and adopts SwitchHead for its self-attention layers. These methods allow efficient parameterization and dynamic per-token expert selection, so that compute is spent only on the experts that are actually activated. The integration of these MoE techniques into UTs, along with the proposed adjustments, results in notable performance gains.
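
The sketch below shows the general shape of a sigmoid-gated, top-k MoE feedforward block in the spirit of σ-MoE. It is a simplified approximation rather than the authors' code: expert sizes and the top-k value are assumptions, and the per-token weight gathering is written naively where a real implementation would use grouped or fused kernels.

```python
# Hedged sketch of a sigmoid-gated top-k MoE feedforward block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaMoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        # One routing score per expert, computed from the token representation.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Expert weights stored as batched matrices: (n_experts, d_model, d_expert).
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        scores = torch.sigmoid(self.router(tokens))          # sigmoid gate, no softmax
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # per-token expert choice

        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                # chosen expert index per token
            gate = topk_scores[:, slot].unsqueeze(-1)
            # Gather each token's expert weights (simple but memory-hungry;
            # real implementations use grouped/sparse kernels instead).
            h = F.relu(torch.einsum("td,tdh->th", tokens, self.w_in[idx]))
            out = out + gate * torch.einsum("th,thd->td", h, self.w_out[idx])
        return out.reshape_as(x)


# Example usage with illustrative sizes.
ff = SigmaMoEFeedForward(d_model=512, d_expert=128, n_experts=32, k=4)
y = ff(torch.randn(2, 16, 512))
```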

Layer Grouping: The layer-grouping approach in MoEUT, with a typical group size between 2 and 4, enhances model performance by reducing the number of experts per layer and increasing the total number of attention heads. This configuration ensures balanced computational load and preserves the model’s ability to handle complex sequences effectively.

Peri-Layernorm Scheme: The peri-layernorm scheme resolves the residual norm growth issue without sacrificing gradient flow, which is crucial for training deep models. This method circumvents the limitations of both pre-layernorm and post-layernorm, ensuring efficient signal propagation through shared-layer architectures.
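
The following sketch contrasts standard pre-layernorm with a peri-layernorm-style placement, in which only projections whose outputs feed a softmax (here, the attention logits) see a LayerNorm, while the residual stream and the value path are left unnormalized. The exact placement used in MoEUT differs in its details; this is an illustrative approximation only.

```python
# Hedged contrast between pre-layernorm and a peri-layernorm-style placement.
import torch
import torch.nn as nn


class PreLNSublayer(nn.Module):
    """Standard pre-layernorm: normalize the whole sublayer input."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # The residual stream itself is never normalized, so its norm grows.
        return x + self.sublayer(self.norm(x))


class PeriLNAttentionScores(nn.Module):
    """Peri-layernorm-style attention scoring: only the query/key projections,
    whose output goes through a softmax, see a LayerNorm; values do not."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.q = nn.Linear(d_model, d_head, bias=False)
        self.k = nn.Linear(d_model, d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x):
        xn = self.norm(x)  # normalized input used only for the softmax logits
        logits = self.q(xn) @ self.k(xn).transpose(-2, -1) / self.q.out_features ** 0.5
        attn = logits.softmax(dim=-1)
        return attn @ self.v(x)  # values taken from the raw residual stream
```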

Analysis and Implications

The paper further investigates the expert selection dynamics within MoEUT models. Key findings include:

  • Expert Reuse Across Layers: The analysis shows that the same experts are reused at different depths, with their layer assignment adapting to the computation being performed. This flexibility underscores the model’s ability to adapt to varied contexts and tasks.
  • Expert Diversity: Expert selection varies widely across tokens and contexts, indicating that MoEUT makes effective use of its whole expert pool, which enhances its adaptability and performance.
  • Dynamic Expert Selection: Per-column analysis shows that expert selection is dynamic and context-dependent, with only partial overlap between the expert sets chosen at different depth steps (a minimal sketch of this overlap measure follows this list).
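
As a concrete illustration of the overlap comparison mentioned above, the hypothetical snippet below computes the intersection-over-union between the sets of experts selected for the same token at two repetitions of a shared layer; all indices are made up.

```python
# Toy intersection-over-union (IoU) of expert selections across depth steps.
import torch


def expert_iou(sel_a: torch.Tensor, sel_b: torch.Tensor) -> float:
    """sel_a, sel_b: 1-D tensors of expert indices chosen at two depth steps."""
    a, b = set(sel_a.tolist()), set(sel_b.tolist())
    return len(a & b) / max(len(a | b), 1)


# Illustrative routing decisions for one token at two repetitions of the
# shared layer (k = 4 experts selected out of, say, 32).
step1 = torch.tensor([3, 17, 21, 30])
step2 = torch.tensor([3, 8, 21, 29])
print(expert_iou(step1, step2))  # 2 shared / 6 total ≈ 0.33
```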

Future Directions

MoEUT demonstrates, for the first time, that a shared-layer Universal Transformer can be competitive in large-scale language modeling. Future research avenues include:

  • Optimizing the CUDA kernel implementation to enhance training and inference speeds.
  • Exploring larger-scale experiments to further validate MoEUT’s advantages in extensive computational settings.
  • Investigating the application of MoEUT in additional domains beyond language modeling and code generation, potentially including image processing and reinforcement learning.

In conclusion, MoEUT represents a significant advancement in the development of parameter-efficient Universal Transformers, achieving competitive performance with reduced computational costs. This work not only addresses the fundamental limitations of traditional UT architectures but also opens new pathways for scalable and efficient neural network models.
