MoEUT: Mixture-of-Experts Universal Transformers (2405.16039v2)

Published 25 May 2024 in cs.LG, cs.AI, and cs.NE

Abstract: Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in MoEs for both feedforward and attention layers of standard Transformers together with novel layer-normalization and grouping schemes that are specific and crucial to UTs. The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.

Authors (5)
  1. Róbert Csordás (25 papers)
  2. Kazuki Irie (35 papers)
  3. Jürgen Schmidhuber (124 papers)
  4. Christopher Potts (113 papers)
  5. Christopher D. Manning (169 papers)

Summary

A Novel Approach to Parameter-Efficient Universal Transformers: MoEUT

This paper introduces MoEUT, a mixture-of-experts (MoE)-based shared-layer Transformer architecture designed to address the longstanding limitations of Universal Transformers (UTs) in parameter-dominated tasks such as language modeling. The key innovation lies in combining recent advances in MoEs with layer-grouping and layer-normalization techniques tailored specifically to UTs. MoEUT slightly outperforms standard Transformers of comparable size while using less compute and memory.

Universal Transformers share parameters across layers, which gives them the depth-recurrent character of Recurrent Neural Networks (RNNs). Despite their theoretical advantages in compositional generalization, UTs face a practical obstacle: sharing parameters across layers drastically reduces the parameter count relative to a non-shared model of the same width, yielding an unfavorable parameter-compute ratio. Compensating by widening the shared layer makes the computational cost prohibitive, which is why shared-layer designs have struggled to stay competitive on parameter-dominated language modeling tasks.

MoEUT addresses these challenges by leveraging MoEs in both feedforward and attention layers of standard Transformers, coupled with two key architectural innovations:

  1. Layer Grouping: Instead of repeating a single shared layer, MoEUT recurrently stacks a small group of distinct layers. Grouping distributes the experts over the layers of the group, giving more flexibility in the number of experts per layer without excessive computational demands (a minimal sketch of this depth recurrence follows this list).
  2. Peri-Layernorm Scheme: A novel layer normalization method designed to optimize signal propagation in shared-layer models. This scheme applies layer normalization only before linear layers followed by sigmoid or softmax activations, effectively resolving the residual growth issue seen in conventional pre-layernorm setups.
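
To make the layer-grouping and depth recurrence concrete, here is a minimal PyTorch sketch. It is illustrative only, not the authors' implementation: plain dense Transformer layers stand in for MoEUT's MoE blocks, and all module names and sizes are assumptions.

```python
# Minimal sketch of MoEUT-style layer grouping and depth recurrence.
# Dense nn.TransformerEncoderLayer modules stand in for the MoE blocks;
# names and sizes are illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class SharedLayerGroup(nn.Module):
    """A small group of distinct layers whose parameters are reused
    every time the group is applied."""

    def __init__(self, d_model: int, group_size: int = 2):
        super().__init__()
        # Each layer inside the group has its own parameters; sharing
        # happens across repetitions of the whole group, not within it.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(group_size)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


class RecurrentStack(nn.Module):
    """Applies the same layer group n_repeats times (recurrence in depth)."""

    def __init__(self, d_model: int, group_size: int = 2, n_repeats: int = 9):
        super().__init__()
        self.group = SharedLayerGroup(d_model, group_size)
        self.n_repeats = n_repeats  # effective depth = group_size * n_repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_repeats):
            x = self.group(x)  # identical parameters at every repetition
        return x


# Example: an 18-layer-deep model built from only 2 sets of layer parameters.
model = RecurrentStack(d_model=512, group_size=2, n_repeats=9)
out = model(torch.randn(1, 16, 512))  # (batch, sequence, d_model)
```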

Experimental Results

The efficacy of MoEUT is demonstrated through comprehensive experiments on various language modeling datasets, including C4, SlimPajama, and peS2o, as well as on The Stack for code generation. Key findings include:

  • Performance Scaling: MoEUT consistently outperforms dense Transformer baselines with the same number of parameters across different parameter scales, and the performance gap generally widens with model scale, highlighting MoEUT's efficiency in large-scale settings.
  • Compute Efficiency: Measured in multiply-accumulate (MAC) operations required for training, MoEUT is significantly more efficient than the dense Transformer baselines (a toy per-token comparison follows this list).
  • Zero-shot Performance: MoEUT maintains competitive zero-shot performance on downstream tasks such as BLiMP, CBT, LAMBADA, HellaSwag, PIQA, and ARC-E, often outperforming the baseline Transformer models.
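
To illustrate why sparse expert activation reduces the MAC count per token, the toy comparison below counts the multiply-accumulates of one dense feedforward block versus one top-k MoE block; all sizes are assumed for illustration and are not the paper's configurations.

```python
# Back-of-the-envelope MAC counting for one feedforward block, per token.
# All sizes below are purely illustrative, not taken from the paper.
d_model, d_ff = 1024, 4096            # dense Transformer FF sizes (assumed)
n_experts, d_expert, k = 32, 256, 4   # MoE sizes and active experts (assumed)

dense_macs = 2 * d_model * d_ff                               # up- and down-projection
moe_macs = d_model * n_experts + k * 2 * d_model * d_expert   # router + k active experts

print(f"dense FF : {dense_macs:,} MACs/token")  # 8,388,608
print(f"MoE FF   : {moe_macs:,} MACs/token")    # 32,768 + 2,097,152 = 2,129,920
```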

Detailed Architectural Insights

Feedforward and Attention MoE Blocks: MoEUT employs σ-MoE for its feedforward blocks and adopts SwitchHead for its self-attention layers. These methods allow efficient parameterization and dynamic per-token expert selection, so that compute is spent only on the experts that are actually activated. The integration of these MoE techniques into UTs, along with the proposed adjustments, results in notable performance gains.
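
The sketch below shows the general shape of a sigmoid-gated, top-k MoE feedforward block in the spirit of σ-MoE. It is a simplified approximation rather than the authors' code: expert sizes and the top-k value are assumptions, and the per-token weight gathering is written naively where a real implementation would use grouped or fused kernels.

```python
# Hedged sketch of a sigmoid-gated top-k MoE feedforward block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaMoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        # One routing score per expert, computed from the token representation.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Expert weights stored as batched matrices: (n_experts, d_model, d_expert).
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        scores = torch.sigmoid(self.router(tokens))          # sigmoid gate, no softmax
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # per-token expert choice

        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                # chosen expert index per token
            gate = topk_scores[:, slot].unsqueeze(-1)
            # Gather each token's expert weights (simple but memory-hungry;
            # real implementations use grouped/sparse kernels instead).
            h = F.relu(torch.einsum("td,tdh->th", tokens, self.w_in[idx]))
            out = out + gate * torch.einsum("th,thd->td", h, self.w_out[idx])
        return out.reshape_as(x)


# Example usage with illustrative sizes.
ff = SigmaMoEFeedForward(d_model=512, d_expert=128, n_experts=32, k=4)
y = ff(torch.randn(2, 16, 512))
```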

Layer Grouping: The layer-grouping approach in MoEUT, with a typical group size between 2 and 4, enhances model performance by reducing the number of experts per layer and increasing the total number of attention heads. This configuration ensures balanced computational load and preserves the model’s ability to handle complex sequences effectively.

Peri-Layernorm Scheme: The peri-layernorm scheme resolves the residual norm growth issue without sacrificing gradient flow, which is crucial for training deep models. This method circumvents the limitations of both pre-layernorm and post-layernorm, ensuring efficient signal propagation through shared-layer architectures.
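
The following sketch contrasts standard pre-layernorm with a peri-layernorm-style placement, in which only projections whose outputs feed a softmax (here, the attention logits) see a LayerNorm, while the residual stream and the value path are left unnormalized. The exact placement used in MoEUT differs in its details; this is an illustrative approximation only.

```python
# Hedged contrast between pre-layernorm and a peri-layernorm-style placement.
import torch
import torch.nn as nn


class PreLNSublayer(nn.Module):
    """Standard pre-layernorm: normalize the whole sublayer input."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # The residual stream itself is never normalized, so its norm grows.
        return x + self.sublayer(self.norm(x))


class PeriLNAttentionScores(nn.Module):
    """Peri-layernorm-style attention scoring: only the query/key projections,
    whose output goes through a softmax, see a LayerNorm; values do not."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.q = nn.Linear(d_model, d_head, bias=False)
        self.k = nn.Linear(d_model, d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x):
        xn = self.norm(x)  # normalized input used only for the softmax logits
        logits = self.q(xn) @ self.k(xn).transpose(-2, -1) / self.q.out_features ** 0.5
        attn = logits.softmax(dim=-1)
        return attn @ self.v(x)  # values taken from the raw residual stream
```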

Analysis and Implications

The paper further investigates the expert selection dynamics within MoEUT models. Key findings include:

  • Expert Reuse Across Layers: The analysis shows that the same experts are reused at different depths, with their layer assignment adapting to the computation being performed. This flexibility underscores the model’s ability to adapt to varied contexts and tasks.
  • Expert Diversity: Expert selection varies widely across tokens and contexts, indicating that MoEUT makes effective use of its whole expert pool, which enhances its adaptability and performance.
  • Dynamic Expert Selection: Per-column analysis shows that expert selection is dynamic and context-dependent, with only partial overlap between the expert sets chosen at different depth steps (a minimal sketch of this overlap measure follows this list).
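
As a concrete illustration of the overlap comparison mentioned above, the hypothetical snippet below computes the intersection-over-union between the sets of experts selected for the same token at two repetitions of a shared layer; all indices are made up.

```python
# Toy intersection-over-union (IoU) of expert selections across depth steps.
import torch


def expert_iou(sel_a: torch.Tensor, sel_b: torch.Tensor) -> float:
    """sel_a, sel_b: 1-D tensors of expert indices chosen at two depth steps."""
    a, b = set(sel_a.tolist()), set(sel_b.tolist())
    return len(a & b) / max(len(a | b), 1)


# Illustrative routing decisions for one token at two repetitions of the
# shared layer (k = 4 experts selected out of, say, 32).
step1 = torch.tensor([3, 17, 21, 30])
step2 = torch.tensor([3, 8, 21, 29])
print(expert_iou(step1, step2))  # 2 shared / 6 total ≈ 0.33
```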

Future Directions

MoEUT demonstrates, for the first time, that a shared-layer Universal Transformer can be competitive in large-scale language modeling. Future research avenues include:

  • Optimizing the CUDA kernel implementation to enhance training and inference speeds.
  • Exploring larger-scale experiments to further validate MoEUT’s advantages in extensive computational settings.
  • Investigating the application of MoEUT in additional domains beyond language modeling and code generation, potentially including image processing and reinforcement learning.

In conclusion, MoEUT represents a significant advancement in the development of parameter-efficient Universal Transformers, achieving competitive performance with reduced computational costs. This work not only addresses the fundamental limitations of traditional UT architectures but also opens new pathways for scalable and efficient neural network models.
