MoEfication: Transformer Feed-forward Layers are Mixtures of Experts (2110.01786v3)

Published 5 Oct 2021 in cs.CL

Abstract: Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs only activate a tiny ratio of neurons of FFNs. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can conditionally use 10% to 30% of FFN parameters while maintaining over 95% original performance for different models on various downstream tasks. Besides, MoEfication brings two advantages: (1) it significantly reduces the FLOPS of inference, i.e., 2x speedup with 25% of FFN parameters, and (2) it provides a fine-grained perspective to study the inner mechanism of FFNs. The source code of this paper can be obtained from https://github.com/thunlp/MoEfication.

Citations (100)

Summary

  • The paper introduces MoEfication, transforming Transformer FFN layers into mixtures of experts to leverage sparse activation patterns.
  • It partitions FFN parameters into expert groups using co-activation statistics and employs dynamic routing based on input features.
  • Experiments show that activating only 10-30% of parameters retains over 95% performance while achieving up to 2x speedup.

An Analysis of "MoEfication: Transformer Feed-forward Layers are Mixtures of Experts"

The paper "MoEfication: Transformer Feed-forward Layers are Mixtures of Experts" explores an innovative approach to transforming traditional Transformer models by leveraging the underutilized potentials within the Feed-Forward Networks (FFNs). Authors Zhang et al. propose a method known as MoEfication, which reconceptualizes the dense FFN layers in Transformers as a Mixture-of-Experts (MoE) framework, then adaptively utilizes these expert networks during inference.

Core Concept and Methodology

The central thesis of the paper hinges on the observation of sparse activation within Transformer FFNs: for any given input, only a small subset of FFN neurons is actually active. This sparsity mirrors the functional partitioning observed in biological brains, where localized regions handle specific functions.
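This sparsity observation is straightforward to probe in code. The sketch below is a minimal illustration rather than the authors' measurement script; the model dimensions are assumptions, and with random weights the ratio is near 50%, whereas the paper reports far sparser activation in trained models.

```python
# Minimal sketch, not the authors' code: measure how many FFN hidden neurons
# "fire" (ReLU output > 0) for each token. Shapes follow a typical base-size
# Transformer and are assumptions for illustration only.
import torch
import torch.nn as nn

d_model, d_ff, n_tokens = 768, 3072, 128

ffn_in = nn.Linear(d_model, d_ff)            # W_1 of the FFN (input projection)
x = torch.randn(n_tokens, d_model)           # stand-in for token representations
h = torch.relu(ffn_in(x))                    # FFN hidden activations

active_ratio = (h > 0).float().mean(dim=-1)  # per-token fraction of active neurons
print(f"mean fraction of active neurons: {active_ratio.mean().item():.2%}")
```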

MoEfication Framework: The authors introduce the process of MoEfication, which unfolds in two main stages:

  1. Expert Construction: The parameters of the FFNs are partitioned into separate groups termed "experts." These groups are formed using methods such as Co-Activation Graph Split, which partitions neurons based on co-activation statistics observed in the FFNs, so that neurons likely to activate together end up in the same expert.
  2. Expert Selection: For each input, a dynamic router decides which experts to activate, conditioned on the input representation. Multiple routing strategies are explored, including MLP-based selection and cosine similarity measures, to optimize the selection. A simplified sketch of both stages follows this list.
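The sketch below illustrates both stages under simplifying assumptions, not the authors' implementation: plain k-means over binarized neuron firing patterns stands in for the paper's Co-Activation Graph Split (which uses balanced graph partitioning, so its experts are equal-sized, unlike k-means clusters), and a small MLP router stands in for the learned selector. The sizes, expert count, and top-k fraction are illustrative.

```python
# Minimal sketch of the two MoEfication stages; NOT the released implementation.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

d_model, d_ff, n_experts = 768, 3072, 32

# --- Stage 1: expert construction -------------------------------------------
# Suppose we recorded hidden activations h (tokens x neurons) from a trained FFN.
h = torch.relu(torch.randn(1024, d_ff))

# Cluster neurons by how similarly they fire across inputs (simplified stand-in
# for the paper's balanced co-activation graph partitioning).
firing_patterns = (h > 0).float().t().numpy()             # (d_ff, n_tokens)
labels = KMeans(n_clusters=n_experts, n_init=10).fit_predict(firing_patterns)
experts = [torch.from_numpy((labels == e).nonzero()[0]) for e in range(n_experts)]

# --- Stage 2: expert selection -----------------------------------------------
# An MLP router scores experts from the token representation; keep the top-k.
router = nn.Sequential(nn.Linear(d_model, 256), nn.Tanh(), nn.Linear(256, n_experts))

x = torch.randn(1, d_model)                                # one token representation
scores = router(x)
top_k = scores.topk(k=n_experts // 4, dim=-1).indices      # activate ~25% of experts
print("selected experts:", sorted(top_k.flatten().tolist()))
```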

The paper reports that MoEfication allows the selective activation of only 10-30% of FFN parameters for given tasks, while preserving over 95% of the original model's performance.
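To make the savings concrete, the following sketch (an assumption-laden illustration, not the released MoEfication code) shows an FFN forward pass that multiplies only by the rows of W1 and columns of W2 belonging to the selected experts; it assumes neurons have been reordered so each expert owns a contiguous block of the hidden dimension.

```python
# Minimal sketch, not the authors' code: compute the FFN using only the hidden
# neurons of the selected experts. Skipping unselected rows of W1 and columns
# of W2 is where the FLOP savings come from.
import torch
import torch.nn as nn

d_model, d_ff, n_experts = 768, 3072, 32
neurons_per_expert = d_ff // n_experts

W1 = nn.Linear(d_model, d_ff)    # original FFN input projection
W2 = nn.Linear(d_ff, d_model)    # original FFN output projection

def moe_ffn(x, selected_experts):
    """FFN forward pass restricted to the neurons of the selected experts."""
    idx = torch.cat([torch.arange(e * neurons_per_expert, (e + 1) * neurons_per_expert)
                     for e in selected_experts])
    h = torch.relu(x @ W1.weight[idx].t() + W1.bias[idx])    # partial hidden layer
    return h @ W2.weight[:, idx].t() + W2.bias               # project back to d_model

x = torch.randn(1, d_model)
out = moe_ffn(x, selected_experts=[0, 3, 7, 12])   # 4 of 32 experts (~12.5% of d_ff)
print(out.shape)                                   # torch.Size([1, 768])
```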

Empirical Results and Observations

Experiments are conducted on a range of NLP benchmarks, including GLUE, SQuAD, and RACE, using multiple Transformer models (e.g., T5 and BERT variants). Key findings include:

  • Sparsity and Model Size: Larger models (e.g., T5-XLarge) exhibit more pronounced sparsity and therefore retain performance better under MoEfication when only a small fraction of parameters is activated.
  • Efficiency Gains: Using 25% of FFN parameters, the method roughly halves the floating point operations (FLOPs) of inference, i.e., an approximately 2x speedup, with corresponding wall-clock gains observed on both CPU and GPU.

Implications and Future Directions

The MoEfication approach opens several avenues for both theoretical exploration and practical application:

  • Practical Efficiency in PLMs: By reducing computational overhead while maintaining performance, MoEfication facilitates deploying large pre-trained language models in resource-constrained environments.
  • Interpretability of Neural Networks: The expert-based partitioning leads to insights into the inner workings of FFNs, highlighting which subsets of neurons contribute to specific tasks or linguistic functions.
  • Optimization Strategies: Future exploration could focus on optimizing the learning and inference strategies of MoE, leveraging new insights on routing and activation patterns.

Conclusion

Overall, the work by Zhang et al. offers a significant perspective on understanding and optimizing Transformer models via MoEfication. While the presented results establish a solid foundation, future research could explore improving router efficiency and further reducing the activation footprint without compromising model expressiveness. The approach aligns well with ongoing efforts to make AI models not only more powerful but also more efficient and interpretable.
