BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
The paper "BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts" by Zhang et al. proposes an innovative procedure to efficiently transform dense pre-trained models into a Mixture of Experts (MoE) structure while making full use of both feed-forward network (FFN) and attention parameters. The framework, termed as Branch-Attend-Mix (BAM), extends existing methodologies that solely upcycle FFN layers by also incorporating attention layers, thereby maximizing parameter utilization and improving model performance under equivalent computational budgets.
Introduction
MoE architectures have emerged as a competitive alternative to dense models for large language models (LLMs). They scale model capacity without a proportional growth in computational cost by activating only a subset of parameters for each input. However, training MoEs from scratch remains computationally intensive and prone to instability. Prior work mitigates this by initializing ("upcycling") MoEs from pre-trained dense models, but largely restricts this initialization to the FFN layers, leaving the attention parameters of the dense models underutilized.
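The upcycling idea can be illustrated with a minimal sketch, written here as illustrative PyTorch code rather than the authors' implementation: each FFN expert is initialized as a copy of a pre-trained dense model's FFN, and a learned router activates only the top-k experts per token. All module and dimension names below are hypothetical.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoEFFN(nn.Module):
    """Toy sparse MoE FFN whose experts are copies of pre-trained dense FFNs.

    `dense_ffns` is a list of FFN modules taken from dense models; only
    `top_k` experts are evaluated per token, so the active parameter count
    grows much more slowly than the total parameter count.
    """

    def __init__(self, d_model, dense_ffns, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([copy.deepcopy(f) for f in dense_ffns])
        self.router = nn.Linear(d_model, len(self.experts))
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        gate_logits = self.router(x)               # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: upcycle two hypothetical dense FFNs (e.g. from code- and math-specialized models).
d_model = 16
dense_ffns = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                            nn.Linear(4 * d_model, d_model)) for _ in range(2)]
moe_ffn = UpcycledMoEFFN(d_model, dense_ffns, top_k=1)
tokens = torch.randn(8, d_model)
print(moe_ffn(tokens).shape)                       # torch.Size([8, 16])
```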
BAM: Branch-Attend-Mix Framework
The proposed BAM framework makes full use of pre-trained dense model parameters when initializing an MoE. BAM builds on the Branch-Train-MiX (BTX) approach with the following key innovations, sketched in code after the list:
- Full Utilization of Attention Experts: Unlike previous methods that average attention parameters across dense models, BAM initializes distinct attention experts using specialized attention parameters from each dense model, thereby preserving specialized knowledge.
- Parallel Computation of Experts: Employing a parallel attention transformer architecture, BAM computes attention and FFN experts concurrently, which improves training throughput.
- Soft Routing for Enhanced Stability: Soft routing is applied to assign each token to all attention experts, mitigating the risk of gradient instabilities and ensuring balanced parameter utilization.
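The sketch below makes these three ingredients concrete. It is a schematic illustration under simplifying assumptions (no normalization, masking, or positional encoding; class and variable names are hypothetical), not the authors' code: attention experts are initialized from each dense model's attention weights and soft-routed, so every token is processed by all of them with softmax mixing weights; FFN experts are sparsely routed with top-k; and both branches read the same input, as in a parallel-attention transformer block.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class BAMStyleBlock(nn.Module):
    """Schematic BAM-style layer (illustrative only).

    - Attention experts: one per specialized dense model, soft-routed
      (a softmax over experts weights all of their outputs for every token).
    - FFN experts: one per dense model, sparsely routed with top-k.
    - Parallel-attention layout: both branches are computed from the same
      input and summed into the residual stream.
    """

    def __init__(self, d_model, dense_attns, dense_ffns, top_k=1):
        super().__init__()
        self.attn_experts = nn.ModuleList([copy.deepcopy(a) for a in dense_attns])
        self.ffn_experts = nn.ModuleList([copy.deepcopy(f) for f in dense_ffns])
        self.attn_router = nn.Linear(d_model, len(self.attn_experts))
        self.ffn_router = nn.Linear(d_model, len(self.ffn_experts))
        self.top_k = top_k

    def forward(self, x):                                # x: (batch, seq, d_model)
        # Soft routing: every attention expert runs on every token.
        attn_w = F.softmax(self.attn_router(x), dim=-1)  # (batch, seq, n_attn)
        attn_out = sum(attn_w[..., i:i + 1] * expert(x, x, x, need_weights=False)[0]
                       for i, expert in enumerate(self.attn_experts))

        # Sparse top-k routing for FFN experts (all experts are looped over here
        # for clarity; a real implementation dispatches only the selected ones).
        top_w, top_i = self.ffn_router(x).topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)
        ffn_out = torch.zeros_like(x)
        for i, expert in enumerate(self.ffn_experts):
            gate = ((top_i == i).float() * top_w).sum(dim=-1, keepdim=True)
            ffn_out = ffn_out + gate * expert(x)

        # Parallel-attention transformer: both branches added to the residual.
        return x + attn_out + ffn_out

# Usage with two hypothetical specialized dense models (e.g. code and math).
d_model = 16
dense_attns = [nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
               for _ in range(2)]
dense_ffns = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                            nn.Linear(4 * d_model, d_model)) for _ in range(2)]
block = BAMStyleBlock(d_model, dense_attns, dense_ffns)
print(block(torch.randn(2, 5, d_model)).shape)            # torch.Size([2, 5, 16])
```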
Experimental Validation
The efficacy of BAM was demonstrated on models ranging from 590 million to 2 billion parameters, covering both small-scale and large-scale settings. Key observations include:
- Perplexity: BAM consistently achieved lower perplexity than both BTX and dense baselines across specialized domains such as code, law, and mathematics.
- Downstream Task Performance: At the larger scale, BAM outperformed baselines on benchmark tasks involving mathematical reasoning, code generation, and legal question answering, indicating that the specialized knowledge retained in the attention experts translates into practical gains.
Ablation Studies
Several ablation studies were conducted to dissect critical components influencing BAM’s performance:
- Total Versus Active Parameters: BAM outperformed parameter-matched BTX models despite using an equivalent or even smaller number of active parameters.
- Soft Routing: Experiments indicated that soft routing in attention layers was crucial for BAM’s enhanced performance. Other routing mechanisms like top-1 and top-2 routing failed to match the effectiveness of soft routing.
- Inference Efficiency: BAM's inference FLOPs were slightly higher than those of standard BTX because soft routing runs every attention expert on every token, but the parallel attention transformer architecture partly offsets the added cost, as illustrated below.
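The routing ablation and the FLOPs trade-off can be made concrete with a small sketch (all names here are hypothetical, not from the paper): a single router produces per-expert scores, and the only difference between the variants is how many experts each token actually pays for. Soft routing keeps a full softmax over all experts, while top-1 and top-2 routing zero out all but the highest-scoring ones, reducing compute at the cost of the stability and quality benefits reported for BAM's attention experts.

```python
import torch
import torch.nn.functional as F

def route(logits: torch.Tensor, mode: str = "soft") -> torch.Tensor:
    """Return per-expert mixing weights for each token.

    logits: (tokens, num_experts) raw router scores.
    mode:   "soft"  -> every expert gets a non-zero weight (all experts run);
            "top1"/"top2" -> only the selected experts get weight (others skipped).
    """
    if mode == "soft":
        return F.softmax(logits, dim=-1)
    k = {"top1": 1, "top2": 2}[mode]
    top_vals, top_idx = logits.topk(k, dim=-1)
    weights = torch.zeros_like(logits)
    weights.scatter_(-1, top_idx, F.softmax(top_vals, dim=-1))
    return weights

logits = torch.randn(4, 3)                      # 4 tokens, 3 attention experts
for mode in ("soft", "top1", "top2"):
    w = route(logits, mode)
    evaluated = (w > 0).sum(dim=-1).float().mean().item()
    print(f"{mode}: avg experts evaluated per token = {evaluated:.1f}")
# Soft routing evaluates all 3 experts per token (higher FLOPs), while
# top-1/top-2 evaluate only 1 or 2, which is why BAM's attention layers
# cost slightly more at inference than a top-k alternative.
```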
Implications and Future Work
The implications of BAM are significant for both theoretical advancements and practical deployments of large-scale LLMs. By fully leveraging specialized dense models, BAM enhances the modularity and scalability of MoEs, which is pivotal for realizing high-performance models under computational constraints. The approach also lowers the barrier to building MoE models by allowing extensive reuse of pre-trained dense models available in open-source communities.
Future work could focus on optimizing the training data mixture for BAM’s three phases and enhancing the framework to further reduce training and inference times. Additionally, exploring diverse routing mechanisms might yield further improvements in expert utilization efficiency.
Conclusion
BAM offers a robust, efficient method for parameter upcycling, surpassing existing approaches by fully leveraging the parameters of specialized dense models. Soft routing over attention experts, combined with parallel computation of the attention and FFN branches, yields a substantial improvement in model performance under a comparable computational budget. The research demonstrates a clear path forward for further innovations in MoE architectures, particularly in leveraging pre-trained dense models to their fullest potential.
The paper by Zhang et al. is a noteworthy contribution to the ongoing exploration of scalable and efficient LLMs, presenting actionable insights and strategies that can be readily applied to future developments in the field.