
Llama 3 Meets MoE: Efficient Upcycling (2412.09952v1)

Published 13 Dec 2024 in cs.LG

Abstract: Scaling LLMs significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2\%}$ improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of $\textbf{46.8\%}$ during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.

The paper presents a technically detailed methodology for efficiently upcycling dense LLMs into Mixture-of-Experts (MoE) architectures, specifically converting a pre-trained Llama 3-8B checkpoint into an 8-Expert, Top-2 MoE model. The work leverages the observation that while increasing model capacity significantly improves performance on downstream tasks, traditional dense model scaling is computationally prohibitive. By reusing already invested pre-training costs, the method achieves enhanced capacity with less than 1% of the compute typically required for training an MoE model from scratch.

The key contributions and technical details are as follows:

Dense-to-MoE Upcycling Procedure

  • The approach replicates the dense model’s feed-forward layers across multiple experts: each dense feed-forward layer is copied exactly to initialize all experts of the corresponding MoE layer, and a new, randomly initialized gating network is introduced.
  • The MoE operation is formally given by $y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$, where
    • $y$: output of the MoE layer,
    • $G(x)_i$: gating score for the $i^{th}$ expert,
    • $E_i(x)$: output of the $i^{th}$ expert,
    • $N$: number of experts.
  • The paper adopts a Noisy Top-k gating mechanism, meaning that only the top $k$ experts (specifically, Top-2 in this instance) are activated per token. This is achieved by computing noisy gating projections, retaining the top $k$ scores via a KeepTopK operator, and then applying Softmax normalization. A minimal sketch of the upcycling and gating follows below.
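
As a concrete illustration of the procedure above, the following PyTorch sketch copies a pre-trained dense feed-forward module into each expert and applies noisy Top-2 gating (KeepTopK followed by Softmax). Class and argument names (`UpcycledMoE`, `dense_ffn`, `hidden_size`) are hypothetical; this is a minimal single-device sketch under those assumptions, not the authors' NeMo/Megatron implementation.

```python
# Illustrative sketch (not the paper's code): upcycling a dense FFN into an
# 8-expert Top-2 MoE layer with noisy Top-k gating.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert starts as an exact copy of the pre-trained dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        # The gating (router) and noise projections are new and randomly initialized.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.noise = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):                               # x: [num_tokens, hidden_size]
        logits = self.gate(x)
        if self.training:                               # noisy gating during training
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)   # KeepTopK
        gates = F.softmax(topk_vals, dim=-1)            # Softmax over retained scores
        # y = sum_i G(x)_i * E_i(x), restricted to the Top-k experts per token.
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    y[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return y
```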

Training Framework and Parallelism Strategies

  • The methodology exploits a 5-D hybrid parallelism paradigm incorporating Tensor Parallelism (TP), Expert Parallelism (EP), Pipeline Parallelism (PP), Context Parallelism (CP), and Data Parallelism (DP). This design is critical for efficiently handling the dramatic increase in parameter count resulting from the MoE conversion while keeping the most communication-intensive traffic within high-bandwidth interconnects (e.g., NVLink).
  • A novel strategy termed MoE Parallel Folding is introduced. It decouples the Attention module and the MoE component of the Transformer so that each can use an independent parallelism configuration. For example, the Attention layer might use a TP×CP group (e.g., TP2CP2) while the MoE layer uses a TP×EP group (e.g., TP1EP8). This separation folds the communication-intensive patterns of both components into a single node, substantially mitigating latency; a simplified rank-grouping sketch follows below.
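
To make the folding concrete, the short sketch below shows how the example groupings (TP2CP2 for Attention, TP1EP8 for the MoE layer) could be laid out over the eight GPUs of one node so that both communication patterns stay on NVLink. The consecutive rank layout is an assumption for illustration only; the actual group construction in Megatron-Core/NeMo may differ.

```python
# Illustrative sketch: independent parallel groups for Attention (TP x CP) and
# the MoE layer (TP x EP), folded inside one 8-GPU node.
NODE_RANKS = list(range(8))  # GPUs 0..7 on a single NVLink-connected node

def chunk(ranks, group_size):
    """Partition a flat rank list into consecutive groups of `group_size`."""
    return [ranks[i:i + group_size] for i in range(0, len(ranks), group_size)]

# Attention: TP=2, CP=2 -> groups of 4 ranks (two groups per node).
attention_groups = chunk(NODE_RANKS, 2 * 2)   # [[0, 1, 2, 3], [4, 5, 6, 7]]

# MoE: TP=1, EP=8 -> one group spanning all 8 ranks of the node.
moe_groups = chunk(NODE_RANKS, 1 * 8)         # [[0, 1, 2, 3, 4, 5, 6, 7]]

# Because both group types stay inside the node, the tensor/context-parallel
# collectives and the expert-parallel all-to-all all run over NVLink.
print("Attention groups:", attention_groups)
print("MoE groups:", moe_groups)
```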

Empirical Training Insights and Ablation Studies

  • The model is trained on a blended dataset comprising approximately 0.89 trillion tokens (from deduplicated and filtered RedPajama V2 pretraining data) and roughly 2.7 billion tokens from academic benchmarks, using a 7:3 ratio.
  • Quantitatively, the upcycled MoE model exhibits a 2% improvement in 0-shot accuracy on the MMLU benchmark and around a 1.2% boost in overall normalized accuracy compared to the original dense Llama 3-8B model.
  • Two main ablation studies are presented:
    • Choice of Capacity Factor (CF): The capacity factor, which regulates the average token load per expert, was varied (including dropless training, i.e., an effectively infinite CF). The experiments show that while lower CF settings (e.g., CF = 1) yield higher Model FLOPs Utilization (MFU up to 46.8%), CF values of 2 or 4 result in better downstream task performance due to their implicit regularization effects. CF = 4 was ultimately chosen to balance MFU and MMLU performance.
    • Routing Algorithm Variants: The paper contrasts two routing strategies. The Mixtral-type router, which applies KeepTopK before the Softmax operation, preserves the output characteristics of the dense model during the initial forward pass and leads to faster convergence. Reversing the order (Softmax before KeepTopK) introduced instability and slower convergence, justifying the selection of the Mixtral configuration; the two orderings are contrasted in the sketch after this list.
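
Because every expert is an exact copy of the dense FFN at initialization, the MoE output reduces to $\left(\sum_i G(x)_i\right) \mathrm{FFN}(x)$, so only a gating scheme whose retained weights sum to 1 reproduces the dense activations on the first forward pass. The sketch below contrasts the two orderings for a single token; it is a standalone illustration, not the paper's router code.

```python
# Sketch: KeepTopK-before-Softmax (Mixtral-style) vs. Softmax-before-KeepTopK.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, top_k = 8, 2
logits = torch.randn(1, num_experts)                 # router logits for one token

# Mixtral-style: keep the Top-k logits first, then normalize only those.
vals, idx = logits.topk(top_k, dim=-1)
gates_topk_first = F.softmax(vals, dim=-1)           # weights sum to exactly 1.0

# Reversed order: softmax over all experts, then keep the Top-k probabilities.
probs = F.softmax(logits, dim=-1)
gates_softmax_first, _ = probs.topk(top_k, dim=-1)   # weights sum to less than 1.0

print(gates_topk_first.sum().item())     # 1.0 -> dense output preserved at init
print(gates_softmax_first.sum().item())  # < 1 -> dense output down-scaled at init
```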

Implementation and Computational Efficiency

  • The online upcycling approach is integrated into the NeMo framework: dense checkpoints are sharded according to the parallel training configuration, which avoids handling exceedingly large unsharded weight matrices across devices and eliminates extra cross-device weight copying. A simplified sketch of this per-rank expert initialization follows below.
  • From a computational standpoint, the upcycling process consumed roughly 11K GPU hours on 512 H100 GPUs using bfloat16 precision, compared to an estimated 1.6 million GPU hours that would have been required to train a similar MoE model from scratch using the complete dense pre-training dataset.
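
The snippet below sketches the idea behind online upcycling under expert parallelism: each expert-parallel rank builds only its local experts directly from the dense feed-forward weight shard it already holds, so full-size expert tensors never need to be materialized or copied across devices. Function and parameter names and the tiny stand-in shapes are hypothetical; the actual NeMo implementation differs in detail.

```python
# Illustrative sketch of online upcycling under expert parallelism (EP).
import torch

def upcycle_local_experts(dense_ffn_shard: dict, num_experts: int,
                          ep_rank: int, ep_size: int) -> dict:
    """Replicate this rank's dense-FFN weight shard into its local experts.

    `dense_ffn_shard` maps parameter names to the tensor-parallel shard of the
    dense checkpoint already resident on this GPU.
    """
    experts_per_rank = num_experts // ep_size
    local_experts = {}
    for local_id in range(experts_per_rank):
        global_id = ep_rank * experts_per_rank + local_id
        # clone() keeps experts independent so they can diverge during training.
        local_experts[global_id] = {name: w.clone() for name, w in dense_ffn_shard.items()}
    return local_experts

# Example: 8 experts over EP=8 -> one local expert per rank (tiny stand-in shapes).
shard = {"w_in": torch.randn(64, 32), "w_out": torch.randn(32, 64)}
print(list(upcycle_local_experts(shard, num_experts=8, ep_rank=3, ep_size=8).keys()))  # [3]
```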

Conclusion

The paper provides a comprehensive recipe for efficiently transforming pre-trained dense LLMs into high-capacity MoE architectures. By focusing on the strategic reuse of dense model parameters, meticulous tuning of parallelism configurations, and detailed ablation studies on capacity factors and routing operations, the work demonstrates that significant performance improvements on challenging downstream tasks can be achieved at a fraction of the computational expense. This makes the development of high-capacity models more accessible, thereby lowering the barrier to entry for research groups with limited compute resources.

Authors (7)
  1. Aditya Vavre (3 papers)
  2. Ethan He (5 papers)
  3. Dennis Liu (2 papers)
  4. Zijie Yan (10 papers)
  5. June Yang (3 papers)
  6. Nima Tajbakhsh (21 papers)
  7. Ashwath Aithal (12 papers)
Citations (1)