Yuan 2.0-M32: Mixture of Experts with Attention Router (2405.17976v2)

Published 28 May 2024 in cs.AI and cs.CL

Abstract: Yuan 2.0-M32, with a similar base architecture as Yuan-2.0 2B, uses a mixture-of-experts architecture with 32 experts of which 2 experts are active. A new router network, Attention Router, is proposed and adopted for a more efficient selection of experts, which improves the accuracy compared to the model with classical router network. Yuan 2.0-M32 is trained with 2000B tokens from scratch, and the training computation consumption is only 9.25% of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability on coding, math, and various domains of expertise, with only 3.7B active parameters of 40B in total, and 7.4 GFlops forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracy of 55.89 and 95.8 respectively. The models and source code of Yuan 2.0-M32 are released on GitHub.


Summary

  • The paper introduces an innovative Attention Router that selects expert subsets based on inter-expert correlations, improving accuracy by 3.8%.
  • The paper demonstrates competitive performance in tasks like coding, mathematics, and general knowledge using only 3.7 billion active parameters.
  • The paper reveals that the model’s training computation is only 9.25% of a similarly scaled dense network, offering significant efficiency gains.

Yuan 2.0-M32: Mixture of Experts with Attention Router

The paper "Yuan 2.0-M32: Mixture of Experts with Attention Router" by Shaohua Wu et al. presents a novel approach to enhancing Mixture of Experts (MoE) architectures in LLMs. This work introduces the Attention Router for expert selection, which significantly improves model performance and computational efficiency. The following provides an expert summary and analysis of the paper's content.

Model Overview

Yuan 2.0-M32 is derived from the Yuan 2.0-2B model and features a MoE architecture with 32 experts, of which 2 are active per token. The proposed Attention Router selects experts with an attention mechanism that accounts for the correlations between experts, rather than the classical router network that scores each expert independently. This approach yields an accuracy improvement of 3.8% over a model using the classical router. Yuan 2.0-M32 delivers competitive performance across domains such as coding, mathematics, and general knowledge with only 3.7 billion active parameters out of 40 billion total parameters.
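
To make the sparse-activation pattern concrete, here is a minimal PyTorch sketch of a top-2 MoE feed-forward layer with 32 experts and a classical linear router (an attention-style alternative is sketched in the Attention Router section below). The hidden sizes, expert definition, and dispatch loop are illustrative assumptions, not the Yuan 2.0-M32 implementation; only the 32-expert / 2-active routing pattern mirrors the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One feed-forward expert; sizes here are illustrative placeholders."""
    def __init__(self, d_model=512, d_ff=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class Top2MoE(nn.Module):
    """Sparse MoE layer: 32 experts, but only 2 are evaluated for each token."""
    def __init__(self, d_model=512, n_experts=32, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # classical linear router
        self.top_k = top_k

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = self.gate(x)                        # (n_tokens, n_experts)
        topv, topi = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topv, dim=-1)            # renormalise over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):               # dispatch tokens to their k-th expert
            for e in topi[:, slot].unique():
                mask = topi[:, slot] == e
                out[mask] += weights[mask][:, slot:slot + 1] * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(8, 512)
print(Top2MoE()(tokens).shape)   # torch.Size([8, 512])
```

Because only 2 of the 32 experts run per token, the compute per token tracks the active parameter count rather than the full 40B.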

Numerical Results and Benchmarks

The model's efficiency and accuracy are underscored by substantial numerical results:

  • Training Efficiency: The training computation consumption is only 9.25% of a similarly scaled dense model (a back-of-envelope check of this and the per-token FLOPs figures follows this list).
  • Performance Metrics: Yuan 2.0-M32 demonstrates strong performance on several benchmarks:
    • MATH: Achieves 55.89 accuracy, surpassing Llama3-70B’s 50.4.
    • ARC-Challenge: Scores 95.8 compared to Llama3-70B’s 93.3.
    • HumanEval (Code Generation): Attains 74.4 zero-shot accuracy, behind only DeepSeek-V2 and Llama3-70B.
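
The headline efficiency figures follow from simple ratios of active to total parameters. The back-of-envelope check below reproduces them, assuming the common rule of thumb that a dense forward pass costs roughly 2 FLOPs per parameter per token; that assumption is consistent with the 7.4 GFLOPs figure quoted in the abstract.

```python
# Back-of-envelope check of the efficiency figures quoted above.
active_params = 3.7e9     # active parameters per token (Yuan 2.0-M32)
total_params  = 40e9      # total parameters (Yuan 2.0-M32)
llama3_params = 70e9      # dense Llama3-70B, all parameters active

# Compute scales with *active* parameters, so a sparse model with 3.7B of 40B
# parameters active costs ~9.25% of an equally sized dense model.
print(f"active / total           = {active_params / total_params:.2%}")    # 9.25%

# Forward FLOPs per token ~ 2 * active parameters (rule-of-thumb assumption).
print(f"forward GFLOPs per token = {2 * active_params / 1e9:.1f}")          # 7.4
print(f"vs. Llama3-70B           = 1/{llama3_params / active_params:.0f}")  # 1/19
```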

Attention Router

The core innovation in Yuan 2.0-M32 is the Attention Router. Traditional routing methods in MoE architectures typically ignore the relationships between experts, selecting them independently based on a dot product between the token and each expert's feature vector (Shazeer et al., 2017). The Attention Router instead builds a coefficient matrix that captures inter-expert correlations, leading to more informed and effective expert selection. The architecture demonstrated a 3.8% reduction in test loss compared to the classical router.
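
Below is a minimal sketch of what an attention-style router can look like, written to illustrate the idea of a coefficient matrix over experts rather than to reproduce the paper's exact formulation. The learnable expert embeddings, the projection size `d_attn`, and the way token-expert affinities are mixed through the expert-correlation matrix are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRouter(nn.Module):
    """Illustrative attention-style router: expert selection is modulated by an
    inter-expert correlation matrix instead of independent token-expert dot products.
    Dimensions and projections are assumptions, not the paper's implementation."""
    def __init__(self, d_model=512, n_experts=32, d_attn=64, top_k=2):
        super().__init__()
        self.expert_embed = nn.Parameter(torch.randn(n_experts, d_attn) * 0.02)
        self.wq = nn.Linear(d_attn, d_attn, bias=False)
        self.wk = nn.Linear(d_attn, d_attn, bias=False)
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # token-expert affinity
        self.top_k = top_k

    def forward(self, x):                                        # x: (n_tokens, d_model)
        affinity = self.gate(x)                                  # (n_tokens, n_experts)
        q = self.wq(self.expert_embed)                           # (n_experts, d_attn)
        k = self.wk(self.expert_embed)
        # Coefficient matrix capturing correlations *between experts*.
        corr = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1) # (n_experts, n_experts)
        scores = affinity @ corr.t()                             # correlated routing scores
        topv, topi = scores.topk(self.top_k, dim=-1)
        return F.softmax(topv, dim=-1), topi                     # weights and expert indices

router = AttentionRouter()
weights, experts = router(torch.randn(4, 512))
print(weights.shape, experts.shape)   # torch.Size([4, 2]) torch.Size([4, 2])
```

The design point is that each token's affinity for one expert can raise or lower the scores of correlated experts, whereas a classical router treats the 32 scores as independent.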

Training and Data

Yuan 2.0-M32 is trained from scratch on 2,000 billion tokens using a combination of data parallelism and pipeline parallelism, avoiding tensor and optimizer parallelism. The pre-training and fine-tuning datasets are extensive and diversified, including bilingual data covering web-crawled content, academic texts, code repositories, and domain-specific datasets. This breadth of data contributes to the model's robust performance across multiple domains.
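
As a rough illustration of that parallel layout (not the paper's training code), the sketch below cuts a small layer stack into two pipeline stages and feeds micro-batches through them sequentially. In a real setup each stage would sit on its own device and be replicated across data-parallel ranks, with no layer split internally (no tensor parallelism); device names, layer counts, and sizes here are placeholders.

```python
import torch
import torch.nn as nn

devices = ["cpu", "cpu"]   # e.g. ["cuda:0", "cuda:1"] on real hardware (assumed placeholders)

layers = [nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True) for _ in range(8)]
stage0 = nn.Sequential(*layers[:4]).to(devices[0])   # pipeline stage 0: layers 0-3
stage1 = nn.Sequential(*layers[4:]).to(devices[1])   # pipeline stage 1: layers 4-7

def pipeline_forward(batch, n_microbatches=4):
    """Naive pipelined forward: micro-batches flow stage0 -> stage1 one after another.
    Real systems overlap micro-batches (e.g. a 1F1B schedule) and wrap replicas in DDP."""
    outputs = []
    for mb in batch.chunk(n_microbatches):
        h = stage0(mb.to(devices[0]))
        outputs.append(stage1(h.to(devices[1])))
    return torch.cat(outputs)

x = torch.randn(16, 32, 256)          # (batch, seq, d_model), placeholder sizes
print(pipeline_forward(x).shape)      # torch.Size([16, 32, 256])
```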

Implications and Future Developments

The introduction of the Attention Router in Yuan 2.0-M32 has significant theoretical and practical implications for the development of LLMs:

  • Theoretical: This work highlights the importance of considering expert correlations in MoE structures, potentially influencing future designs of routing mechanisms in sparse models.
  • Practical: The enhanced computational efficiency and accuracy make Yuan 2.0-M32 a viable option for deployment in real-world applications where computational resources are limited.

The paper's insights open avenues for further exploration in optimizing MoE architectures. Future developments may involve refining the attention mechanism, exploring adaptive routing strategies, and expanding the scope of the models to other complex tasks and languages.

Conclusion

In summary, the Yuan 2.0-M32 model represents a significant advancement in the efficient application of MoE structures within LLMs. By integrating the Attention Router, it achieves strong performance at low computational cost, with potential impact on both theoretical research and practical applications in AI. The release of the models and source code on GitHub encourages collaboration and further progress in optimizing MoE frameworks.

References

  1. Shazeer, N., Mirhoseini, A., Maziarz, K., et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". arXiv preprint arXiv:1701.06538.
  2. Wu, S., Zhao, X., Luo, J., et al. (2023). "Yuan 2.0: A Large Language Model with Localized Filtering-based Attention". arXiv preprint.
  3. DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model". arXiv preprint.
  4. Meta AI (2024). "Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date". Meta AI Blog.