Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (2411.02265v3)

Published 4 Nov 2024 in cs.CL and cs.AI

Abstract: In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large

Summary

  • The paper presents Hunyuan-Large, an open-source MoE model with 52B activated parameters that outperforms LLama3.1-70B and matches the far larger LLama3.1-405B across multiple benchmarks.
  • It employs advanced techniques such as synthetic data integration, dynamic routing, and expert-specific learning rates to optimize scalability and efficiency.
  • Experimental results show that Hunyuan-Large matches or surpasses significantly larger models on long-context processing, commonsense understanding, and complex reasoning tasks.

An Expert Review of "Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent"

The paper introduces Hunyuan-Large, an advanced open-source Transformer-based mixture of experts (MoE) model developed by the Tencent Hunyuan team. The model comprises 389 billion total parameters, of which 52 billion are activated per token, placing it among the largest models available in the open-source community. Its architecture supports sequences of up to 256K tokens, and it outperforms models such as LLama3.1-70B on several key benchmarks while performing comparably to the much larger LLama3.1-405B.

Key Attributes and Achievements

Hunyuan-Large's competence spans multiple domains, including language understanding and generation, mathematical reasoning, coding, and long-context tasks. Reported evaluations show superior performance over both dense and MoE baselines of comparable or larger size. In particular, the model demonstrates strong capabilities in commonsense understanding, mathematics, and coding, outperforming baselines on benchmarks such as MMLU, MATH, and CMMLU.

Technical innovations play a crucial role in the model's performance gains. These include large-scale synthetic data, a mixed expert routing strategy that combines a shared expert with routed specialized experts, key-value cache compression, and an expert-specific learning rate strategy. Together, these techniques balance load across experts, keep training efficient, and control computational cost, underpinning the model's scalability and efficiency. A minimal routing sketch is shown below.
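To make the routing idea concrete, the following is a minimal sketch of a shared-plus-routed MoE layer, assuming one always-active shared expert and top-1 routing over specialized experts. The class name, dimensions, and the per-expert loop are illustrative only; the released model additionally handles expert capacity and load balancing, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedRoutingMoE(nn.Module):
    """Shared expert applied to every token plus a top-1 routed specialized expert."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # Shared expert processes every token; specialized experts are gated.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)   # (num_tokens, n_experts)
        top_gate, top_idx = gates.max(dim=-1)       # top-1 expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):   # simple loop for clarity, not speed
            mask = top_idx == e
            if mask.any():
                routed[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return self.shared_expert(x) + routed

# Illustrative usage (sizes are placeholders, not the released configuration):
# layer = MixedRoutingMoE(d_model=4096, d_ff=16384, n_experts=16)
```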

Experimental Framework

The research explores scaling laws for MoE models, estimating how loss varies with the number of activated parameters and the volume of training data. These estimates guide the choice of model dimensions, data budget, and learning rate schedule, so that the final configuration delivers strong performance within practical computational constraints.
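For orientation, analyses of this kind typically fit a Chinchilla-style loss surface over the activated parameter count N and training token count D and then minimize it under a fixed compute budget. The parameterization below is the standard form from the scaling-law literature, shown for illustration rather than as the paper's fitted coefficients:

```latex
% Illustrative compute-optimal scaling-law form; E, A, B, \alpha, \beta are fit empirically.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \text{with compute budget } C \approx 6\,N D .
```

Minimizing L(N, D) under the budget constraint yields compute-optimal choices of N and D that scale as power laws of C, which is the kind of guidance the authors use when sizing activated parameters and training data.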

For pre-training, the authors assemble a large and diverse corpus combining natural and synthetic data, including nearly 1.5 trillion high-quality synthetic tokens intended to improve versatility and generalization.

Post-training involves a two-phase instruction alignment process. Supervised fine-tuning (SFT) on over a million high-quality data samples refines task-specific capabilities. Reinforcement learning from human feedback (RLHF) further aligns model outputs with human preferences, employing Direct Preference Optimization (DPO) to optimize the alignment process.
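As a concrete reference for the preference-alignment step, here is a minimal sketch of the standard DPO objective. The function name, argument shapes, and the beta value are illustrative assumptions; the loss itself is the published DPO formulation rather than Tencent's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the chosen response
    over the rejected one, measured relative to a frozen reference model.
    Inputs are summed token log-probabilities per response."""
    # Implicit rewards are scaled log-probability ratios against the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the log-sigmoid of the reward margin (i.e., minimize its negative).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```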

Implications and Future Directions

The implications of Hunyuan-Large are substantial for both practical applications and theoretical advances in model design. Releasing such a large-scale, high-performance MoE model gives the AI research community a valuable asset and opens opportunities for further exploration of long-context processing and complex reasoning tasks.

The paper suggests that combining high-quality synthetic data with careful model and training optimization can significantly enhance MoE models' capabilities. For future models, deeper investigation of specific parameter regimes and learning rate schedules could yield additional gains and a more direct path toward robust generalization and personalization.

In conclusion, Hunyuan-Large is a notable achievement in MoE architecture and training strategy, and it sets a substantial benchmark for future work on LLMs. The release of its code and checkpoints makes it a practical foundation for ongoing research and experimentation.
