Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (2411.02265v3)
Abstract: In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture-of-experts model, with a total of 389 billion parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks, including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms Llama 3.1-70B and performs comparably to the significantly larger Llama 3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture-of-experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
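To make the "mixed expert routing" idea behind the 389B-total / 52B-activated parameter split concrete, here is a minimal PyTorch sketch of an MoE layer that combines an always-active shared expert with top-k routed specialized experts. The expert count, hidden sizes, and top-k value below are illustrative assumptions for exposition, not Hunyuan-Large's actual configuration or implementation.

```python
# Minimal sketch of a "mixed" MoE routing layer: every token passes through a
# shared expert, while a router additionally dispatches it to its top-k
# specialized experts. All sizes (d_model, n_experts, top_k) are illustrative
# assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class MixedMoELayer(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.shared_expert = FeedForward(d_model, d_ff)            # always active
        self.experts = nn.ModuleList(
            FeedForward(d_model, d_ff) for _ in range(n_experts)   # routed experts
        )
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to individual tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        out = self.shared_expert(tokens)

        # The router scores each token; only the top-k specialized experts are
        # evaluated per token, which is what keeps the activated parameter
        # count far below the total parameter count.
        probs = F.softmax(self.router(tokens), dim=-1)
        weights, indices = probs.topk(self.top_k, dim=-1)

        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = MixedMoELayer()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

In this sketch, only the shared expert plus top_k of n_experts feed-forward blocks run for any given token, so compute scales with the activated subset rather than the full expert pool; production MoE systems replace the per-expert loop with batched dispatch and add load-balancing terms, which are omitted here for clarity.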