
JetMoE: Reaching Llama2 Performance with 0.1M Dollars (2404.07413v1)

Published 11 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, the JetMoE-8B demonstrates impressive performance, with JetMoE-8B outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while only activating 2B for each input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures have been detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE.

Overview of JetMoE: Reaching Llama2 Performance with $0.1M

The paper "JetMoE: Reaching Llama2 Performance with 0.1M Dollars" presents a comprehensive report on the development and evaluation of JetMoE-8B, an LLM trained under tight budget constraints that nonetheless achieves competitive performance against well-known models such as Llama2. The report focuses on the efficient training methodology and architectural optimizations used to build a cost-effective model that maintains high performance across a variety of benchmarks.

Introduction

The research addresses a critical issue in the development of LLMs: the growing computational and financial demands required to achieve state-of-the-art performance. JetMoE-8B uses a Sparsely-gated Mixture-of-Experts (SMoE) architecture to alleviate these demands. By activating only a subset of the total parameters during training and inference, this approach significantly reduces computational costs. JetMoE-8B applies sparse activation to both the attention and feedforward layers, activating only 2B of its 8B parameters per input token. This greatly reduces inference computation compared to dense models like Llama2-7B, which activate all of their parameters for every token: roughly 2B active parameters per token versus about 7B is consistent with the reported reduction of about 70% in inference computation.

Model Architecture

The architecture of JetMoE-8B is designed to maximize efficiency without compromising performance. It extends the sparse activation technique to both the attention and feed-forward layers, inspired by the ModuleFormer architecture. By doing so, the model efficiently manages computational resources, activating only necessary parameters per input token.

Mixture of Experts

In the JetMoE framework, the Mixture of Experts (MoE) layer is a central feature. Each MoE layer comprises multiple experts and a router to select the top-k experts for each input. The sparse activation reduces the computational load during both training and inference phases.
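The routing logic can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation; the module, names, and sizes are illustrative, and it shows only the core idea: a linear router scores the experts, the top-k scores are normalized into gates, and only the selected experts run on each token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparsely-gated MoE feedforward layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); a real implementation works on (batch, seq, d_model)
        logits = self.router(x)                       # (tokens, n_experts)
        gates, idx = logits.topk(self.k, dim=-1)      # keep only the top-k experts per token
        gates = F.softmax(gates, dim=-1)              # normalize gates over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

A production implementation would add a load-balancing auxiliary loss and batched expert dispatch; the loop form above is only meant to expose the routing mechanics.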

FeedForward and Attention Experts

Each feedforward expert is a standard two-layer MLP, while the attention experts follow the Mixture of Attention heads (MoA) design with rotary position embeddings (RoPE). Sharing the key and value projection matrices across attention experts further improves efficiency and training stability.
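A rough sketch of the shared key/value idea follows. It is an illustration rather than the released code, and it omits RoPE, multi-head splitting, and the router for brevity: each attention expert owns its query and output projections, while a single key/value projection is shared across all experts.

```python
import torch
import torch.nn as nn

class SharedKVAttentionExperts(nn.Module):
    """Illustrative attention-expert block with K/V projections shared across experts."""

    def __init__(self, d_model: int, d_head: int, n_experts: int):
        super().__init__()
        self.d_head = d_head
        # One shared key/value projection serves every attention expert.
        self.w_k = nn.Linear(d_model, d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        # Each expert keeps its own query and output projections.
        self.w_q = nn.ModuleList(nn.Linear(d_model, d_head, bias=False) for _ in range(n_experts))
        self.w_o = nn.ModuleList(nn.Linear(d_head, d_model, bias=False) for _ in range(n_experts))

    def expert_forward(self, x: torch.Tensor, expert: int) -> torch.Tensor:
        # x: (batch, seq, d_model); a router (not shown) decides which experts run per token.
        q = self.w_q[expert](x)               # expert-specific queries
        k, v = self.w_k(x), self.w_v(x)       # shared keys/values, computed once per layer
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        return self.w_o[expert](attn @ v)
```

Because keys and values are computed once and reused by every selected expert, the extra cost of adding attention experts is concentrated in the comparatively small query and output projections.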

Pretraining and Data Mixture

JetMoE-8B is pretrained on 1.25T tokens drawn from a mixture of open-source datasets spanning web documents, code, and mathematical content, including RefinedWeb, StarCoder, The Pile, Dolma, and others. Training follows a two-phase schedule: an initial phase on a broad data mix, followed by a second phase that increases the weight of high-quality data during the learning-rate decay.
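The two-phase idea can be expressed as a small configuration sketch. The source names below are taken from the report, but the weights and the switching rule are placeholders for illustration only, not the mixture actually used to train JetMoE-8B.

```python
# Hypothetical two-phase data mixture; weights are placeholders, not the paper's values.
phase1_mix = {"refinedweb": 0.5, "the_pile": 0.2, "dolma": 0.2, "starcoder": 0.1}
phase2_mix = {"refinedweb": 0.3, "the_pile": 0.1, "dolma": 0.1, "starcoder": 0.2,
              "high_quality_math_and_code": 0.3}   # upweighted during LR decay

def sampling_weights(step: int, decay_start: int) -> dict:
    """Switch to the high-quality-heavy mixture once the learning-rate decay phase begins."""
    return phase2_mix if step >= decay_start else phase1_mix
```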

The training was conducted using the Megatron framework with modifications to support MoA and z-loss. The infrastructure consisted of a cluster with 96 H100 GPUs spread across 12 nodes. Hyperparameters were selected based on empirical results from prior research and set to optimize both performance and computational efficiency.
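The z-loss mentioned here is the router stabilization term introduced in ST-MoE. A minimal sketch, assuming raw per-token router logits as input, looks like this; it is added to the language-modeling loss with a small coefficient.

```python
import torch

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """ST-MoE style z-loss: penalize large router logits to keep the gating softmax stable.

    router_logits: (tokens, n_experts) raw scores before the routing softmax.
    """
    z = torch.logsumexp(router_logits, dim=-1)   # log-partition per token
    return (z ** 2).mean()
```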

Model Alignment

JetMoE-8B-Chat is aligned through a two-step process comprising Distilled Supervised Fine-Tuning (dSFT) and Distilled Direct Preference Optimization (dDPO). dSFT involves instruction tuning with data distilled from a teacher model, while dDPO refines this by incorporating teacher model preferences into the reward function. This alignment ensures that JetMoE-8B-Chat achieves a high degree of relevance and coherence in its responses.
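The preference-optimization step uses the standard DPO objective, with the preference pairs distilled from a teacher rather than collected from humans. A minimal sketch, assuming per-sequence log-probabilities from the policy and a frozen reference model are already computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss on a batch of (chosen, rejected) response pairs."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between the chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```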

Evaluation

The evaluation compares JetMoE-8B with several leading models on the OpenLLM Leaderboard and on domain-specific benchmarks. Despite its lower computational budget, JetMoE-8B consistently matches or outperforms these models, with strong results on benchmarks such as HellaSwag, MMLU, and TruthfulQA, demonstrating the efficacy of its architecture and training regimen.

Implications and Future Work

This research underscores the potential for creating high-performance LLMs in a cost-effective manner. The adoption of the SMoE architecture proves that significant computational savings can be achieved without a considerable drop in model performance. The described methodologies and open-source nature of JetMoE-8B facilitate further research and collaboration across the AI community.

However, due to budget constraints, the paper lacks ablation experiments that could provide deeper insight into the contribution of each component. Future research could further optimize hyperparameters and data mixtures, potentially improving the performance and efficiency of subsequent models.

Conclusion

JetMoE-8B represents a significant stride toward democratizing access to advanced LLMs by presenting an efficient, open-source approach to training them. The detailed reporting of training parameters and data mixtures fosters reproducibility and further advances in the field. By balancing cost and performance effectively, JetMoE-8B paves the way for future research aimed at accessible and capable AI systems.

References (98)
  1. abacusai. Systemchat, 2024. URL https://huggingface.co/datasets/abacusai/SystemChat.
  2. ajibawa 2023. Code-290k-sharegpt, 2024. URL https://huggingface.co/datasets/ajibawa-2023/Code-290k-ShareGPT.
  3. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation, 2024.
  4. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  5. Llemma: An open language model for mathematics, 2023.
  6. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.
  7. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.
  8. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  7432–7439, 2020.
  9. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  10. Evaluating large language models trained on code, 2021.
  11. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  12. Deep reinforcement learning from human preferences, 2023.
  13. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  14. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  15. CogStack. OpenGPT: A framework for creating grounded instruction based datasets and training conversational domain expert Large Language Models (LLMs). https://github.com/CogStack/OpenGPT, 2023.
  16. CollectiveCognition. Collective cognition chatgpt conversations, 2023. URL https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-09-22.
  17. Ultrafeedback: Boosting language models with high-quality feedback, 2023.
  18. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
  19. Luigi Daniele and Suphavadeeprasit. Amplify-instruct: Synthetically generated diverse multi-turn conversations for effecient llm training. arXiv preprint arXiv:(coming soon), 2023. URL https://huggingface.co/datasets/LDJnr/Capybara.
  20. Databricks. Dbrx: Resources and code examples. https://github.com/databricks/dbrx, 2024.
  21. Enhancing chat language models by scaling high-quality instructional conversations, 2023.
  22. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp.  5547–5569. PMLR, 2022.
  23. Jon Durbin. airoboros: Customizable implementation of the self-instruct paper. https://github.com/jondurbin/airoboros, 2023.
  24. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.
  25. Megablocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5, 2023.
  26. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  27. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama.
  28. glaiveai. Glaive-code-assistant, 2023. URL https://huggingface.co/datasets/glaiveai/glaive-code-assistant.
  29. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024a.
  30. Direct language model alignment from online ai feedback, 2024b.
  31. Transcending runtime-memory tradeoffs in checkpointing by being fusion aware. Proceedings of Machine Learning and Systems, 5, 2023.
  32. Minicpm: Unveiling the potential of small language models with scalable training strategies, 2024.
  33. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  34. Mixtral of experts, 2024.
  35. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
  36. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
  37. LAION-AI. Open-Assistant: A chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically. https://github.com/LAION-AI/Open-Assistant, 2023.
  38. Platypus: Quick, cheap, and powerful refinement of llms, 2024.
  39. Camel: Communicative agents for ”mind” exploration of large language model society. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
  40. Starcoder: may the source be with you!, 2023b.
  41. Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023. URL https://huggingface.co/Open-Orca/SlimOrca.
  42. Let’s verify step by step. preprint arXiv:2305.20050, 2023.
  43. lm sys. FastChat: An open platform for training, serving, and evaluating large language model based chatbots. https://github.com/lm-sys/FastChat, 2023.
  44. Locutusque. Ultratextbooks, 2024. URL https://huggingface.co/datasets/Locutusque/UltraTextbooks.
  45. The flan collection: Designing data and methods for effective instruction tuning, 2023.
  46. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  47. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
  48. Wizardcoder: Empowering code large language models with evol-instruct, 2023.
  49. Orca-math: Unlocking the potential of slms in grade school math, 2024.
  50. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023a.
  51. Crosslingual generalization through multitask finetuning, 2023b.
  52. Orca: Progressive learning from complex explanation traces of gpt-4, 2023.
  53. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp.  1–15, 2021.
  54. Training language models to follow instructions with human feedback, 2022.
  55. Dense training, sparse inference: Rethinking training of mixture-of-experts language models, 2024.
  56. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
  57. Openwebmath: An open dataset of high-quality mathematical web text, 2023.
  58. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
  59. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
  60. Direct preference optimization: Your language model is secretly a reward model, 2023.
  61. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  62. Zero: Memory optimizations toward training trillion parameter models, 2020.
  63. Code llama: Open foundation models for code, 2024.
  64. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  65. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
  66. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  67. Moduleformer: Learning modular large language models from uncurated data. arXiv preprint arXiv:2306.04640, 2023.
  68. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  69. An empirical study of instruction-tuning large language models in chinese, 2023.
  70. Dolma: an open corpus of three trillion tokens for language model pretraining research, 2024.
  71. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  72. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  73. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  74. Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5.
  75. Teknium1. GPTeacher: A collection of modular datasets generated by GPT-4. https://github.com/teknium1/GPTeacher, 2023.
  76. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  77. The alignment handbook. https://github.com/huggingface/alignment-handbook, 2023a.
  78. Zephyr: Direct distillation of lm alignment, 2023b.
  79. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  80. Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints, 2023a.
  81. Scibench: Evaluating college-level scientific problem-solving abilities of large language models, 2023b.
  82. Self-instruct: Aligning language model with self generated instructions, 2022.
  83. Magicoder: Source code is all you need, 2023.
  84. xai-org. Grok-1 open release. https://github.com/xai-org/grok-1, 2024.
  85. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint, 2024.
  86. Wizardlm: Empowering large language models to follow complex instructions, 2023a.
  87. Some things are more cringe than others: Preference optimization with the pairwise cringe loss, 2023b.
  88. Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739, 2024.
  89. Metamath: Bootstrap your own mathematical questions for large language models, 2023.
  90. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193, 2012.
  91. Tinyllama: An open-source small language model, 2024a.
  92. Mixture of attention heads: Selecting attention heads per token. arXiv e-prints, pp.  arXiv–2210, 2022.
  93. Yifan Zhang. Stackmathqa: A curated collection of 2 million mathematical questions and answers sourced from stack exchange, 2024.
  94. Training language models with syntactic data generation, 2024b.
  95. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  96. Lmsys-chat-1m: A large-scale real-world llm conversation dataset, 2024a.
  97. Opencodeinterpreter: Integrating code generation with execution and refinement, 2024b.
  98. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.
Authors (4)
  1. Yikang Shen
  2. Zhen Guo
  3. Tianle Cai
  4. Zengyi Qin