MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (2404.06395v3)

Published 9 Apr 2024 in cs.CL and cs.LG

Abstract: The burgeoning interest in developing LLMs with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur in the WSD LRS. With WSD LRS, we are able to efficiently study the data-model scaling law without extensive retraining experiments on both the model and data axes, from which we derive a much higher compute-optimal data-model ratio than Chinchilla-optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM .

MiniCPM: Demonstrating the Efficiency and Scalability of Small Language Models

Introduction

The paper "MiniCPM: Unveiling the Potential of Small LLMs with Scalable Training Strategies" explores the field of Small LLMs (SLMs) as an alternative to the more commonly discussed LLMs. The authors bring to light the significant capabilities of MiniCPM, a family of models particularly the 1.2B and 2.4B non-embedding variant models, asserting their remarkable performance, which competes with larger counterparts ranging from 7B to 13B parameters. This paper emphasizes a scalable approach in training strategies, which can be beneficial for both model and data dimensions, setting a potential pathway for future research into larger models.

Model Wind Tunnel Experiment (MWTE)

The paper introduces the concept of Model Wind Tunnel Experiments (MWTE), aimed at exploring the limits of SLMs before transferring the learned insights to LLMs. The MWTE comprises extensive hyper-parameter optimization, a study of how the compute-optimal batch size scales, and an analysis of learning-rate stability across model sizes, among other factors. Such comprehensive small-scale testing, inspired by aerodynamic wind tunnel testing, is crucial for understanding the scalability and stability of SLMs and thereby informs the development strategy for larger models.
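
To illustrate the kind of extrapolation a wind tunnel experiment enables, the sketch below fits a power law between training loss and compute-optimal batch size from a handful of small proxy runs and extrapolates it to a larger scale. The data points, constants, and the helper `predict_batch_size` are illustrative assumptions, not numbers from the paper; only the general form bs ∝ L^(-k) follows the batch-size scaling analysis described above.

```python
import numpy as np

# Illustrative wind-tunnel results: for several small proxy runs we record
# the training loss reached and the batch size (in tokens) that minimized
# compute to reach it. These numbers are made-up placeholders, not paper data.
loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])
optimal_bs = np.array([0.5e6, 0.9e6, 1.6e6, 2.4e6, 3.5e6])

# Fit a power law  bs = a * L^(-k)  in log-log space.
slope, log_a = np.polyfit(np.log(loss), np.log(optimal_bs), 1)
k, a = -slope, np.exp(log_a)

def predict_batch_size(target_loss: float) -> float:
    """Extrapolate the batch size to use at a larger scale (lower loss)."""
    return a * target_loss ** (-k)

print(f"fit: bs ~ {a:.3g} * L^(-{k:.2f})")
print(f"predicted batch size at L=2.0: {predict_batch_size(2.0):.3g} tokens")
```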

Warmup-Stable-Decay Learning Rate Scheduler (WSD LRS)

One of the notable contributions of this research is the WSD learning rate scheduler, which divides training into a linear warmup phase, a long stable phase at a constant learning rate, and a short rapid decay phase, making it well suited to continuous training and domain adaptation. The scheduler exhibits distinctive training dynamics, with a sharp drop in loss concentrated in the decay phase. Because checkpoints taken during the stable phase can be decayed independently, the WSD LRS allows data-model scaling laws to be studied without retraining along both the model and data axes, offering an efficient alternative to traditionally compute-intensive approaches and surfacing training dynamics not captured by common schedulers such as cosine decay.
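
A minimal sketch of a WSD-style schedule is shown below, assuming linear warmup and an exponential decay to 10% of the peak learning rate; the exact decay form and hyperparameters used for MiniCPM may differ.

```python
def wsd_lr(step: int, peak_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int,
           final_ratio: float = 0.1) -> float:
    """Warmup-Stable-Decay learning rate: linear warmup to peak_lr, a long
    constant (stable) phase, then a short, rapid decay at the end."""
    if step < warmup_steps:
        # Linear warmup toward the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable phase: constant learning rate. Checkpoints saved here can
        # later be decayed separately for continued training or adaptation.
        return peak_lr
    # Decay phase: anneal exponentially toward final_ratio * peak_lr.
    progress = min((step - warmup_steps - stable_steps) / decay_steps, 1.0)
    return peak_lr * (final_ratio ** progress)


# Example: 1% warmup, 89% stable, 10% decay over 100k steps (illustrative split).
schedule = [wsd_lr(s, 1e-2, 1_000, 89_000, 10_000) for s in range(100_000)]
```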

MiniCPM Family: Diverse Applications and Scalability

The introduction of the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, exemplifies the diversity and scalability of SLMs. Each variant targets a different application area or technical challenge, from preference alignment via Direct Preference Optimization (DPO) and sparse mixture-of-experts scaling to long-context (128K-token) tasks. This diversity demonstrates not only the robustness of MiniCPM models but also their adaptability to a wide range of AI tasks, further reinforcing the potential of SLMs in practical applications.
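
For context on the preference-alignment variant, the sketch below implements the standard DPO objective (Rafailov et al.), which optimizes preferences directly without a reinforcement-learning loop. This is the general form of the loss, not MiniCPM-DPO's exact training recipe; the batch construction and `beta` value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: widen the policy's log-probability margin between
    chosen and rejected responses relative to a frozen reference model.
    Inputs are per-sequence sums of token log-probabilities."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy check with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
print(dpo_loss(*(torch.randn(4) for _ in range(4))).item())
```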

Implications and Future Directions

This research underlines a critical consideration in the AI field: the importance of exploring efficient and scalable training strategies for SLMs. The demonstrated efficiency of MiniCPM models suggests a reevaluation of the current focus on exponentially growing LLMs, advocating for a scientific and sustainable model scaling approach. Moreover, the successful application of WSD LRS introduces a promising direction for optimizing training strategies, potentially impacting future developments in both SLMs and LLMs.

Conclusion

The paper "MiniCPM: Unveiling the Potential of Small LLMs with Scalable Training Strategies" accentuates the untapped potential of SLMs for achieving remarkable performance on par with LLMs, highlighting the significance of efficient training methodologies. The scalability demonstrated through various MiniCPM variants suggests a broad applicability of SLMs, further advocating for their utility in research and practical deployments. This work paves the way for future explorations into more sustainable, efficient, and scientifically grounded approaches to model training and scaling within the AI community.

Authors (25)
  1. Shengding Hu
  2. Yuge Tu
  3. Xu Han
  4. Chaoqun He
  5. Ganqu Cui
  6. Xiang Long
  7. Zhi Zheng
  8. Yewei Fang
  9. Yuxiang Huang
  10. Weilin Zhao
  11. Xinrong Zhang
  12. Zheng Leng Thai
  13. Kaihuo Zhang
  14. Chongyi Wang
  15. Yuan Yao
  16. Chenyang Zhao
  17. Jie Zhou
  18. Jie Cai
  19. Zhongwu Zhai
  20. Ning Ding