
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (2401.02954v1)

Published 5 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The rapid development of open-source LLMs has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source LLMs with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Introduction

LLMs are transforming the landscape of AI, powering systems that handle tasks ranging from text summarization to complex code completion. Their development, largely based on decoder-only Transformers, leverages massive datasets for self-supervised pre-training, followed by supervised fine-tuning and reward modeling to better align the models with user intentions. Despite substantial progress, open-source models are still exploring how far these LLMs can be scaled to match or exceed the performance of closed, proprietary systems.

Pre-Training and Architecture Insight

DeepSeek LLM is an open-source project dedicated to scaling LLMs with a long-term perspective. The team built a pre-training dataset of 2 trillion tokens, primarily in English and Chinese, targeting diversity and informational density. The architecture largely follows existing successful designs, with refinements of their own such as a multi-step learning rate scheduler that supports efficient continued training. With model configurations of 7B and 67B parameters, the training infrastructure prioritizes overlapping communication with computation to improve resource utilization.
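
To make the scheduler idea concrete, below is a minimal sketch of a multi-step learning rate schedule with warmup. The warmup length, milestone fractions, and decay factors are illustrative assumptions, not the exact values used to train DeepSeek LLM; the point is that the rate stays constant for most of training and then drops in discrete steps, which makes intermediate checkpoints easy to reuse when training is continued on more data.

```python
# Minimal sketch of a multi-step learning-rate schedule with warmup.
# The warmup length, milestone fractions, and decay factors below are
# illustrative assumptions, not DeepSeek LLM's exact training settings.

def multi_step_lr(step: int, total_steps: int, max_lr: float,
                  warmup_steps: int = 2000) -> float:
    """Return the learning rate for a given training step."""
    if step < warmup_steps:
        # Linear warmup to the peak learning rate.
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps
    if progress < 0.8:      # constant phase for most of training
        return max_lr
    elif progress < 0.9:    # first discrete decay step
        return max_lr * 0.316
    else:                   # second discrete decay step
        return max_lr * 0.1
```

A cosine schedule could be dropped in instead; the summary highlights the multi-step variant because a long constant phase simplifies resuming and extending runs on a growing dataset.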

Scaling Laws and Model Optimization

A key contribution of the paper is its examination of scaling laws for LLMs. The researchers propose an empirical framework for identifying near-optimal hyperparameters, such as batch size and learning rate, across varying compute budgets. They also introduce a refined scaling-up strategy that treats non-embedding FLOPs per token as a more precise indicator of model scale. Notably, they find that data quality significantly influences the optimal scaling allocation: higher-quality datasets justify directing a larger share of the compute budget toward model size. This insight pushes the community to look beyond sheer enlargement toward strategic allocation of compute based on data quality.
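
As a rough illustration of how such compute-optimal rules are applied (not the paper's fitted law), the sketch below splits a compute budget C ≈ M · D into a model-scale term M (non-embedding FLOPs per token) and a data term D (training tokens) using power laws of the form M_opt = k_M · C^a and D_opt = k_D · C^b. The exponents and coefficients are placeholders; under the paper's finding, higher-quality data would correspond to a larger exponent on the model-scale side.

```python
# Illustrative sketch of applying compute-optimal scaling rules of the form
#   M_opt = k_M * C**a   and   D_opt = k_D * C**b,   with C ≈ M * D,
# where M is non-embedding FLOPs per token and D is the number of training tokens.
# The exponents and coefficients are placeholders, not the paper's fitted values.

def optimal_allocation(compute_budget: float,
                       a: float = 0.5, k_m: float = 0.1,
                       b: float = 0.5, k_d: float = 10.0) -> tuple[float, float]:
    """Split a compute budget C (in FLOPs) into model scale M and data size D."""
    m_opt = k_m * compute_budget ** a   # non-embedding FLOPs per token
    d_opt = k_d * compute_budget ** b   # training tokens
    return m_opt, d_opt

m, d = optimal_allocation(1e21)
print(f"model scale ≈ {m:.3g} FLOPs/token, data ≈ {d:.3g} tokens")
```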

Evaluation and Fine-Tuning

DeepSeek LLM's evaluation showcases its strength across a broad spectrum of benchmarks, with the 67B model excelling in coding, mathematics, and reasoning. The evaluation also includes a safety assessment to check that the model's responses adhere to ethical standards. The paper further details the fine-tuning approach: a two-stage process that balances the model's specialized knowledge against its conversational ability, followed by direct preference optimization (DPO). The resulting DeepSeek Chat models are formidable competitors in open-ended, helpfulness-oriented response generation.
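
For reference, below is a minimal sketch of the standard DPO objective that the alignment stage builds on; it is not the authors' training code, and the beta value and tensor names are assumptions. It presumes you have already computed summed token log-probabilities of the preferred (chosen) and dispreferred (rejected) responses under both the policy being trained and a frozen reference model, such as the SFT checkpoint.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the standard Direct Preference Optimization (DPO) loss.
# Inputs are summed log-probabilities of each response under the trainable
# policy and a frozen reference model; names and beta are assumptions.

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective; beta controls how far the policy may drift from the reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```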

Reflection and Future Work

While DeepSeek LLM charts a promising path in the open-source AI landscape, the authors acknowledge inherent limitations, such as knowledge being static after training and the potential for generating unreliable content. The team commits to continual advancement, with further improvements in dataset quality, language diversity, and alignment methodology on the horizon. Their efforts signal a commitment not merely to enhancing model capabilities but to ensuring these AI systems serve the broader community responsibly, effectively, and accessibly.

Authors (88)
  2. Xiao Bi (8 papers)
  3. Deli Chen (20 papers)
  4. Guanting Chen (19 papers)
  5. Shanhuang Chen (4 papers)
  6. Damai Dai (38 papers)
  7. Chengqi Deng (11 papers)
  8. Honghui Ding (4 papers)
  9. Kai Dong (15 papers)
  10. Qiushi Du (6 papers)
  11. Zhe Fu (22 papers)
  12. Huazuo Gao (9 papers)
  13. Kaige Gao (3 papers)
  14. Wenjun Gao (8 papers)
  15. Ruiqi Ge (3 papers)
  16. Kang Guan (6 papers)
  17. Daya Guo (37 papers)
  18. Jianzhong Guo (7 papers)
  19. Guangbo Hao (4 papers)
  20. Zhewen Hao (4 papers)