
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

(2401.02954)
Published Jan 5, 2024 in cs.CL , cs.AI , and cs.LG

Abstract

The rapid development of open-source LLMs has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Overview

  • DeepSeek LLM is an open-source endeavor to scale LLMs, focusing on long-term advancements in AI.

  • The project involved building a 2-trillion-token pre-training dataset and pairing a proven architecture with an efficient multi-step learning rate schedule.

  • Empirical scaling laws were studied to identify the best hyperparameters for performance across different compute budgets.

  • Evaluations show that the DeepSeek LLM excels in coding, mathematics, reasoning, and ethical response generation.

  • The work acknowledges limitations such as static knowledge and potential for unreliable content, with plans for continued improvements.

Introduction

LLMs are transforming the landscape of AI, powering systems that handle tasks ranging from text summarization to complex code completion. Their development, largely based on decoder-only Transformers, relies on massive datasets for self-supervised pre-training, followed by supervised fine-tuning and reward modeling to better align models with user intentions. Despite substantial progress, open-source models are still exploring how far scaling can take them toward matching or surpassing the performance of closed, proprietary systems.

Pre-Training and Architecture Insight

DeepSeek LLM is an open-source effort to scale LLMs carefully and with a long-term perspective. The team assembled a dataset of 2 trillion tokens, primarily in English and Chinese, targeting diversity and informational density. The architecture largely mirrors existing successful designs but adds the team's own insights, such as a multi-step learning rate scheduler that makes continued training more efficient. With model configurations of 7B and 67B parameters, the training infrastructure prioritizes overlapping communication with computation to improve resource utilization.
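To make the scheduling point concrete, here is a minimal sketch of a multi-step learning rate schedule of the kind described: warm up to a peak rate, hold it for most of training, then cut it in discrete steps near the end. The warmup length, milestone fractions, and decay factors (warmup_steps, milestones, factors) are illustrative assumptions, not the exact values used for DeepSeek LLM.

```python
# Minimal sketch of a multi-step learning rate schedule: hold the peak rate
# after warmup, then drop it in discrete steps late in training.
# Warmup length, milestones, and factors are illustrative assumptions.

def multi_step_lr(step: int, total_steps: int, peak_lr: float,
                  warmup_steps: int = 2000,
                  milestones=(0.8, 0.9),   # fractions of total training steps
                  factors=(0.316, 0.1)) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    progress = step / total_steps
    lr = peak_lr
    for frac, factor in zip(milestones, factors):
        if progress >= frac:
            lr = peak_lr * factor
    return lr


if __name__ == "__main__":
    total = 100_000
    for s in (1_000, 50_000, 85_000, 95_000):
        print(s, multi_step_lr(s, total, peak_lr=4.2e-4))
```

A step-wise schedule like this is convenient for continued training: a run can be resumed from a checkpoint taken before a decay milestone and extended on more data without re-planning the entire decay curve, unlike a cosine schedule tied to a fixed total step count.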

Scaling Laws and Model Optimization

A key contribution of this paper is its examination of scaling laws for LLMs. The researchers propose an empirical framework for identifying near-optimal hyperparameters, such as batch size and learning rate, across varying compute budgets. The study also introduces a refined scaling-up strategy that uses non-embedding FLOPs per token as a more precise indicator of model scale. They find that data quality significantly influences optimal scaling: higher-quality datasets justify allocating more of the compute budget toward model size expansion. This insight encourages the community to look beyond sheer enlargement toward strategic compute allocation based on data quality.
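As an illustration of the empirical approach, the sketch below fits a power law of the form B_opt ~ a * C^b to small-scale results by linear regression in log-log space and then extrapolates to a larger budget. The observation points, the fitted exponents, and the choice of batch size as the target hyperparameter are placeholders for exposition, not figures from the paper.

```python
# Minimal sketch of fitting an empirical scaling law for an "optimal"
# hyperparameter: take the best batch size found at each compute budget C
# and fit a straight line in log-log space. Data points are placeholders.
import numpy as np

# (compute budget C in FLOPs, best batch size found at that budget) -- made up
observations = [
    (1e17, 256), (1e18, 512), (1e19, 1152), (1e20, 2304),
]

log_c = np.log10([c for c, _ in observations])
log_b = np.log10([b for _, b in observations])

# Least-squares fit: log10(B_opt) = b * log10(C) + log10(a)
slope, intercept = np.polyfit(log_c, log_b, deg=1)
a, b = 10 ** intercept, slope
print(f"B_opt ~ {a:.3g} * C^{b:.3f}")

# Extrapolate the fitted law to a larger compute budget.
c_target = 3e21
print("predicted optimal batch size:", a * c_target ** b)
```

The same recipe applies to the learning rate or to the model-size/data split; the paper's point is that such fits only extrapolate reliably when the notion of model scale (here, non-embedding FLOPs per token) is measured consistently across runs.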

Evaluation and Fine-Tuning

DeepSeek LLM's evaluation showcases its strength across a broad spectrum of benchmarks, with the 67B model excelling in coding, mathematics, and reasoning. The evaluation also includes a safety assessment to ensure the model's responses adhere to ethical standards. The paper further details a two-stage fine-tuning process that balances the model's specialized knowledge against its conversational ability. The subsequent direct preference optimization solidifies the DeepSeek Chat models' effectiveness, making them strong competitors in open-ended, helpfulness-oriented response generation.
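For readers unfamiliar with that second alignment stage, the following is a minimal PyTorch sketch of the standard DPO objective: maximize the log-sigmoid of the reward margin between a chosen and a rejected response, measured relative to a frozen reference model. The tensor names and the beta value are illustrative assumptions and do not reflect DeepSeek's actual training code.

```python
# Minimal PyTorch sketch of the standard DPO loss: push the policy to prefer
# the "chosen" response over the "rejected" one relative to a frozen
# reference model. Names and beta are illustrative, not DeepSeek's code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of summed token log-probs for a full response."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin; small when chosen >> rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    batch = 4
    loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                    torch.randn(batch), torch.randn(batch))
    print(loss.item())
```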

Reflection and Future Work

While DeepSeek LLM carves a promising path in the open-source landscape of AI, it acknowledges inherent limitations, such as static knowledge post-training and the potential for generating unreliable content. The team is committed to continual advancement, with further improvements in dataset quality, language diversity, and alignment methodologies on the horizon. Their efforts signal a commitment not merely to enhance model capabilities but to ensure these AI systems serve the greater good responsibly and effectively while remaining accessible to the wider community.
