Introduction
LLMs are transforming the AI landscape, powering systems that handle tasks ranging from text summarization to complex code completion. Their development, largely based on decoder-only Transformers, combines self-supervised pre-training on massive datasets with subsequent supervised fine-tuning and reward modeling to better align with user intentions. Despite substantial progress, open-source models are still exploring how far these LLMs can be scaled to match or exceed the performance of closed, proprietary systems.
Pre-Training and Architecture Insight
DeepSeek LLM is an open-source effort to scale LLMs methodically, with the long-term development of such models in view. The team built a pre-training dataset of 2 trillion tokens, primarily in English and Chinese, with an emphasis on diversity and informational density. The architecture largely follows existing successful designs, adjusted with the team's own findings, such as a multi-step learning rate scheduler that keeps training efficient while making it easier to continue training from intermediate checkpoints. The models come in 7B and 67B parameter configurations, and the training infrastructure prioritizes overlapping communication with computation to improve hardware utilization.
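To make the scheduler concrete, here is a minimal Python sketch of a multi-step (step-decay) schedule of the kind described: a linear warmup, a long constant phase, and two discrete decay steps late in training. The warmup length, decay points, and decay factors below are illustrative assumptions rather than values copied from a released implementation; one practical appeal of this shape is that a checkpoint taken during the constant phase can be reused when continuing training with a larger token budget.

```python
# A minimal sketch of a multi-step learning rate schedule (illustrative values).

def multi_step_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
                  warmup_steps: int = 2000) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    progress = step / total_steps
    if progress < 0.8:
        # Constant phase: hold the peak learning rate for most of training.
        return peak_lr
    if progress < 0.9:
        # First decay step, here placed at 80% of training.
        return peak_lr * 0.316
    # Second decay step, here placed at 90% of training.
    return peak_lr * 0.1


if __name__ == "__main__":
    total = 100_000
    for s in (1_000, 50_000, 85_000, 95_000):
        print(s, round(multi_step_lr(s, total), 6))
```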
Scaling Laws and Model Optimization
A key contribution of the paper is its examination of scaling laws for LLMs. The researchers present empirical formulae for near-optimal hyperparameters, specifically batch size and learning rate, as functions of the compute budget. They also refine the scaling-up strategy itself, arguing that non-embedding FLOPs per token is a more precise indicator of model scale than parameter count. Notably, they find that data quality strongly influences the optimal scaling allocation: higher-quality datasets justify directing more of the compute budget towards model size rather than data volume. This insight pushes the community to look beyond sheer enlargement towards allocating compute strategically based on the quality of the data.
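A hedged sketch of how such a compute-allocation rule can be applied: given a FLOPs budget C, power-law fits predict the model scale M (measured in non-embedding FLOPs per token) and the number of training tokens D, with C ≈ M · D. The coefficients and exponents below are illustrative placeholders rather than the values fitted in the paper; on a higher-quality dataset, the model-scale exponent would be larger, shifting compute towards the model.

```python
# Illustrative compute allocation under a scaling law of the form
# M_opt = k * C^a (model scale), with D_opt = C / M_opt (training tokens).
# The coefficients and exponents are placeholders, not the paper's fitted values.

def optimal_allocation(compute_budget: float,
                       model_coef: float = 0.1,
                       model_exp: float = 0.52):
    """Split a FLOPs budget C into model scale M and data size D with C = M * D."""
    m_opt = model_coef * compute_budget ** model_exp  # non-embedding FLOPs per token
    d_opt = compute_budget / m_opt                    # training tokens
    return m_opt, d_opt


def near_optimal_hparams(compute_budget: float):
    """Placeholder power-law fits for batch size and learning rate vs. compute."""
    batch_size = 0.3 * compute_budget ** 0.33         # grows slowly with compute
    learning_rate = 0.3 * compute_budget ** -0.125    # shrinks slowly with compute
    return batch_size, learning_rate


if __name__ == "__main__":
    for c in (1e20, 1e21, 1e22):
        m, d = optimal_allocation(c)
        b, lr = near_optimal_hparams(c)
        print(f"C={c:.0e}: M≈{m:.2e} FLOPs/token, D≈{d:.2e} tokens, "
              f"batch≈{b:.2e}, lr≈{lr:.2e}")
```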
Evaluation and Fine-Tuning
DeepSeek LLM's evaluation demonstrates strong results across a broad spectrum of benchmarks, with the 67B model excelling in coding, mathematics, and reasoning. The evaluation also includes a safety assessment to verify that the model's responses adhere to ethical standards. The paper further details the fine-tuning approach, a two-stage supervised process that balances the model's specialized knowledge against its conversational abilities. A subsequent direct preference optimization stage solidifies the DeepSeek Chat models' effectiveness, making them strong competitors in open-ended, helpfulness-oriented response generation.
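For reference, below is a minimal PyTorch sketch of the standard direct preference optimization (DPO) objective used in that final alignment stage: the policy is nudged to prefer chosen over rejected responses relative to a frozen reference model. The beta value, tensor shapes, and dummy batch are illustrative assumptions; the paper does not publish this exact implementation.

```python
# A minimal sketch of the standard DPO loss over per-sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss given log-probs of chosen/rejected responses under policy and reference."""
    # How much more the policy prefers each response than the reference does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Dummy per-sequence log-probabilities for a batch of 4 preference pairs.
    pc = torch.tensor([-10.0, -12.0, -9.0, -11.0])
    pr = torch.tensor([-14.0, -13.0, -15.0, -12.0])
    rc = torch.tensor([-11.0, -12.0, -10.0, -11.0])
    rr = torch.tensor([-13.0, -13.0, -14.0, -12.0])
    print(dpo_loss(pc, pr, rc, rr).item())
```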
Reflection and Future Work
While DeepSeek LLM carves a promising path in the open-source AI landscape, the authors acknowledge inherent limitations, such as knowledge that remains static after training and the potential for generating unreliable content. The team plans continued advancement, with improvements to dataset quality, language coverage, and alignment methodology on the horizon. Their stated aim is not merely to enhance model capabilities but to ensure these AI systems serve the greater good responsibly and effectively while remaining accessible to the wider community.