
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2501.12948v1)

Published 22 Jan 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Authors (198)
  1. DeepSeek-AI (5 papers)
  2. Daya Guo (37 papers)
  3. Dejian Yang (11 papers)
  4. Haowei Zhang (17 papers)
  5. Junxiao Song (12 papers)
  6. Ruoyu Zhang (25 papers)
  7. Runxin Xu (30 papers)
  8. Qihao Zhu (27 papers)
  9. Shirong Ma (23 papers)
  10. Peiyi Wang (48 papers)
  11. Xiao Bi (8 papers)
  12. Xiaokang Zhang (42 papers)
  13. Xingkai Yu (9 papers)
  14. Yu Wu (196 papers)
  15. Z. F. Wu (6 papers)
  16. Zhibin Gou (15 papers)
  17. Zhihong Shao (20 papers)
  18. Zhuoshu Li (7 papers)
  19. Ziyi Gao (3 papers)
  20. Aixin Liu (4 papers)

Summary

The paper introduces DeepSeek-R1-Zero and DeepSeek-R1, reasoning models developed using large-scale Reinforcement Learning (RL). DeepSeek-R1-Zero is trained via RL without Supervised Fine-Tuning (SFT) and demonstrates emergent reasoning capabilities. DeepSeek-R1 incorporates multi-stage training and cold-start data before RL to enhance reasoning performance and address issues such as poor readability and language mixing, and achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. Additionally, the paper presents six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Key Contributions

  • Large-Scale RL without SFT: The paper demonstrates that LLMs can develop reasoning capabilities through pure RL, without relying on SFT. DeepSeek-R1-Zero exhibits self-verification, reflection, and the generation of long chains of thought (CoTs), validating that reasoning capabilities can be incentivized purely through RL.
  • Multi-Stage Training Pipeline: The paper introduces a pipeline for developing DeepSeek-R1, which incorporates two RL stages for discovering reasoning patterns and aligning with human preferences, and two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities.
  • Distillation to Smaller Models: The paper shows that reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to RL on small models. The distilled models, based on Qwen2.5 and Llama3, perform well on benchmarks. For instance, DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview.

Approach

The paper explores two primary approaches:

  1. DeepSeek-R1-Zero: This model applies RL directly to the base model without SFT data.
  2. DeepSeek-R1: This model applies RL starting from a checkpoint fine-tuned with thousands of long CoT examples and distills the reasoning capability to smaller dense models.

DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

The goal is to explore the potential of LLMs to develop reasoning capabilities without supervised data, focusing on self-evolution through a pure RL process. The base model used is DeepSeek-V3-Base, and the RL framework is Group Relative Policy Optimization (GRPO).

  • Group Relative Policy Optimization (GRPO): The policy model $\pi_{\theta}$ is optimized by maximizing the following objective:

    $\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G}\sum_{i=1}^G \left( \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\left(\pi_{\theta} \,\|\, \pi_{ref}\right) \right) \right],$

    where:

    • $q$ is the question
    • $P(Q)$ is the distribution of questions
    • $\{o_i\}_{i=1}^G$ are the outputs sampled from the old policy $\pi_{\theta_{old}}$
    • $A_i$ is the advantage, obtained by normalizing each output's reward within its group: $A_i = \frac{r_i - \text{mean}(\{r_1,\dots,r_G\})}{\text{std}(\{r_1,\dots,r_G\})}$
    • $\epsilon$ and $\beta$ are hyper-parameters
    • $\mathbb{D}_{KL}\left(\pi_{\theta} \,\|\, \pi_{ref}\right)$ is the KL divergence between the policy $\pi_{\theta}$ and a reference policy $\pi_{ref}$, estimated as

    $\mathbb{D}_{KL}\left(\pi_{\theta} \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1,$

    where $\pi_{ref}(o_i|q)$ and $\pi_{\theta}(o_i|q)$ are the probabilities of output $o_i$ given question $q$ under the reference policy and the policy being trained, respectively.

  • Reward Modeling: A rule-based reward system consisting of accuracy rewards and format rewards is used. The accuracy reward checks whether the response is correct, while the format reward requires the model to enclose its thinking process in `<think>` and `</think>` tags (a toy version of these rewards, together with the GRPO update, is sketched after this list).
  • Training Template: A straightforward template guides the base model to produce a reasoning process followed by the final answer.
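
To make this concrete, here is a minimal sketch (plain PyTorch, written for this summary rather than taken from the paper; the tag names, hyper-parameter values, and the sequence-level treatment of log-probabilities are illustrative assumptions) of how rule-based rewards, group-relative advantages, and the clipped GRPO objective could fit together:

```python
import re
import torch

def format_reward(response: str) -> float:
    """Rule-based format reward: 1.0 if the reasoning is wrapped in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Toy accuracy reward: exact match on whatever follows the reasoning block."""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO advantage: normalize the G rewards of one question to zero mean, unit std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_objective(logp_new: torch.Tensor,   # log pi_theta(o_i | q), shape (G,)
                   logp_old: torch.Tensor,   # log pi_theta_old(o_i | q)
                   logp_ref: torch.Tensor,   # log pi_ref(o_i | q)
                   rewards: torch.Tensor,    # r_i from the rule-based reward model
                   eps: float = 0.2,         # clip range (illustrative value)
                   beta: float = 0.04) -> torch.Tensor:  # KL coefficient (illustrative value)
    """Objective for one group of G sampled outputs; the trainer maximizes its mean."""
    adv = group_relative_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)                # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    log_ref_ratio = logp_ref - logp_new                   # log(pi_ref / pi_theta)
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0   # the paper's KL estimator
    return (surrogate - beta * kl).mean()
```

The key property is that advantages are computed relative to the other outputs sampled for the same question, so GRPO needs no separate critic/value model.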

DeepSeek-R1: Reinforcement Learning with Cold Start

To improve reasoning performance and train a user-friendly model, a four-stage pipeline is designed:

  1. Cold Start: A small amount of long CoT data is collected to fine-tune the model as the initial RL actor, using approaches such as few-shot prompting and human annotation. The output format is defined as |special_token|<reasoning_process>|special_token|<summary>.
  2. Reasoning-oriented Reinforcement Learning: After fine-tuning on the cold-start data, RL training is applied to enhance reasoning capabilities in coding, mathematics, science, and logic reasoning. A language consistency reward, calculated as the proportion of target-language words in the CoT, is introduced to mitigate language mixing (a toy version of this reward is sketched after this list).
  3. Rejection Sampling and Supervised Fine-Tuning: When reasoning-oriented RL converges, the resulting checkpoint is used to collect SFT data: reasoning prompts are curated and reasoning trajectories are generated via rejection sampling. Additional data is incorporated, some of which uses a generative reward model that feeds the ground truth and model predictions into DeepSeek-V3 for judgment.
  4. Reinforcement Learning for all Scenarios: A secondary RL stage is implemented to improve the model's helpfulness and harmlessness while refining its reasoning capabilities. The model is trained using a combination of reward signals and diverse prompt distributions.
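
The language consistency reward in stage 2 is described only as the proportion of target-language words in the CoT; a toy implementation under that definition (the word tokenization and the English/Chinese split are assumptions) could look like this:

```python
import re

def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of CoT 'words' written in the target language.

    Words are approximated as runs of Latin letters plus individual CJK
    characters; this tokenization is illustrative only.
    """
    tokens = re.findall(r"[A-Za-z]+|[\u4e00-\u9fff]", cot)
    if not tokens:
        return 0.0
    if target_lang == "en":
        hits = sum(t.isascii() for t in tokens)
    else:  # treat the target as Chinese for this sketch
        hits = sum(not t.isascii() for t in tokens)
    return hits / len(tokens)

# A mostly-English CoT containing a few Chinese characters scores below 1.0.
print(language_consistency_reward("First, 我们 factor the quadratic, then check both roots."))
```

In the paper this reward is summed directly with the reasoning accuracy reward to form the final reward for this stage.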

Distillation: Empowering Small Models with Reasoning Capability

To equip more efficient smaller models with reasoning capabilities, open-source models like Qwen and Llama are directly fine-tuned using the curated samples. The base models used are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. Only SFT is applied, without an RL stage.
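
Because distillation here is ordinary supervised fine-tuning on DeepSeek-R1-generated trajectories, a minimal sketch looks like standard causal-LM training. The snippet below uses Hugging Face Transformers with a hand-rolled loop; the example data, learning rate, and sequence length are assumptions for illustration, not the authors' recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical distillation corpus: prompts paired with DeepSeek-R1 reasoning traces.
samples = [
    {"prompt": "Solve: 2x + 3 = 11.", "response": "<think>2x = 8, so x = 4.</think> x = 4"},
]

base = "Qwen/Qwen2.5-Math-7B"  # one of the student base models listed above
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning rate is illustrative

model.train()
for sample in samples:
    text = sample["prompt"] + sample["response"] + (tok.eos_token or "")
    batch = tok(text, return_tensors="pt", truncation=True, max_length=4096)
    # Plain next-token loss on the teacher trajectory; real pipelines usually
    # mask the prompt tokens out of the labels.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```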

Experiment and Results

The models are evaluated on benchmarks such as MMLU, GPQA Diamond, SimpleQA, LiveCodeBench, Codeforces, and AIME 2024. Standard benchmarks are evaluated using prompts from the simple-evals framework.

DeepSeek-R1 Evaluation

DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3 on education-oriented knowledge benchmarks. It excels on FRAMES, a long-context-dependent QA task, and outperforms DeepSeek-V3 on the factual benchmark SimpleQA. DeepSeek-R1 also delivers impressive results on IF-Eval, AlpacaEval2.0, and ArenaHard. On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217.

Distilled Model Evaluation

Distilling DeepSeek-R1's outputs enables the smaller models to outperform non-reasoning models. For example, DeepSeek-R1-Distill-Qwen-7B outperforms GPT-4o-0513 across the board, and DeepSeek-R1-Distill-Qwen-14B surpasses QwQ-32B-Preview on all evaluation metrics.

The paper compares distillation and reinforcement learning. Distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on large-scale RL may not achieve the same performance. Applying RL to the distilled models yields further gains.

Unsuccessful Attempts

The paper discusses unsuccessful attempts, such as using Process Reward Models (PRM) and Monte Carlo Tree Search (MCTS). PRM suffers from the difficulty of defining fine-grained reasoning steps and from reward hacking. MCTS faces an exponentially larger token-level search space and the difficulty of training a fine-grained value model.

Conclusion and Future Work

The paper concludes that DeepSeek-R1-Zero represents a pure RL approach, while DeepSeek-R1 leverages cold-start data alongside iterative RL fine-tuning. The paper shows that reasoning capability can be distilled to small dense models effectively. Future research directions include enhancing general capabilities, addressing language mixing, improving prompt engineering, and improving performance on software engineering tasks.
