DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Published 22 Jan 2025 in cs.CL, cs.AI, and cs.LG | (2501.12948v2)

Abstract: General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by LLMs and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.

Summary

  • The paper's main contribution is the introduction of a reinforcement learning framework that directly improves LLM reasoning without relying on supervised fine-tuning.
  • It details a multi-stage training process using Group Relative Policy Optimization, demonstrating improved reasoning accuracy and progressively longer chain-of-thought responses.
  • Evaluations show that distilled models derived from DeepSeek-R1 deliver efficiency and performance comparable to larger, established reasoning models.

DeepSeek-R1: Harnessing Reinforcement Learning to Enhance LLM Reasoning Capabilities

Introduction

LLMs have improved dramatically in their ability to process and generate human-like language, narrowing the gap toward artificial general intelligence (AGI). A key remaining challenge is strengthening their reasoning capabilities, particularly through scaling computation at test time. Previous attempts based on supervised fine-tuning (SFT) and search algorithms have not matched the reasoning performance of models such as OpenAI's o1 series. The paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" introduces DeepSeek-R1, a generation of reasoning models trained with reinforcement learning (RL) without requiring supervised fine-tuning as a prerequisite.

Reinforcement Learning Model

DeepSeek-R1-Zero, the base reasoning model, is trained with pure RL, bypassing the typical SFT stage. Through this process it acquires reasoning abilities on its own, developing sophisticated behaviors such as self-reflection and verification, albeit with drawbacks in readability and language consistency. These limitations prompted further refinement, resulting in the creation of DeepSeek-R1.

Benchmark performance of DeepSeek-R1 is illustrated below.

Figure 1: Benchmark performance of DeepSeek-R1.

Implementation Approach

DeepSeek-R1-Zero was trained with a reinforcement learning framework based on Group Relative Policy Optimization (GRPO), which removes the need for a separate critic model by estimating the baseline from the scores of a group of responses sampled for the same prompt. This strategy markedly enhanced reasoning performance.
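
As a minimal, illustrative sketch (not the authors' implementation), the group-relative baseline at the core of GRPO amounts to normalizing each sampled response's reward by the mean and standard deviation of its group:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage for one group of responses to the same prompt:
    each reward is normalized by the group mean and standard deviation,
    so the group itself serves as the baseline instead of a learned critic."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled answers to one prompt, scored 1 (correct) or 0 (wrong).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[ 1. -1. -1.  1.]
```

Because the baseline comes from group statistics rather than a learned value network, no critic has to be trained alongside the policy, which reduces training cost.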

To mitigate the limitations seen in DeepSeek-R1-Zero, the researchers introduced a multi-stage training process for DeepSeek-R1. A small set of cold-start data was used to fine-tune the base model first, followed by reasoning-oriented RL, rejection sampling combined with supervised fine-tuning, and a final RL stage to align the model's outputs with human preferences.
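
The stages described above can be summarized in a compact, illustrative outline (stage names and fields are this summary's shorthand, not the authors' code):

```python
# Illustrative outline of the multi-stage DeepSeek-R1 pipeline.
PIPELINE = [
    {"stage": 1, "name": "cold-start SFT",
     "data": "small curated set of long, readable chain-of-thought examples"},
    {"stage": 2, "name": "reasoning-oriented RL (GRPO)",
     "rewards": ["accuracy", "format", "language consistency"]},
    {"stage": 3, "name": "rejection sampling + SFT",
     "data": "filtered RL samples mixed with general-purpose SFT data"},
    {"stage": 4, "name": "RL for all scenarios",
     "rewards": ["helpfulness", "harmlessness", "reasoning accuracy"]},
]

for stage in PIPELINE:
    print(stage["stage"], stage["name"])
```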

Throughout training, DeepSeek-R1-Zero showed a steady improvement in accuracy, as shown below.

Figure 2: AIME accuracy of DeepSeek-R1-Zero during training. For each question, 16 responses are sampled and the average accuracy is calculated.
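
For concreteness, the per-question averaging used in this figure and the majority-voting metric reported later (cons@16) can be computed as in the following sketch, using hypothetical numbers:

```python
from collections import Counter

def average_accuracy(samples_correct):
    """Per-question accuracy averaged over k sampled responses (pass@1 estimate)."""
    return sum(samples_correct) / len(samples_correct)

def majority_vote(answers):
    """Consensus answer: the most frequent final answer among the samples (cons@k)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: 16 sampled final answers to one AIME question.
answers = ["042"] * 9 + ["017"] * 4 + ["108"] * 3
correct = [a == "042" for a in answers]
print(average_accuracy(correct))  # 0.5625, this question's contribution to the curve
print(majority_vote(answers))     # "042", the answer scored under majority voting
```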

Moreover, over the course of RL training, DeepSeek-R1-Zero also learned to allocate more computation to reasoning at test time, producing progressively longer responses, as depicted in the figure below.

Figure 3: The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.

Evaluation and Distillation

Evaluation on reasoning, knowledge, and general-purpose benchmarks shows that DeepSeek-R1 achieves performance comparable to OpenAI's o1-1217 on a range of reasoning tasks, supporting the claim that RL-based training can close the gap with models trained on extensive human-annotated reasoning demonstrations. The open-sourcing of DeepSeek-R1 and its distilled variants creates new opportunities for deploying smaller yet capable reasoning models in practice.

Comparative Analysis

The paper offers insights into the effectiveness of distilling reasoning patterns into smaller models. Through experiments with dense models from the Qwen and Llama families, the authors show that distilled models not only outperform comparable non-reasoning models but also achieve stronger results than applying large-scale RL directly to the small models, at a fraction of the compute cost.
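
Distillation here is plain supervised fine-tuning on teacher-generated reasoning traces rather than RL. A minimal sketch of the data-preparation step, with hypothetical helper names and a toy verifier (the paper's filtering and formatting are more involved), might look like this:

```python
def build_distillation_set(teacher_samples, is_correct):
    """Keep only teacher (DeepSeek-R1) traces whose final answer verifies, and
    format them as SFT pairs for a smaller student model. `teacher_samples`
    holds (question, reasoning_trace, final_answer) tuples; `is_correct` is a
    rule-based checker such as exact match against a reference answer."""
    sft_pairs = []
    for question, trace, answer in teacher_samples:
        if is_correct(question, answer):
            target = f"<think>{trace}</think> <answer>{answer}</answer>"
            sft_pairs.append({"prompt": question, "completion": target})
    return sft_pairs

# Toy usage with a hypothetical reference-answer verifier.
refs = {"What is 2 + 2?": "4"}
samples = [("What is 2 + 2?", "2 plus 2 makes 4.", "4"),
           ("What is 2 + 2?", "Maybe 5?", "5")]
print(build_distillation_set(samples, lambda q, a: refs.get(q) == a))
```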

Future Work and Conclusion

The study suggests that future work could focus on expanding DeepSeek-R1's general capabilities, resolving language-consistency issues, and improving performance on software-engineering tasks. Overall, the work underlines the potential of reinforcement learning to produce more robust, reasoning-capable LLMs and marks a promising direction for further research.

Explain it Like I'm 14

What is this paper about?

This paper introduces two new AI models that are good at “thinking through” problems:

  • DeepSeek-R1-Zero
  • DeepSeek-R1

The main idea is to teach LLMs to reason better using reinforcement learning (RL)—a training method where the model tries things, gets rewarded for good behavior, and learns from that. The authors show that a model can learn powerful reasoning skills even without lots of human-written examples, and then improve readability and general usefulness with a small amount of carefully chosen data.

What questions are the researchers trying to answer?

The paper explores three simple questions:

  • Can an AI learn strong reasoning skills using only reinforcement learning, without first being taught with human-made examples?
  • If we add a small amount of high-quality “starter” examples, can we make the AI’s reasoning clearer, more readable, and even better?
  • Can we “teach” smaller, cheaper models to reason well by training them on the outputs of a larger reasoning model?

How did they train the models? (Methods explained simply)

Think of training the AI like coaching a student through practice problems:

  • Reinforcement Learning (RL): The model tries to solve problems (like math or coding). If it gets the answer right or follows the rules (like writing its “thinking” in a certain format), it earns points (rewards). Over time, it learns better ways to think and solve.
    • Accuracy rewards: Points for getting the final answer right.
    • Format rewards: Points for writing its reasoning in a clear structure, like putting the thinking between <think> ... </think> and the final result between <answer> ... </answer> (a toy version of these rule-based rewards is sketched in code after this list).
  • Group Relative Policy Optimization (GRPO): Imagine a group of attempts for the same question. Instead of hiring a separate “judge” model, the AI compares its own group of answers and learns from which ones scored better. This saves training cost but still teaches it which strategies work.
  • DeepSeek-R1-Zero (pure RL): The model starts with no special human examples. It just practices a lot with RL and learns reasoning by itself. It becomes very good but sometimes writes in a messy way (mixing languages, hard to read).
  • DeepSeek-R1 (multi-stage training): To fix the messy writing and improve general skills, they add a small “cold start” stage:

    1. Cold-start fine-tuning: A small set of human-friendly examples of long, clear reasoning. This helps the model learn a readable style.
    2. RL focused on reasoning: More practice on math, coding, science, and logic with rewards for accuracy and language consistency (keep the reasoning in the right language).
    3. Rejection sampling + supervised fine-tuning (SFT): They generate many solutions, keep the good ones, and also add data for general tasks like writing and Q&A. Then they train again to make the model helpful and coherent.
    4. RL for all scenarios: Final polishing with rewards that balance being helpful, safe, and still strong at reasoning.
  • Distillation (teaching smaller models): The big model (DeepSeek-R1) generates lots of good reasoning examples. Smaller models (like Qwen and Llama versions) are trained on those examples to “learn the style” and become much better at reasoning without expensive RL. This is like learning from a top student’s solved practice papers.
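
As a toy illustration of the rule-based accuracy and format rewards described in this list (the paper's actual checks are more involved, e.g. running test cases for code problems), the scoring might be sketched as:

```python
import re

def format_reward(text):
    """1.0 if the response wraps its reasoning in <think>...</think> and its
    final result in <answer>...</answer>, else 0.0 (rule-based, no model judge)."""
    ok = (re.search(r"<think>.+?</think>", text, re.S)
          and re.search(r"<answer>.+?</answer>", text, re.S))
    return 1.0 if ok else 0.0

def accuracy_reward(text, reference):
    """1.0 if the extracted final answer exactly matches the reference, else 0.0."""
    m = re.search(r"<answer>(.+?)</answer>", text, re.S)
    return 1.0 if m and m.group(1).strip() == reference else 0.0

response = "<think>2 + 2 = 4, and I double-checked it.</think> <answer>4</answer>"
print(format_reward(response), accuracy_reward(response, "4"))  # 1.0 1.0
```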

What did they find?

Here are the most important results and why they matter:

  • Pure RL works: DeepSeek-R1-Zero learned strong reasoning without any human-labeled reasoning data. For example, on a hard math test (AIME 2024), it improved from about 16% to 71% accuracy, and up to 86.7% with majority voting (picking the most common answer from many tries). It also started "thinking longer" on tough problems and developed behaviors like checking its own work, almost like a natural "aha moment."
  • Readability and general skills improve with a small cold start: DeepSeek-R1 (with a little curated training data before RL) became both smarter and easier to understand—cleaner reasoning, fewer language mix-ups, and better summaries. It reached performance close to OpenAI’s o1-1217 on math and coding and stayed competitive on knowledge tests.
  • Strong benchmark performance:
    • Math: DeepSeek-R1 scored around 79.8% on AIME 2024 and 97.3% on MATH-500—among the best.
    • Coding: On Codeforces (a real competitive programming platform), it reached a rating of 2029, outperforming roughly 96% of the human participants it was compared against.
    • Knowledge and writing: It did very well on exams like MMLU and in writing tests (AlpacaEval and ArenaHard), showing it’s not only good at math but also at general tasks.
  • Distillation makes small models powerful:
    • Smaller models trained on DeepSeek-R1’s outputs became strong reasoners. For example, a 14B distilled model beat a well-known 32B open-source model (QwQ-32B-Preview) on key benchmarks.
    • The 32B and 70B distilled models reached near state-of-the-art results among dense (non-MoE) open models.
  • Practical lessons:
    • Reward models judging the “process” can cause “reward hacking” (the model learns to game the scoring rather than truly reason), so the authors leaned on rule-based checks and careful training.
    • Search-based methods like MCTS (used in games like Go) didn’t scale well for language reasoning because the “search space” of text is too big and hard to measure step-by-step.

Why does this matter?

  • RL can teach AI to reason: This shows that an AI can develop complex reasoning skills by practicing and getting feedback, even without huge sets of human-annotated examples.
  • Clearer, safer, more helpful AI: With a small amount of human guidance and careful rewards, the model’s thinking becomes readable and aligned with what users want.
  • Better, cheaper models for everyone: Distillation means strong reasoning can be shared with smaller models that are cheaper to run, helping researchers, students, and developers.
  • Open-source impact: The authors released models and checkpoints (from 1.5B up to 70B) so the community can build on this work.

Simple takeaways and future impact

  • You can teach a model to think better just by rewarding it for good problem-solving and clear explanations.
  • Adding a small starter set of good examples makes the model’s thinking easier to read and improves general usefulness.
  • Big models can “teach” small models, spreading advanced reasoning more widely.
  • This approach could improve AI tutors, coding assistants, math solvers, research helpers, and any tool that benefits from careful, step-by-step thinking.
  • Future models may push reasoning further by combining smarter RL strategies, better reward design, and improved safety and clarity—bringing AI closer to truly reliable, general problem-solving.
