100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models
(2505.00551v3)
Published 1 May 2025 in cs.CL
Abstract: The recent development of reasoning LLMs (RLMs) represents a novel evolution in LLMs. In particular, the recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of LLMs. However, the implementation details of the released models, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models, have not been fully open-sourced by DeepSeek. As a result, many replication studies have emerged, aiming to reproduce the strong performance of DeepSeek-R1 through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, and have yielded various valuable insights. In this report, we summarize recent replication studies to inspire future research. We primarily focus on SFT and RLVR as the two main directions, introducing the details of data construction, method design, and training procedures of current replication studies. Moreover, we distill key findings from the implementation details and experimental results reported by these studies. We also discuss additional techniques for enhancing RLMs, highlighting the potential for expanding the application scope of these models and the challenges that remain in their development. With this survey, we aim to help researchers and developers of RLMs stay up to date with the latest advancements and to inspire new ideas that further enhance RLMs.
This paper, "100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning LLMs" (Zhang et al., 1 May 2025), provides a comprehensive review of the research efforts inspired by the release of DeepSeek-R1, a reasoning LLM (RLM) that demonstrates explicit, step-by-step reasoning processes. The authors highlight that DeepSeek-R1's strong performance, particularly from its distilled smaller models, sparked significant interest in the research community despite the lack of full open-sourcing regarding its training details, especially for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Verifiable Rewards (RLVR).
The survey focuses on summarizing replication studies that have attempted to reproduce DeepSeek-R1's capabilities using publicly available data and similar training procedures. It structures its review around the two main training paradigms explored: SFT and RLVR, and discusses future directions for RLMs.
Supervised Fine-Tuning (SFT) for Reasoning LLMs
The paper details how SFT is used to train models to mimic the high-quality reasoning traces generated by powerful teacher models such as DeepSeek-R1.
SFT Datasets: Replication studies primarily construct SFT datasets by collecting math, coding, and other reasoning problems from existing benchmarks or web sources. These datasets are then curated through rigorous filtering (deduplication, rejection sampling) and verification of Chain-of-Thoughts (CoTs) and solutions using domain-specific tools (like Math Verify) or LLM judges. While most datasets focus on math and coding, some, like AM [zhao202514millionopensourcedistilled] and the original DeepSeek-R1 data, include broader reasoning and non-reasoning tasks. The paper notes that datasets vary in token length distribution, with some skewed towards shorter sequences and others containing longer, more complex examples. It also highlights issues like data contamination across popular reasoning benchmarks and the complex cross-referencing structure among math reasoning datasets (illustrated in Figure 3).
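As a deliberately simplified illustration of this curation pipeline, the following Python sketch deduplicates questions and keeps only samples whose final answer passes a verification check; the `verify_answer` helper is a hypothetical stand-in for tools like Math Verify or an LLM judge, not the pipeline of any specific study.

```python
# Hypothetical sketch of SFT data curation: exact-match deduplication plus
# a simple answer check standing in for tools like Math Verify or an LLM judge.
from typing import Iterable

def normalize(text: str) -> str:
    """Crude normalization; real pipelines use symbolic checkers (e.g. Math Verify)."""
    return " ".join(text.strip().lower().split())

def verify_answer(predicted: str, reference: str) -> bool:
    # Placeholder check; replication studies verify CoTs/answers with
    # domain tools (Math Verify, code execution) or LLM judges.
    return normalize(predicted) == normalize(reference)

def curate_sft_data(samples: Iterable[dict]) -> list[dict]:
    """Keep one copy of each question whose distilled CoT ends in a verified answer."""
    seen, kept = set(), []
    for s in samples:                    # s: {"question", "cot", "final_answer", "reference"}
        key = normalize(s["question"])
        if key in seen:                  # deduplication
            continue
        seen.add(key)
        if verify_answer(s["final_answer"], s["reference"]):   # rejection sampling
            kept.append(s)
    return kept
```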
SFT Training and Performance: The standard SFT objective minimizes the negative log-likelihood of the generated CoT given the question. Table 2 compares the performance of various SFT-trained models on benchmarks like AIME24/25 and MATH500. Key observations include that strong performance can be achieved with smaller, high-quality datasets (LIMO [ye2025limoreasoning], S1k-1.1 [muennighoff2025s1simpletesttimescaling]), and that fine-tuning from instruct models might be more efficient than from base models. Common training practices involve adjusting RoPE scaling and the maximum context length for long contexts, using learning rates around 1.0×10⁻⁵ to 5.0×10⁻⁵, and employing sequence packing for efficiency.
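The masked negative log-likelihood objective described above can be sketched as follows. This is a minimal PyTorch illustration that assumes per-sample prompt lengths are available; it is not the training code of any particular study and omits packing and RoPE adjustments.

```python
# Minimal sketch of the SFT objective: negative log-likelihood of the CoT + answer
# tokens given the question, with prompt tokens excluded from the loss.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_lens: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); input_ids: (B, T); prompt_lens: (B,) number of prompt tokens per sample."""
    labels = input_ids.clone()
    positions = torch.arange(input_ids.size(1), device=input_ids.device)
    labels[positions.unsqueeze(0) < prompt_lens.unsqueeze(1)] = -100   # ignore prompt tokens
    # Shift so each position predicts the next token.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```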
Reinforcement Learning from Verifiable Rewards (RLVR) for Reasoning LLMs
The paper presents RLVR as a critical step for enhancing reasoning, particularly for achieving the "aha moment" or self-verification capabilities seen in models like DeepSeek-R1-Zero.
RL Datasets: The success of RLVR heavily relies on high-quality, verifiable data. Table 1 lists various open-source datasets curated for RL training, primarily focusing on math and coding. Curation involves selecting data resources, constructing verifiable questions and answers (often requiring strict formatting and rigorous verification via execution or specific tools), meticulous cleaning (removing unsolvable, unverifiable, or noisy samples), de-duplication, and decontamination against benchmarks. Some studies employ curriculum learning based on data difficulty, filtering for problems with moderate model pass rates to maximize learning opportunity.
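A hedged sketch of the pass-rate-based difficulty filtering mentioned above: `sample_fn`, `verify_fn`, and the thresholds are illustrative assumptions rather than values taken from any particular study.

```python
# Hypothetical difficulty filter for RLVR data: keep problems the current policy
# solves at a moderate rate, so rollouts carry a non-zero learning signal.
def filter_by_pass_rate(problems, sample_fn, verify_fn, n_rollouts=8,
                        min_rate=0.125, max_rate=0.875):
    """sample_fn(question, n) -> list of model answers; verify_fn(answer, reference) -> bool.
    Both are assumed callables; the thresholds are illustrative only."""
    kept = []
    for p in problems:                                   # p: {"question", "reference"}
        answers = sample_fn(p["question"], n_rollouts)
        rate = sum(verify_fn(a, p["reference"]) for a in answers) / n_rollouts
        if min_rate <= rate <= max_rate:                 # drop always-solved / never-solved items
            kept.append({**p, "pass_rate": rate})
    return kept
```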
RL Components:
Algorithms: Replication studies explore various RL algorithms and their variants (summarized in Table 3). The core methods are REINFORCE, PPO, and GRPO. The paper provides a unified theoretical framework, explaining policy gradient estimation and the role of techniques such as reward normalization, advantage estimation (GAE), importance sampling clipping, and KL-divergence penalties. Variants like DAPO [DAPO], Dr. GRPO [DrGRPO], CPPO [lin2025cppo], GPG [chu2025gpg], VC-PPO [VC-PPO], and VAPO [yuyue2025vapoefficientreliablereinforcement-vapo] are discussed, highlighting their motivations (e.g., improving efficiency, addressing instability, handling biases, enhancing variance reduction); a simplified GRPO-style update is sketched after this list.
Rewards: Rule-based outcome rewards are preferred to minimize reward hacking. These typically include Accuracy Rewards (correct/incorrect), Format Rewards (penalizing deviations from the desired structure), and Length Rewards (influencing output verbosity). While accuracy rewards are fundamental, the necessity of format rewards is debated, and length rewards are sometimes used to encourage longer CoTs for difficult problems or brevity for easy ones; a toy rule-based reward is sketched after this list.
Sampling Strategies: Techniques like curriculum learning on difficulty and various forms of rejection sampling (filtering out zero-advantage sample groups, history resampling) are used to improve sample efficiency and training stability.
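The rule-based outcome rewards described under "Rewards" above can be illustrated with a toy example. The `<think>`/`\boxed{}` format, the weights, and the `verify_fn` interface are assumptions for illustration, not the reward used by any specific replication.

```python
# Illustrative rule-based outcome reward combining accuracy and format terms;
# the answer-extraction pattern and the weights are assumptions, not from the paper.
import re

def rule_based_reward(response: str, reference: str, verify_fn) -> float:
    """verify_fn(predicted, reference) -> bool, e.g. a Math Verify-style checker."""
    fmt_ok = bool(re.search(r"<think>.*</think>.*\\boxed\{.*\}", response, flags=re.S))
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    predicted = match.group(1) if match else ""
    accuracy = 1.0 if predicted and verify_fn(predicted, reference) else 0.0
    format_bonus = 0.1 if fmt_ok else -0.1    # format rewards are optional and debated
    return accuracy + format_bonus
```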
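The following minimal sketch illustrates the GRPO-style update referenced under "Algorithms": group-normalized advantages, a PPO-style clipped importance ratio, and an optional KL penalty toward a reference policy. It operates on sequence-level log-probabilities for brevity; real implementations work per token and use more careful KL estimators.

```python
# Sketch of a simplified GRPO-style loss for one group of rollouts from the same prompt.
import torch

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor, logp_ref: torch.Tensor,
              rewards: torch.Tensor, clip_eps: float = 0.2, kl_coef: float = 0.01) -> torch.Tensor:
    """All tensors have shape (G,): one sequence-level log-prob / reward per rollout."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # group-normalized advantage
    ratio = torch.exp(logp_new - logp_old)                       # importance sampling ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()
    kl = (logp_new - logp_ref).mean()                            # crude KL estimate to the reference policy
    return policy_loss + kl_coef * kl
```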
RLVR Analysis and Discussions: Table 4 compares the math reasoning performance of various RL-trained models. Insights from replication efforts suggest that careful data curation regarding quantity, diversity, and difficulty is key, and that data cleaning is critical to ensure verifiability. While the RL algorithms themselves differ little in theory, much of the engineering effort has gone into training stability. RL is shown to be effective across model sizes (1.5B to 32B) and model types (base and R1-distilled). Maximum response length and length-based curriculum learning are important for long-CoT reasoning. The effectiveness of KL-loss regularization is debated: some studies find that it restricts exploration and is not essential for large-scale training, while others retain it with positive results.
RLVR on Other Tasks: The paper explores extending RLVR beyond math and coding to tasks with verifiable outcomes. Examples include Logical Reasoning (Countdown, Sudoku, Deductive Puzzles), Application-oriented Tasks (GitHub issue fixing, Machine Translation, Multi-hop QA using RAG, Chemistry tasks), and even Exploration Beyond Supervision (generating poetry, discovering sorting algorithms). These applications demonstrate RLVR's potential to foster complex reasoning and even knowledge discovery by learning from verifiable feedback.
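As an example of a verifiable outcome beyond math benchmarks, a Countdown-style checker can serve directly as the reward: it accepts an arithmetic expression only if it uses each given number exactly once and evaluates to the target. The function below is a hypothetical illustration, not the verifier used in the cited works.

```python
# Hypothetical verifiable reward for a Countdown-style task.
import ast
import re

def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    if used != sorted(numbers):                 # every given number used exactly once
        return 0.0
    try:
        # Evaluate with builtins disabled; a production verifier would also
        # whitelist the AST node types it allows.
        value = eval(compile(ast.parse(expression, mode="eval"), "<expr>", "eval"),
                     {"__builtins__": {}}, {})
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

For instance, `countdown_reward("(100 - 4) * 25 / 10", [100, 4, 25, 10], 240)` returns 1.0, while any expression that skips or reuses a number, or misses the target, returns 0.0.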
More Directions for Reasoning LLMs
The survey concludes by discussing emerging research areas and challenges.
Alternative Approaches:
Reward Modeling: Moving beyond simplistic outcome rewards to capture intermediate steps. Process-level Reward Modeling (PRM) [xiong2024watch, song2025prmbench] provides step-level feedback. Variants like rStar-Math [guan2025rstar] use process preference models and self-evolution. PRIME [cui2025process] proposes implicit PRM learned from outcome labels.
Preference Optimization: DPO [Rafailov2023DirectPO] and its variants are explored as computationally cheaper alternatives to online RL methods like PPO/GRPO. Studies like EXAONE Deep [research2025exaone], Light-R1 [wen2025light], Iterative DPO [tu2025enhancing], RedStar [xu2025redstar], and DPO-R1 [zhang2025dpor1] apply DPO to reasoning, showing effectiveness but sometimes lagging slightly behind PPO.
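For reference, the standard DPO objective these studies build on can be sketched as follows; this is a minimal formulation following Rafailov et al., with batching and log-probability computation assumed to be handled elsewhere.

```python
# Minimal sketch of the DPO loss on reasoning preference pairs (chosen vs. rejected CoT).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each tensor holds sequence log-probs of shape (B,), under the policy
    and a frozen reference model; beta is illustrative."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```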
Generalizability: Reasoning models show potential for generalization to out-of-distribution tasks. Continual pre-training, SFT (with high-quality data like LIMO), and especially RL are found to enhance generalization beyond memorization, even transferring abilities across languages and modalities. However, concerns are raised about potential trade-offs and limitations of RL on smaller models or certain metrics.
Safety: Reasoning LLMs introduce new safety challenges, including reward hacking, jailbreaking (shown effective on DeepSeek-R1), and overthinking (leading to increased costs). Research explores mitigating reward hacking, improving safety alignment (though it can impact reasoning performance), and developing reasoning-based safeguards.
Multimodal and Multilingual: Developing RLMs for multimodal (visual, audio, etc.) and multilingual contexts faces challenges. Multimodal models often have weaker reasoning than unimodal ones, and multilingual performance is limited by resource availability. Future work aims to improve training strategies and generalize reasoning capabilities across modalities and languages.
In conclusion, the survey provides a comprehensive overview of the progress and challenges in replicating and advancing DeepSeek-R1's reasoning capabilities. It emphasizes the critical roles of high-quality SFT data and verifiable RLVR data, explores various algorithmic and reward design choices, and highlights promising future directions in reward modeling, preference optimization, generalizability, safety, and multimodal/multilingual applications. The survey underscores the active research landscape focused on making RLMs more capable, reliable, and applicable to a wider range of real-world problems.