
DeepSeek-R1-Distill Model Series

Updated 1 December 2025
  • The DeepSeek-R1-Distill Model Series is a family of dense, open-source LLMs distilled via supervised fine-tuning on roughly 800K chain-of-thought samples generated by an RL-optimized teacher.
  • They employ a multi-stage distillation process using cross-entropy, KL divergence, and entity-aware penalties to refine performance across diverse domains.
  • The models achieve near state-of-the-art accuracy on reasoning, STEM, biomedical, and sentiment tasks while enabling efficient deployment through quantization and LoRA adaptations.

The DeepSeek-R1-Distill Model Series is a family of dense, open-source LLMs distilled from the DeepSeek-R1 teacher, an RL-optimized Mixture-of-Experts model. The student models inherit the teacher's enhanced reasoning capabilities through supervised fine-tuning on high-quality reasoning traces, balancing near-state-of-the-art accuracy, compute efficiency, and practical deployment features across domains including general reasoning, STEM, biomedicine, and safety-critical applications (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Huang et al., 3 Feb 2025).

1. Model Family, Architecture, and Distillation Pipeline

The R1-Distill series spans multiple parameter scales and underlying architectures. All variants leverage publicly available Transformer backbones—primarily Qwen2.5 and Llama 3.x—refined via supervised distillation:

| Model | Params | Layers | Hidden Dim | Heads |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5 B | 24 | ~2,048 | 16 |
| DeepSeek-R1-Distill-Qwen-7B | 7 B | 32 | ~4,096 | 32 |
| DeepSeek-R1-Distill-Llama-8B | 8 B | 32 | ~4,096 | 32 |
| DeepSeek-R1-Distill-Qwen-14B | 14 B | 48 | ~6,144 | 48 |
| DeepSeek-R1-Distill-Qwen-32B | 32 B | 64 | ~8,192 | 64 |
| DeepSeek-R1-Distill-Llama-70B | 70 B | 80 | 12,288 | 96 |

Source: (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Zhan et al., 1 Mar 2025).
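
The distilled checkpoints are published as standard Hugging Face causal LM repositories (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-7B). The snippet below is a minimal loading-and-inference sketch using the transformers library; the generation settings and prompt are illustrative assumptions, not a configuration prescribed by the cited papers.

```python
# Minimal sketch: load a distilled checkpoint and sample a chain-of-thought answer.
# Assumes the public Hugging Face repo id and a single CUDA device; adjust as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place weights on available GPU(s)
)

# R1-style students emit their reasoning inside <think> ... </think> before the answer.
prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```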

The distillation follows a sequence:

  • The DeepSeek-R1 teacher is first optimized with reinforcement learning and then used to generate roughly 800K curated chain-of-thought samples.
  • The student backbones (Qwen2.5 and Llama 3.x) are supervised fine-tuned on these traces with token-level cross-entropy, optionally augmented by teacher-student KL divergence and domain- or entity-aware penalties.
  • Deployment-oriented steps such as quantization and LoRA adaptation follow as needed (DeepSeek-AI et al., 22 Jan 2025, Zhang et al., 25 Apr 2025, Zhan et al., 1 Mar 2025).
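
As an illustration of the objectives named above, the following is a minimal PyTorch sketch of a per-token distillation loss combining cross-entropy on teacher-generated traces with an optional KL term against teacher logits. The function name, weighting, and temperature are illustrative assumptions, not the exact formulation of the cited papers.

```python
# Illustrative sketch (not the papers' exact recipe): SFT cross-entropy on teacher
# traces plus an optional KL-divergence term against teacher logits.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, target_ids, kl_weight=0.5, temperature=2.0):
    """student_logits, teacher_logits: (batch, seq, vocab); target_ids: (batch, seq)."""
    # Token-level cross-entropy on the teacher-generated trace (standard SFT term).
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-100,  # mask prompt/padding positions
    )
    # Optional soft-label term: KL(teacher || student) over temperature-scaled logits.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (temperature ** 2)
    return ce + kl_weight * kl
```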

2. Performance on Reasoning and Domain Tasks

The R1-Distill family achieves state-of-the-art or near-state-of-the-art pass@1 accuracy on a spectrum of logic, mathematics, coding, and domain evaluations, outperforming base Qwen, Llama, and select proprietary models of similar scale across various settings (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Jahin et al., 13 Mar 2025, Huang et al., 3 Feb 2025).

Quantitative results (selected, pass@1):

| Benchmark | Qwen-1.5B | Qwen-7B | Qwen-14B | Qwen-32B | Llama-8B | Llama-70B |
|---|---|---|---|---|---|---|
| AIME 2024 | 28.9 | 55.5 | 69.7 | 72.6 | 50.4 | 70.0 |
| MATH-500 | 83.9 | 92.8 | 93.9 | 94.3 | 89.1 | 94.5 |
| GPQA Diamond | 40.3 | 54.7 | 61.3 | 67.4 | | |
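
Pass@1 figures such as those above are typically obtained by sampling several completions per problem and averaging; where pass@k for k > 1 is reported, the standard unbiased estimator from the code-generation evaluation literature is commonly used. The helper below is a generic sketch of that estimator and is not taken from the cited papers.

```python
# Generic sketch of the standard unbiased pass@k estimator (HumanEval-style);
# not specific to the cited papers.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn per problem, c: correct samples, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 16 samples per problem, 6 correct.
print(pass_at_k(16, 6, 1))  # 0.375: pass@1 equals the fraction of correct samples
print(pass_at_k(16, 6, 4))  # probability a 4-sample budget contains a correct one
```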

On the comprehensive A-Eval-2.0 benchmark (Zhao et al., 16 Feb 2025), the logical reasoning (LR) subscore increased most sharply with distillation, particularly for smaller models (Qwen-7B Distill: +16.4 points over base). Distilled Qwen-32B and Llama-70B attain mean A-Eval overall scores of 87.0% and 84.0%, respectively.

In explainability-intensive sentiment analysis, DeepSeek-R1-32B achieves 91.39% F1 on 5-way Amazon sentiment (30-shot) and 99.31% on binary IMDB (5-shot), with the chain-of-thought output format providing explicit intermediate reasoning (Huang et al., 3 Feb 2025).

In vertical medical QA (USMLE Step 1), DeepSeek-R1-Distill-7B reaches 92.1% accuracy after compression and LoRA adaptation (Zhang et al., 25 Apr 2025).

For biomedical NLP, students such as DeepSeek-R1-Distill-Llama-70B and Qwen-32B perform competitively in event extraction, relation extraction, NER, and classification, often matching or exceeding the F1 of Llama3-8B and Mistral-7B (Zhan et al., 1 Mar 2025).

3. Domain Adaptation, Compression, and Optimization

The R1-Distill series serves as a flexible foundation for vertical and resource-constrained deployments:

  • Medical vertical adaptation: LoRA-based parameter-efficient fine-tuning transfers medical knowledge from a fine-tuned 70B teacher to a 7B student, with an entity-aware distillation objective and rank-stabilized LoRA decomposition. Quantization to 4/8 bits and Flash Attention further reduce memory and latency, enabling clinical deployment at 5.25 GB RAM and 1.27 s/query latency (Zhang et al., 25 Apr 2025).
  • Quantization: Although not always explicitly benchmarked, 4-bit and 8-bit quantization are feasible, with expected reasoning accuracy drops of 2–3 points (cf. other LLM quantization studies) (Zhao et al., 16 Feb 2025, Zhang et al., 25 Apr 2025); a minimal quantized-loading sketch with a LoRA adapter follows this list.
  • Tool augmentation: RL- or SFT-based augmentation with code-execution traces and tool-calls yields large output quality gains in automated mathematics and programming domains, with up to 86.67% greedy accuracy on AIME 2024 for a 32B student after tool-infusion (Chen et al., 6 Mar 2025).
  • Specialized fusion: Branch-Merge distillation fuses domain-specific expert students (math, coding, science), optimized in isolation and merged via a KL-weighted Arcee Fusion mask, producing composite students (e.g., TinyR1-32B-Preview) that close most of the accuracy gap to very large teachers with substantially reduced training cost (Sun et al., 6 Mar 2025).
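
To make the deployment path concrete, the snippet below sketches 4-bit quantized loading of a distilled checkpoint with a LoRA adapter attached for parameter-efficient fine-tuning, using the transformers, bitsandbytes, and peft libraries. The rank, target modules, and quantization settings are illustrative assumptions, not the configurations reported in the cited papers.

```python
# Illustrative sketch: 4-bit quantized load plus a LoRA adapter for
# parameter-efficient domain adaptation. Hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_rslora=True,      # rank-stabilized scaling, if supported by the installed peft version
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```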

4. Safety, Robustness, and Evaluation Practices

Safety challenges and evaluation reproducibility remain critical concerns for R1-Distill and derivatives.

  • Safety Impact of Distillation: Teacher-to-student distillation on reasoning traces, when done naively, reduces safety, especially on tasks involving discrimination and refusal of harmful instructions (e.g., up to −25.43 points DI accuracy and significant drops in refusal/responsibility rates, per CHiSafetyBench) (Zhang et al., 18 Mar 2025).
  • Safety Enhancement: A lightweight SFT regimen interleaving 50K safety-critical and CoT tasks restores and sometimes surpasses original safety levels, with reasoning accuracy preserved (ΔMATH-500 within ±3.1 points) (Zhang et al., 18 Mar 2025).
  • Deliberative alignment (RealSafe-R1): Full-trace SFT, where refusal is embedded within step-by-step reasoning chains, enables robust safety without performance loss; all sizes achieve near-zero compliance on StrongREJECT harmful tasks while maintaining or improving on arithmetic and QA benchmarks (Zhang et al., 14 Apr 2025).
  • Evaluation Variability and Protocols: Scores on AIME, GPQA Diamond, and other reasoning sets fluctuate substantially with random seed, evaluation batch size N, dataset versioning (especially with/without figures), instruction positioning, and answer permutation. Claimed model improvements under 3 points often fall within evaluation noise (Sun et al., 5 Jun 2025). Best practices are to report confidence intervals, control all inference and prompt hyperparameters, and perform repeated runs; a minimal sketch of such repeated-run aggregation follows this list.
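
The following is a generic sketch of the repeated-run protocol suggested above: aggregate per-run accuracies over several seeds and report a bootstrap confidence interval rather than a single point estimate. It is an illustration, not the evaluation harness used in the cited work, and the example scores are made up.

```python
# Generic sketch: report mean accuracy with a bootstrap confidence interval
# across repeated evaluation runs (different seeds), instead of a single score.
import numpy as np

def bootstrap_ci(run_scores, n_boot=10_000, alpha=0.05, seed=0):
    """run_scores: per-run accuracies (e.g., pass@1 over the full benchmark)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(run_scores, dtype=float)
    boots = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Example: six AIME-style runs of the same checkpoint under different seeds (dummy values).
mean, (lo, hi) = bootstrap_ci([0.683, 0.717, 0.700, 0.667, 0.733, 0.700])
print(f"pass@1 = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# An improvement claimed for another checkpoint is only meaningful if it clears this interval.
```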

5. Advanced Distillation Techniques and Reinforcement Objectives

  • Standard SFT: Core distillation uses token-level cross-entropy, occasionally augmented by teacher-student KL divergence and domain- or entity-aware penalties (DeepSeek-AI et al., 22 Jan 2025, Zhang et al., 25 Apr 2025, Zhan et al., 1 Mar 2025).
  • Preference-based and Reinforcement Distillation (REDI): Recent developments show two-stage pipelines—SFT on positive traces followed by preference tuning (e.g., REDI, DPO, SimPO)—yield further reasoning improvements in 1.5B models, especially when negative/incorrect traces are not discarded but used in the loss, as in:

$$\mathcal{L}_\text{REDI}(\theta) = \mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ -\tfrac{1}{|y_w|} \log \pi_\theta(y_w \mid x) + \alpha\, \tfrac{1}{|y_l|} \log \pi_\theta(y_l \mid x) \Big]$$

with α = 0.8 (Xu et al., 30 May 2025). A minimal implementation sketch of this objective is given at the end of this section.

  • RL Post-Distillation: On-policy RL fine-tuning, using correctness-based rewards and cosine-annealed KL regularization, further improves accuracy on reasoning benchmarks (e.g., +10.7 points post-RL on AIME 2024 for 1.5B), especially when initialized from a distilled checkpoint (Chen et al., 6 Mar 2025).
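
Below is a minimal PyTorch sketch of the REDI objective shown above, computing length-normalized log-likelihoods for a preferred trace y_w and a rejected trace y_l and combining them with the asymmetric weight α. Tensor shapes and helper names are assumptions for illustration, not the reference implementation of Xu et al.

```python
# Illustrative sketch of the REDI objective: maximize likelihood of the positive
# trace while penalizing likelihood of the negative trace.
import torch
import torch.nn.functional as F

def redi_loss(logits_w, labels_w, logits_l, labels_l, alpha=0.8, ignore_index=-100):
    """logits_*: (batch, seq, vocab); labels_*: (batch, seq) with prompt tokens masked."""

    def mean_logprob(logits, labels):
        # Per-token log-probabilities of the reference trace, averaged over its length.
        logprobs = F.log_softmax(logits, dim=-1)
        mask = labels.ne(ignore_index)
        safe_labels = labels.masked_fill(~mask, 0)
        token_lp = logprobs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
        return (token_lp * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)

    lp_w = mean_logprob(logits_w, labels_w)  # (1/|y_w|) log pi_theta(y_w | x)
    lp_l = mean_logprob(logits_l, labels_l)  # (1/|y_l|) log pi_theta(y_l | x)

    # L_REDI = -lp_w + alpha * lp_l; minimizing raises lp_w and lowers lp_l.
    return (-lp_w + alpha * lp_l).mean()
```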

6. Application Domains and Practical Guidance

  • General reasoning: R1-Distill serves as a base for open scientific and engineering tasks; 32B and 70B students match or exceed proprietary models in STEM benchmarks at moderate compute cost (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025).
  • Biomedical NLP: Full-trace distillation enables high F1 in event extraction, NER (>0.95), and text classification; 14B/32B students offer a favorable accuracy/compute trade-off, while 7B models support on-device/low-memory applications (Zhan et al., 1 Mar 2025).
  • Medical QA: Adaptation pipelines for the medical domain demonstrate >92% Step 1 pass scores at <6 GB RAM, supporting edge deployment with negligible reasoning degradation (Zhang et al., 25 Apr 2025).
  • Sentiment and explainability: Distilled models, particularly the Qwen-32B variant, provide traceable, chain-of-thought explanations while maintaining near-SOTA few-shot F1/accuracy at 5–50 shots (Huang et al., 3 Feb 2025).

Model selection guidance is primarily task- and resource-driven, favoring 7B for edge, 14B/32B for balanced accuracy/efficiency, and 70B for maximal accuracy where resources permit (Zhao et al., 16 Feb 2025).

7. Limitations, Open Problems, and Future Directions

  • Distillation Impact: Smaller students trade substantial accuracy (10–30 points) for >10× speed/footprint reduction; critical reasoning capabilities derived via RL (e.g., self-verification, extended CoT) are not fully preserved by SFT alone (Jahin et al., 13 Mar 2025).
  • Robustness: Evaluation noise from run-time settings, prompt design, and dataset curation must be addressed; reported improvements should exceed empirical fluctuation intervals (Sun et al., 5 Jun 2025).
  • Advanced objectives: Incorporating preference learning with explicit negative traces (as in REDI), tool-augmented SFT, or hybrid RLHF methods is a promising route to closing the gap to teacher-level reasoning in compact models (Xu et al., 30 May 2025, Chen et al., 6 Mar 2025).
  • Safety and continual learning: Safety alignment does not compromise reasoning if refusal is embedded in full chain-of-thought traces; sustained monitoring and curriculum mixing are mandatory in deployment (Zhang et al., 14 Apr 2025, Zhang et al., 18 Mar 2025).
  • Domain adaptation: Research continues on efficient transfer pipelines (e.g., LoRA, intermediate-feature matching), real-time retrieval augmentation (RAG), and continual learning to maintain accuracy in evolving scientific or clinical scenarios (Zhan et al., 1 Mar 2025, Zhang et al., 25 Apr 2025).

References:

(DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Zhan et al., 1 Mar 2025, Chen et al., 6 Mar 2025, Sun et al., 6 Mar 2025, Jahin et al., 13 Mar 2025, Huang et al., 3 Feb 2025, Zhang et al., 18 Mar 2025, Zhang et al., 14 Apr 2025, Zhang et al., 25 Apr 2025, Xu et al., 30 May 2025, Sun et al., 5 Jun 2025)
