
DeepSeek-R1-Distill Model Series

Updated 1 December 2025
  • The DeepSeek-R1-Distill Model Series is a family of dense, open-source LLMs distilled via supervised fine-tuning on roughly 800K chain-of-thought samples generated by an RL-optimized teacher.
  • They employ a multi-stage distillation process using cross-entropy, KL divergence, and entity-aware penalties to refine performance across diverse domains.
  • The models achieve near state-of-the-art accuracy on reasoning, STEM, biomedical, and sentiment tasks while enabling efficient deployment through quantization and LoRA adaptations.

The DeepSeek-R1-Distill Model Series is a family of dense, open-source LLMs distilled from the DeepSeek-R1 teacher, an RL-optimized Mixture-of-Experts model. The student models inherit the teacher's enhanced reasoning capabilities through supervised fine-tuning on high-quality reasoning traces, balancing near-state-of-the-art accuracy, compute efficiency, and practical deployment features across domains including general reasoning, STEM, biomedicine, and safety-critical applications (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Huang et al., 3 Feb 2025).

1. Model Family, Architecture, and Distillation Pipeline

The R1-Distill series spans multiple parameter scales and underlying architectures. All variants leverage publicly available Transformer backbones—primarily Qwen2.5 and Llama 3.x—refined via supervised distillation:

| Model | Params | Layers | Hidden Dim | Heads |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5 B | 24 | ~2,048 | 16 |
| DeepSeek-R1-Distill-Qwen-7B | 7 B | 32 | ~4,096 | 32 |
| DeepSeek-R1-Distill-Llama-8B | 8 B | 32 | ~4,096 | 32 |
| DeepSeek-R1-Distill-Qwen-14B | 14 B | 48 | ~6,144 | 48 |
| DeepSeek-R1-Distill-Qwen-32B | 32 B | 64 | ~8,192 | 64 |
| DeepSeek-R1-Distill-Llama-70B | 70 B | 80 | 12,288 | 96 |

Source: (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Zhan et al., 1 Mar 2025).
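
The distilled checkpoints are published as standard Hugging Face causal LM repositories (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-7B). The snippet below is a minimal loading-and-inference sketch using the transformers library; the generation settings and prompt are illustrative assumptions, not a configuration prescribed by the cited papers.

```python
# Minimal sketch: load a distilled checkpoint and sample a chain-of-thought answer.
# Assumes the public Hugging Face repo id and a single CUDA device; adjust as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place weights on available GPU(s)
)

# R1-style students emit their reasoning inside <think> ... </think> before the answer.
prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```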

The distillation follows a sequence:

  • The DeepSeek-R1 teacher is first optimized with reinforcement learning and then used to generate roughly 800K curated chain-of-thought samples.
  • The student backbones (Qwen2.5 and Llama 3.x) are supervised fine-tuned on these traces with token-level cross-entropy, optionally augmented by teacher-student KL divergence and domain- or entity-aware penalties.
  • Deployment-oriented steps such as quantization and LoRA adaptation follow as needed (DeepSeek-AI et al., 22 Jan 2025, Zhang et al., 25 Apr 2025, Zhan et al., 1 Mar 2025).
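
As an illustration of the objectives named above, the following is a minimal PyTorch sketch of a per-token distillation loss combining cross-entropy on teacher-generated traces with an optional KL term against teacher logits. The function name, weighting, and temperature are illustrative assumptions, not the exact formulation of the cited papers.

```python
# Illustrative sketch (not the papers' exact recipe): SFT cross-entropy on teacher
# traces plus an optional KL-divergence term against teacher logits.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, target_ids, kl_weight=0.5, temperature=2.0):
    """student_logits, teacher_logits: (batch, seq, vocab); target_ids: (batch, seq)."""
    # Token-level cross-entropy on the teacher-generated trace (standard SFT term).
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-100,  # mask prompt/padding positions
    )
    # Optional soft-label term: KL(teacher || student) over temperature-scaled logits.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (temperature ** 2)
    return ce + kl_weight * kl
```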

2. Performance on Reasoning and Domain Tasks

The R1-Distill family achieves state-of-the-art or near-state-of-the-art pass@1 accuracy on a spectrum of logic, mathematics, coding, and domain evaluations, outperforming base Qwen, Llama, and select proprietary models of similar scale across various settings (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Jahin et al., 13 Mar 2025, Huang et al., 3 Feb 2025).

Quantitative results (selected, pass@1):

| Benchmark | Qwen-1.5B | Qwen-7B | Qwen-14B | Qwen-32B | Llama-8B | Llama-70B |
|---|---|---|---|---|---|---|
| AIME 2024 | 28.9 | 55.5 | 69.7 | 72.6 | 50.4 | 70.0 |
| MATH-500 | 83.9 | 92.8 | 93.9 | 94.3 | 89.1 | 94.5 |
| GPQA Diamond | 40.3 | 54.7 | 61.3 | 67.4 | | |
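
Pass@1 figures such as those above are typically obtained by sampling several completions per problem and averaging; where pass@k for k > 1 is reported, the standard unbiased estimator from the code-generation evaluation literature is commonly used. The helper below is a generic sketch of that estimator and is not taken from the cited papers.

```python
# Generic sketch of the standard unbiased pass@k estimator (HumanEval-style);
# not specific to the cited papers.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn per problem, c: correct samples, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 16 samples per problem, 6 correct.
print(pass_at_k(16, 6, 1))  # 0.375: pass@1 equals the fraction of correct samples
print(pass_at_k(16, 6, 4))  # probability a 4-sample budget contains a correct one
```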

On the comprehensive A-Eval-2.0 benchmark (Zhao et al., 16 Feb 2025), the logical reasoning (LR) subscore increased most sharply with distillation, particularly for smaller models (Qwen-7B Distill: +16.4 points over base). Distilled Qwen-32B and Llama-70B attain mean A-Eval overall scores of 87.0% and 84.0%, respectively.

In explainability-intensive sentiment analysis, DeepSeek-R1-32B achieves 91.39% F1 on 5-way Amazon sentiment (30-shot) and 99.31% on binary IMDB (5-shot), with the chain-of-thought output format providing explicit intermediate reasoning (Huang et al., 3 Feb 2025).

In vertical medical QA (USMLE Step 1), DeepSeek-R1-Distill-7B reaches 92.1% accuracy after compression and LoRA adaptation (Zhang et al., 25 Apr 2025).

For biomedical NLP, students such as DeepSeek-R1-Distill-Llama-70B and Qwen-32B perform competitively in event extraction, relation extraction, NER, and classification, often matching or exceeding the F1 of Llama3-8B and Mistral-7B (Zhan et al., 1 Mar 2025).

3. Domain Adaptation, Compression, and Optimization

The R1-Distill series serves as a flexible foundation for vertical and resource-constrained deployments:

  • Medical vertical adaptation: LoRA-based parameter-efficient fine-tuning transfers medical knowledge from a fine-tuned 70B teacher to a 7B student, with an entity-aware distillation objective and rank-stabilized LoRA decomposition. Quantization to 4/8 bits and Flash Attention further reduce memory and latency, enabling clinical deployment at 5.25 GB RAM and 1.27 s/query latency (Zhang et al., 25 Apr 2025).
  • Quantization: Although not always explicitly benchmarked, 4-bit and 8-bit quantization are feasible, with expected reasoning accuracy drops of 2–3 points (cf. other LLM quantization studies) (Zhao et al., 16 Feb 2025, Zhang et al., 25 Apr 2025); a minimal quantized-loading sketch with a LoRA adapter follows this list.
  • Tool augmentation: RL- or SFT-based augmentation with code-execution traces and tool-calls yields large output quality gains in automated mathematics and programming domains, with up to 86.67% greedy accuracy on AIME 2024 for a 32B student after tool-infusion (Chen et al., 6 Mar 2025).
  • Specialized fusion: Branch-Merge distillation fuses domain-specific expert students (math, coding, science), optimized in isolation and merged via a KL-weighted Arcee Fusion mask, producing composite students (e.g., TinyR1-32B-Preview) that close most of the accuracy gap to very large teachers with substantially reduced training cost (Sun et al., 6 Mar 2025).
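
To make the deployment path concrete, the snippet below sketches 4-bit quantized loading of a distilled checkpoint with a LoRA adapter attached for parameter-efficient fine-tuning, using the transformers, bitsandbytes, and peft libraries. The rank, target modules, and quantization settings are illustrative assumptions, not the configurations reported in the cited papers.

```python
# Illustrative sketch: 4-bit quantized load plus a LoRA adapter for
# parameter-efficient domain adaptation. Hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_rslora=True,      # rank-stabilized scaling, if supported by the installed peft version
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```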

4. Safety, Robustness, and Evaluation Practices

Safety challenges and evaluation reproducibility remain critical concerns for R1-Distill and derivatives.

  • Safety Impact of Distillation: Teacher-to-student distillation on reasoning traces, when done naively, reduces safety, especially on tasks involving discrimination and refusal of harmful instructions (e.g., up to −25.43 points DI accuracy and significant drops in refusal/responsibility rates, per CHiSafetyBench) (Zhang et al., 18 Mar 2025).
  • Safety Enhancement: A lightweight SFT regimen interleaving 50K safety-critical and CoT tasks restores and sometimes surpasses original safety levels, with reasoning accuracy preserved (ΔMATH-500 within ±3.1 points) (Zhang et al., 18 Mar 2025).
  • Deliberative alignment (RealSafe-R1): Full-trace SFT, where refusal is embedded within step-by-step reasoning chains, enables robust safety without performance loss; all sizes achieve near-zero compliance on StrongREJECT harmful tasks while maintaining or improving on arithmetic and QA benchmarks (Zhang et al., 14 Apr 2025).
  • Evaluation Variability and Protocols: Scores on AIME, GPQA Diamond, and other reasoning sets fluctuate substantially with random seed, evaluation batch size N, dataset versioning (especially with/without figures), instruction positioning, and answer permutation. Claimed model improvements under 3 points often fall within evaluation noise (Sun et al., 5 Jun 2025). Best practices are to report confidence intervals, control all inference and prompt hyperparameters, and perform repeated runs; a minimal sketch of such repeated-run aggregation follows this list.
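
The following is a generic sketch of the repeated-run protocol suggested above: aggregate per-run accuracies over several seeds and report a bootstrap confidence interval rather than a single point estimate. It is an illustration, not the evaluation harness used in the cited work, and the example scores are made up.

```python
# Generic sketch: report mean accuracy with a bootstrap confidence interval
# across repeated evaluation runs (different seeds), instead of a single score.
import numpy as np

def bootstrap_ci(run_scores, n_boot=10_000, alpha=0.05, seed=0):
    """run_scores: per-run accuracies (e.g., pass@1 over the full benchmark)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(run_scores, dtype=float)
    boots = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Example: six AIME-style runs of the same checkpoint under different seeds (dummy values).
mean, (lo, hi) = bootstrap_ci([0.683, 0.717, 0.700, 0.667, 0.733, 0.700])
print(f"pass@1 = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# An improvement claimed for another checkpoint is only meaningful if it clears this interval.
```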

5. Advanced Distillation Techniques and Reinforcement Objectives

  • Standard SFT: Core distillation uses token-level cross-entropy, occasionally augmented by teacher-student KL divergence and domain- or entity-aware penalties (DeepSeek-AI et al., 22 Jan 2025, Zhang et al., 25 Apr 2025, Zhan et al., 1 Mar 2025).
  • Preference-based and Reinforcement Distillation (REDI): Recent developments show two-stage pipelines—SFT on positive traces followed by preference tuning (e.g., REDI, DPO, SimPO)—yield further reasoning improvements in 1.5B models, especially when negative/incorrect traces are not discarded but used in the loss, as in:

$$\mathcal{L}_\text{REDI}(\theta) = \mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ -\tfrac{1}{|y_w|} \log \pi_\theta(y_w \mid x) + \alpha\, \tfrac{1}{|y_l|} \log \pi_\theta(y_l \mid x) \Big]$$

with α = 0.8 (Xu et al., 30 May 2025). A minimal implementation sketch of this objective is given at the end of this section.

  • RL Post-Distillation: On-policy RL fine-tuning, using correctness-based rewards and cosine-annealed KL regularization, further improves accuracy on reasoning benchmarks (e.g., +10.7 points post-RL on AIME 2024 for 1.5B), especially when initialized from a distilled checkpoint (Chen et al., 6 Mar 2025).
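
Below is a minimal PyTorch sketch of the REDI objective shown above, computing length-normalized log-likelihoods for a preferred trace y_w and a rejected trace y_l and combining them with the asymmetric weight α. Tensor shapes and helper names are assumptions for illustration, not the reference implementation of Xu et al.

```python
# Illustrative sketch of the REDI objective: maximize likelihood of the positive
# trace while penalizing likelihood of the negative trace.
import torch
import torch.nn.functional as F

def redi_loss(logits_w, labels_w, logits_l, labels_l, alpha=0.8, ignore_index=-100):
    """logits_*: (batch, seq, vocab); labels_*: (batch, seq) with prompt tokens masked."""

    def mean_logprob(logits, labels):
        # Per-token log-probabilities of the reference trace, averaged over its length.
        logprobs = F.log_softmax(logits, dim=-1)
        mask = labels.ne(ignore_index)
        safe_labels = labels.masked_fill(~mask, 0)
        token_lp = logprobs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
        return (token_lp * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)

    lp_w = mean_logprob(logits_w, labels_w)  # (1/|y_w|) log pi_theta(y_w | x)
    lp_l = mean_logprob(logits_l, labels_l)  # (1/|y_l|) log pi_theta(y_l | x)

    # L_REDI = -lp_w + alpha * lp_l; minimizing raises lp_w and lowers lp_l.
    return (-lp_w + alpha * lp_l).mean()
```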

6. Application Domains and Practical Guidance

  • General reasoning: R1-Distill serves as a base for open scientific and engineering tasks; 32B and 70B students match or exceed proprietary models in STEM benchmarks at moderate compute cost (DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025).
  • Biomedical NLP: Full-trace distillation enables high F1 in event extraction, NER (>0.95), and text classification; 14B/32B students offer a favorable accuracy/compute trade-off, while 7B models support on-device/low-memory applications (Zhan et al., 1 Mar 2025).
  • Medical QA: Adaptation pipelines for the medical domain demonstrate >92% Step 1 pass scores at <6 GB RAM, supporting edge deployment with negligible reasoning degradation (Zhang et al., 25 Apr 2025).
  • Sentiment and explainability: Distilled models, particularly the Qwen-32B variant, provide traceable, chain-of-thought explanations while maintaining near-SOTA few-shot F1/accuracy at 5–50 shots (Huang et al., 3 Feb 2025).

Model selection guidance is primarily task- and resource-driven, favoring 7B for edge, 14B/32B for balanced accuracy/efficiency, and 70B for maximal accuracy where resources permit (Zhao et al., 16 Feb 2025).

7. Limitations, Open Problems, and Future Directions

  • Distillation Impact: Smaller students trade substantial accuracy (10–30 points) for >10× speed/footprint reduction; critical reasoning capabilities derived via RL (e.g., self-verification, extended CoT) are not fully preserved by SFT alone (Jahin et al., 13 Mar 2025).
  • Robustness: Evaluation noise from run-time settings, prompt design, and dataset curation must be addressed; reported improvements should exceed empirical fluctuation intervals (Sun et al., 5 Jun 2025).
  • Advanced objectives: Incorporating preference learning with explicit negative traces (as in REDI), tool-augmented SFT, or hybrid RLHF methods is a promising route to closing the gap to teacher-level reasoning in compact models (Xu et al., 30 May 2025, Chen et al., 6 Mar 2025).
  • Safety and continual learning: Safety alignment does not compromise reasoning if refusal is embedded in full chain-of-thought traces; sustained monitoring and curriculum mixing are mandatory in deployment (Zhang et al., 14 Apr 2025, Zhang et al., 18 Mar 2025).
  • Domain adaptation: Research continues on efficient transfer pipelines (e.g., LoRA, intermediate-feature matching), real-time retrieval augmentation (RAG), and continual learning to maintain accuracy in evolving scientific or clinical scenarios (Zhan et al., 1 Mar 2025, Zhang et al., 25 Apr 2025).

References:

(DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025, Zhan et al., 1 Mar 2025, Chen et al., 6 Mar 2025, Sun et al., 6 Mar 2025, Jahin et al., 13 Mar 2025, Huang et al., 3 Feb 2025, Zhang et al., 18 Mar 2025, Zhang et al., 14 Apr 2025, Zhang et al., 25 Apr 2025, Xu et al., 30 May 2025, Sun et al., 5 Jun 2025)
