DeepSeek-R1-Distill-1.5B: Distilled Reasoning LLM
- Distillation pipelines compress advanced chain-of-thought reasoning from large DeepSeek-R1 teacher models into a compact 1.5B-parameter architecture.
- DeepSeek-R1-Distill-1.5B employs skill-aware data selection, RL fine-tuning, and MCTS-based chain-of-thought generation to boost performance across logical, mathematical, and real-world tasks.
- The model achieves significant accuracy gains on benchmarks while trading some extended, multi-path reasoning capability for efficiency.
DeepSeek-R1-Distill-1.5B is a family of compact, 1.5-billion-parameter transformer models distilled from the large-scale DeepSeek-R1 "reasoning" LLMs. The primary aim is to compress the advanced chain-of-thought (CoT) reasoning, self-verification, and reflection strategies of DeepSeek-R1 (analogous to "slow-thinking" or o1-style models) into resource-efficient dense architectures suitable for real-time or edge deployment—while retaining as much of the teacher's reasoning capability as possible. These distilled models employ a suite of teacher-student distillation, skill-centric data selection, RL fine-tuning, and safety-enhanced SFT techniques validated across mathematical, logical, and real-world tasks.
1. Architecture and Distillation Protocols
The backbone for DeepSeek-R1-Distill-1.5B is generally the Qwen2.5-Math-1.5B, a dense decoder-only transformer comprising 24–28 layers, 32 attention heads, hidden size in the 1280–2048 range, and employing standard positional encoding and tokenization as in the Qwen/Llama family (Zhao et al., 16 Feb 2025, Chen et al., 6 Mar 2025, Zhang et al., 18 Mar 2025). No Mixture-of-Experts, adapters, or architectural augmentations are introduced in the distillation; the knowledge transfer occurs entirely through post hoc supervised and semi-supervised training.
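The backbone described above can be summarized as a plain dense configuration. The values below are illustrative picks within the quoted ranges, not official release hyperparameters (the vocabulary size in particular is an assumption about the Qwen-family tokenizer):

```python
from dataclasses import dataclass

@dataclass
class StudentConfig:
    """Illustrative (not official) shape of the dense 1.5B student backbone."""
    n_layers: int = 28          # within the quoted 24-28 decoder-layer range
    n_heads: int = 32           # attention heads, as stated above
    hidden_size: int = 1536     # within the quoted 1280-2048 range
    vocab_size: int = 151_936   # assumed Qwen-family tokenizer size
    moe: bool = False           # no Mixture-of-Experts or adapters

cfg = StudentConfig()
# Rough dense-transformer estimate: ~12 * hidden^2 parameters per layer
# (attention + MLP), excluding embeddings.
params_per_layer = 12 * cfg.hidden_size ** 2
print(f"~{cfg.n_layers * params_per_layer / 1e9:.2f}B non-embedding parameters")
```

With embedding matrices added, such a configuration lands near the 1.5B total the name implies, consistent with a dense decoder-only design and no architectural augmentations.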
The canonical student–teacher setup involves the following pipeline:
- Teacher model: DeepSeek-R1 (a 671B-parameter MoE with ~37B activated parameters), RL-finetuned on mathematical, logical, and general chain-of-thought tasks.
- Distillation data: Typically, 50–200k high-quality teacher outputs, emphasizing mathematical reasoning, logical puzzles, and QA, with filtering for answer correctness and formatting. Datasets may blend OpenThoughts, “reasoning-enhancement” corpora, or DeepSeek-R1’s instruction-tuning traces (Zhao et al., 16 Feb 2025, Chen et al., 6 Mar 2025, Zhang et al., 18 Mar 2025).
- Distillation loss: A mixture of next-token cross-entropy on teacher-provided CoT traces and, in some variants, a temperature-scaled KL divergence between student and teacher logits. For nearly all models, the dominant loss is standard generation cross-entropy; explicit KD is occasionally reported, but not always present in the final optimized pipeline (Chen et al., 6 Mar 2025, Zhao et al., 16 Feb 2025).
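The mixed objective in the pipeline above can be sketched per token as follows. The softening temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values reported in the cited papers:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, target_id, T=2.0, alpha=0.5):
    """Sketch of the mixed distillation objective: next-token cross-entropy
    on the teacher's CoT trace, plus an optional temperature-softened KL
    term against the teacher's logits."""
    ce = -np.log(softmax(student_logits)[target_id])   # generation cross-entropy
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))     # KL(teacher || student)
    return alpha * ce + (1 - alpha) * (T ** 2) * kl    # T^2 rescales KD gradients
```

When the KD term is dropped (`alpha = 1`), this reduces to the plain generation cross-entropy that dominates most reported pipelines.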
2. Reinforcement Learning and Post-Distillation Refinement
RL fine-tuning is a key strategy for further improving reasoning capability post-distillation (Chen et al., 6 Mar 2025). The RL phase utilizes Proximal Policy Optimization (PPO), with a policy $\pi_\theta$ initialized from the distilled checkpoint, and task-specific reward functions tightly linked to solution correctness and reasoning format. Typical reward structure:
- $r_{\text{acc}} = 1$ if the solution matches the ground truth (often requiring a boxed final answer), else $0$.
- $r_{\text{fmt}} = 1$ if the output follows the prescribed reasoning–answer separation, else $0$.
- Total reward $r = r_{\text{acc}} + r_{\text{fmt}}$. PPO controls include the clipping parameter $\epsilon$, a KL penalty against the reference policy, on-policy rollouts, etc.
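A minimal rule-based reward of this shape might look like the sketch below. The `<think>` tags and boxed-answer convention are assumptions about the prescribed format, not the exact parser used in the cited work:

```python
import re

def reward(output: str, ground_truth: str) -> float:
    """Sketch of a rule-based PPO reward: a correctness term plus a
    format term, each worth 1.0 (assumed shape)."""
    # Format reward: response must separate reasoning from the final answer,
    # here assumed to use <think>...</think> followed by a \boxed{} answer.
    r_fmt = 1.0 if re.search(r"<think>.*</think>", output, re.S) else 0.0
    # Accuracy reward: the boxed answer must match the ground truth.
    m = re.search(r"\\boxed\{([^}]*)\}", output)
    r_acc = 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0
    return r_acc + r_fmt

good = "<think>2+2=4</think> The answer is \\boxed{4}."
```

Because both terms are binary and verifiable, the reward needs no learned model, which is what makes this RL phase cheap relative to preference-model-based RLHF.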
This RL refinement reliably delivers large accuracy gains: on AIME 2024, the distilled baseline improves from 28.7% to 39.3% after RL; similarly, the Omni-MATH average jumps from 25.6% to 33.0%, with little change in response length (Chen et al., 6 Mar 2025).
3. Specialized Distillation Methodologies
Several key advances in efficient distillation and data selection underpin the best-performing DeepSeek-R1-Distill-1.5B variants:
- Skill-aware data selection leverages hierarchical skill trees and online per-skill accuracy profiling to concentrate distillation on the student’s weakest skills, leading to flattened per-skill error curves and more uniform competence (Zhang et al., 15 Jan 2026).
- Monte Carlo Tree Search (MCTS)-based chain-of-thought data: Tree-structured CoT reasoning paths are generated via MCTS with semantic roles (Thinking, Reflection, Sub-Task, etc.). SFT is performed using the longest correct paths, and preference alignment (DPO, cDPO) is applied using minimal correct vs. incorrect path pairs, leading to improved diversity and robustness in reasoning (Yin et al., 3 Mar 2025).
- Joint objective training: Weighted blends of SFT and preference loss (e.g., $\mathcal{L} = \lambda\,\mathcal{L}_{\text{SFT}} + (1-\lambda)\,\mathcal{L}_{\text{pref}}$) help maintain reasoning syntax while aligning behavior to preference data.
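Such a blended objective can be sketched over sequence log-probabilities using the standard DPO formulation; `beta` and the mixing weight `lam` are illustrative, not reported values:

```python
import numpy as np

def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on sequence log-probs of a chosen/rejected pair
    (e.g., the minimal correct vs. incorrect MCTS reasoning paths above),
    relative to a frozen reference policy."""
    margin = beta * ((lp_chosen - ref_chosen) - (lp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

def joint_loss(sft_nll, lp_c, lp_r, ref_c, ref_r, lam=0.5):
    """Weighted SFT + preference blend; lam trades syntax preservation
    (SFT term) against preference alignment (DPO term)."""
    return lam * sft_nll + (1 - lam) * dpo_loss(lp_c, lp_r, ref_c, ref_r)
```

The SFT term anchors the student to well-formed reasoning traces, while the DPO term pushes probability mass from incorrect toward correct paths without a separate reward model.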
4. Benchmarking and Reasoning Performance
Distilled 1.5B models achieve large relative gains over baseline Qwen2.5-Math-1.5B across both synthetic reasoning and real-world benchmarks. Representative results:
| Category | Baseline 1.5B | DeepSeek-R1-Distill-1.5B |
|---|---|---|
| Text Understanding | 17.4 | 44.2 |
| Logical Reasoning | 22.4 | 62.3 |
| Math Accuracy (MATH) | 47.0 | 65.6–86.3 |
| AIME 2024 | 11.7 | 28.7–31.0 |
Scores from (Zhao et al., 16 Feb 2025, Jahin et al., 13 Mar 2025, Chen et al., 6 Mar 2025, Zhang et al., 14 Apr 2025).
The most salient quantitative gains are seen in logical reasoning, algebra, and long-form chain-of-thought math, where chain-of-thought accuracy and error correction increase by 2–3× over non-distilled baselines (Zhao et al., 16 Feb 2025). RL refinement further boosts AIME 2024 by +10.7 points, and pass@1 gains on advanced math sets reach 6–7 pp (Chen et al., 6 Mar 2025). MCTS-based methods yield MATH Test@8 accuracy of 70.8% at 1.5B scale, which significantly exceeds baseline SFT (Yin et al., 3 Mar 2025).
In real-world agentic and planning tasks, DeepSeek-R1-Distill-1.5B as a chain-of-thought discriminator outperforms non-reasoning LLMs up to 13B parameters in SQL ranking F1, pairwise discrimination accuracy, and execution accuracy—demonstrating that discrimination often benefits more from compact reasoning LLMs than generation does (Anjum, 30 Apr 2025).
5. Reasoning Efficiency, Value Modeling, and Inference Scalability
To further optimize runtime and inference cost, token-level value modeling and value-guided search (VGS) techniques have been developed (Wang et al., 23 May 2025). A 1.5B parameter value model is trained on 2.5M labeled reasoning traces, and VGS is applied during inference to blockwise sample, score, and vote on generations. This yields:
- Test-time accuracy of 45.7% average across four math competition benchmarks at N=256 generations with >75% FLOPs reduction versus majority voting.
- VGS+WMV consistently outperforms both naive best-of-N and process reward model (PRM) guidance at similar compute budgets.
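The blockwise sample–score–vote loop can be sketched as follows. Function names, the beam shape, and the `[EOS]` sentinel are illustrative assumptions, not the authors' exact algorithm:

```python
from collections import defaultdict

def value_guided_search(generate_block, value_model, n_blocks=8, beam=4, max_steps=16):
    """Sketch of blockwise value-guided search: at each step, sample candidate
    continuation blocks, score the partial traces with a value model, and
    keep only the top-scoring prefixes."""
    prefixes = [""]
    for _ in range(max_steps):
        candidates = [p + generate_block(p)
                      for p in prefixes for _ in range(n_blocks)]
        candidates.sort(key=value_model, reverse=True)
        prefixes = candidates[:beam]
        if all(p.endswith("[EOS]") for p in prefixes):
            break
    return prefixes

def weighted_majority_vote(traces, value_model, extract_answer):
    """Value-weighted majority vote (WMV): each completed trace votes for
    its final answer with weight equal to its value score."""
    scores = defaultdict(float)
    for t in traces:
        scores[extract_answer(t)] += value_model(t)
    return max(scores, key=scores.get)
```

Pruning low-value prefixes early is what yields the quoted FLOPs savings: compute is spent only on blocks the value model considers promising, rather than on full-length generations that majority voting would discard anyway.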
This scalable reasoning pipeline is critical for edge and cost-constrained applications, as it delivers high accuracy per FLOP (Wang et al., 23 May 2025, Zhao et al., 16 Feb 2025).
6. Safety Alignment and Refusal Behavior
DeepSeek-R1-Distill-1.5B has been rigorously evaluated for safety, particularly in Chinese contexts. Distillation from R1 lifts overall CHiSafetyBench MCQ risk-identification accuracy by +2.9 points (54.44% vs. 51.5%) but initially degrades the model's ability to responsibly refuse risky or harmful prompts (RR₂: -2.2 percentage points). Targeted SFT on ~50k safety and chain-of-thought examples realigns the model: risky MCQ accuracy rises to 59.8%, responsible refusal rates (RR₁/RR₂) rise by 6–10 points, and harmful reply rates halve—all without measurable performance drop on core reasoning tasks (Zhang et al., 18 Mar 2025). An independent safety-alignment study (RealSafe-R1) adopts explicit refusal-tuned SFT, reducing harm scores to zero and boosting full refusals from 35.5% → 85.5% on unsafe queries with negligible (<1 pp) hit to average reasoning benchmarks (Zhang et al., 14 Apr 2025).
7. Tool and Code-Integrated Distilled Reasoners
The 1.5B family supports code interpreter integration via post-training reinforcement learning (CoRT framework). After distillation, code-integrated RL (via GRPO or similar) with hint-injected data reduces token utilization by up to 50% and increases math accuracy by +8 absolute points (e.g., AIME 2024 from 28.8%→41.0%) compared to natural-language-only baselines. Hint-engineering and rejection fine-tuning drive the model to trust interpreter results and avoid redundant manual verification, maximizing code call efficiency (Li et al., 23 Oct 2025).
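A minimal sketch of such a code-integrated reasoning loop is shown below, assuming hypothetical `<code>...</code>` delimiters for interpreter calls (the actual CoRT prompt format and sandbox are not specified here):

```python
import contextlib
import io
import re

def run_with_interpreter(model_step, problem, max_rounds=4):
    """Sketch of a code-integrated reasoning loop: whenever the model emits
    a <code>...</code> block, execute it and append the interpreter output
    so the model can condition on the result instead of re-deriving it."""
    transcript = problem
    for _ in range(max_rounds):
        step = model_step(transcript)
        transcript += step
        m = re.search(r"<code>(.*?)</code>", step, re.S)
        if not m:
            return transcript            # no code call: model gave a final answer
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(m.group(1), {})          # sandboxing omitted in this sketch
        transcript += f"\n[interpreter output]\n{buf.getvalue()}"
    return transcript
```

The token savings reported above come from exactly this pattern: once the model trusts the interpreter output, it skips the long manual-verification passages that natural-language-only baselines generate.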
8. Limitations, Scaling, and Use-case Considerations
Despite substantial gains in efficiency and compactness, DeepSeek-R1-Distill-1.5B exhibits performance drops compared to the full R1 teacher. On MMLU (Formal Logic), the gap approaches 40–50 percentage points; on other advanced math tasks, the typical drop is 15–30 points, reflecting weaker retention of the teacher's self-verification and reflection behavior and a narrower set of active reasoning pathways (Jahin et al., 13 Mar 2025). A tradeoff emerges: 1.5B models deliver 16× speedup and ~10× efficiency gains suitable for real-time, edge, and low-cost deployment, but at the price of a significantly lower performance ceiling in domains demanding extended, multi-path reasoning. Approaches such as skill-aware data selection, preference optimization, and code-integrated RL can moderate, but not eliminate, this gap.
Recommended deployments emphasize:
- On-device tutoring or puzzle apps, FAQ reasoning bots, and initial-stage agentic filtering in hybrid LLM pipelines (Zhao et al., 16 Feb 2025).
- Reserving final answer generation or solution derivation on the most challenging inputs for larger, reasoning-specialist teacher models.
References
- "Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis" (Zhao et al., 16 Feb 2025)
- "An Empirical Study on Eliciting and Improving R1-like Reasoning Models" (Chen et al., 6 Mar 2025)
- "Value-Guided Search for Efficient Chain-of-Thought Reasoning" (Wang et al., 23 May 2025)
- "Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts" (Zhang et al., 18 Mar 2025)
- "RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability" (Zhang et al., 14 Apr 2025)
- "Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation" (Zhang et al., 15 Jan 2026)
- "Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models" (Yin et al., 3 Mar 2025)
- "Teaching LLMs to Reason with Tools" (Li et al., 23 Oct 2025)
- "Evaluating Mathematical Reasoning Across LLMs: A Fine-Grained Approach" (Jahin et al., 13 Mar 2025)
- "When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator" (Anjum, 30 Apr 2025)