DeepSeek-R1-Distill-Qwen-1.5B Model
- The paper presents a 1.5B dense language model distilled via supervised fine-tuning from DeepSeek-R1, inheriting advanced chain-of-thought reasoning capabilities.
- It demonstrates significant performance gains on benchmarks like AIME2024 and MATH-500, outperforming larger instruction-tuned models in logical and mathematical tasks.
- Distillation leverages 800,000 high-quality reasoning traces; subsequent reward-shaping techniques further optimize the model for efficient, parameter-economical reasoning across diverse applications.
DeepSeek-R1-Distill-Qwen-1.5B is a 1.5-billion-parameter dense LLM constructed via supervised distillation from the DeepSeek-R1 teacher, itself a reinforcement-learning-optimized reasoning LLM. This model, built on the Qwen2.5 backbone, is engineered to inherit advanced chain-of-thought reasoning abilities suitable for competitive mathematical, logical, and code tasks, while being sufficiently compact for resource-limited deployment. Its training, evaluation, and subsequent optimization illustrate core trends in the RL-for-reasoning LLM paradigm, efficiency-oriented distillation, and fine-grained reward shaping.
1. Origin and Distillation Methodology
DeepSeek-R1-Distill-Qwen-1.5B is produced by distilling the reasoning behaviors and chain-of-thought outputs of the DeepSeek-R1 model into the Qwen2.5-1.5B backbone through supervised fine-tuning (SFT) (DeepSeek-AI et al., 22 Jan 2025). DeepSeek-R1 itself emerges from a pipeline consisting of a cold-start phase (high-quality, long CoT SFT), RL (predominantly GRPO-based), and further SFT, producing reasoning chains delimited by tags such as <think> and <answer>.
The distillation leverages approximately 800,000 high-quality reasoning traces generated from DeepSeek-R1, sourced from challenging mathematical, logical, and programmatic tasks. No additional RL is performed during distillation at the 1.5B scale; SFT alone suffices to transfer chain-of-thought paradigms and reasoning structure. The Qwen series is favored as the student backbone for its alignment properties and, specifically in Qwen2.5, robust base reasoning skills.
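A minimal sketch of this SFT distillation stage is shown below, assuming teacher traces already formatted with <think>/<answer> tags and a Hugging Face causal-LM student; the model identifier, hyperparameters, and single-example corpus are illustrative placeholders, not the paper's actual configuration.

```python
# Minimal supervised distillation sketch: fine-tune a small causal LM on
# teacher-generated reasoning traces (illustrative setup only).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B"  # student backbone (illustrative identifier)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

# Each trace is a full teacher output: prompt + <think>...</think> + <answer>...</answer>.
traces = [
    "Problem: 2 + 2 = ?\n<think>Add the two integers: 2 + 2 = 4.</think>\n<answer>4</answer>",
    # ... roughly 800k curated traces in the actual distillation corpus
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True,
                    truncation=True, max_length=4096)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding tokens in the loss
    enc["labels"] = labels
    return enc

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in DataLoader(traces, batch_size=2, shuffle=True, collate_fn=collate):
    loss = model(**batch).loss  # token-level cross-entropy on the teacher trace
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because the labels are simply the teacher's own tokens, this is plain next-token cross-entropy on DeepSeek-R1 outputs; no RL stage is involved at the 1.5B scale.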
2. Reasoning Capability and Performance Scaling
The primary contribution of DeepSeek-R1-Distill-Qwen-1.5B is the reliable transfer of complex multi-step reasoning to a parameter-efficient model. On reasoning benchmarks such as AIME 2024 and MATH-500, the distilled 1.5B variant demonstrates substantial gains over math-focused or instruction-tuned baselines and even surpasses larger non-reasoning LLMs in discriminator-oriented tasks (Zhao et al., 16 Feb 2025, Anjum, 30 Apr 2025). For instance, distillation is reported to improve Qwen2.5-Math-1.5B's scores on general tasks by 178.74% (Zhao et al., 16 Feb 2025).
Empirical results in several works show that while scaling the model parameter count (e.g., from 1.5B up to 32B/70B) yields monotonic accuracy improvements per established scaling laws, the benefits of reasoning-driven SFT disproportionately elevate smaller models, particularly when the training data is curated for difficulty and diversity (Zhao et al., 16 Feb 2025). Notably, the DeepSeek-R1-Distill-Qwen-1.5B model can achieve better discrimination accuracy than much larger models, thus inverting scale-driven performance expectations in evaluation-heavy or discriminator roles.
Performance across tasks:
| Model | AIME2024 | MATH-500 | LiveCodeBench | GPQA-Diamond |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 70.9 | 95.8 | 57.0 | — |
| Qwen3-235B-A22B-distill | 79.4 | 93.9 | 59.6 | — |
| AM-Thinking-v1-distill | 84.3 | 98.4 | 65.9 | — |

Relative to its scale, DeepSeek-R1-Distill-Qwen-1.5B is competitive in logical reasoning and mathematical correctness.
3. Training Data Curation and Quality
The dataset for distillation includes high-quality, diversity-enhanced reasoning traces with rigorous verification: mathematical answers are checked via exact or epsilon-bound matching, code responses are validated using sandboxed test cases, and open QA traces are filtered through scoring reward models (Zhao et al., 25 Mar 2025). This data is structured with explicit <think> and <answer> blocks, and often includes metadata and LaTeX-marked derivations, which are crucial for inducing stepwise, verifiable reasoning.
Dataset diversity across domains (math, code, science, general chat) and semantic deduplication further mitigate data contamination and n-gram redundancy. However, comparative studies show that, although DeepSeek-R1-derived traces are valuable, datasets distilled from more length-diverse and lower-perplexity teachers (e.g., AM-Thinking-v1) yield superior adaptation and benchmark performance (Tian et al., 20 May 2025).
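The mathematical-answer verification step can be illustrated with a simple exact-or-epsilon check. This is a hedged sketch of the general idea only; the cited pipeline's actual verifiers also cover symbolic normalization and sandboxed code execution.

```python
def math_answer_matches(predicted: str, reference: str, eps: float = 1e-6) -> bool:
    """Accept a trace if the predicted answer matches the reference exactly,
    or numerically within an epsilon bound (simplified illustration)."""
    pred, ref = predicted.strip(), reference.strip()
    if pred == ref:                                   # exact match (e.g., symbolic answers)
        return True
    try:
        return abs(float(pred) - float(ref)) <= eps   # epsilon-bound numeric match
    except ValueError:
        return False                                  # non-numeric and not an exact match

# Quick sanity checks on the matching behavior.
assert math_answer_matches("4", "4")
assert math_answer_matches("0.3333333", "0.33333333", eps=1e-4)
assert not math_answer_matches("4", "5")
```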
4. Task Specialization, Efficiency, and Reward Strategies
DeepSeek-R1-Distill-Qwen-1.5B is particularly effective when used as a discriminator in planning frameworks: its structured chain-of-thought traces can be parsed for soft-scoring candidate outputs (e.g., in text-to-SQL ranking), giving it pronounced advantages even over upscaled general LLMs (Anjum, 30 Apr 2025). For generation tasks, especially those requiring fluent or concise open-ended outputs, model performance is more modest, and further refinement may be necessary.
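A hedged sketch of this discriminator pattern is given below: each candidate (here, a SQL query) is scored by prompting the model for a soft correctness score and parsing it from the <answer> block. The prompt template, score format, and `generate` callable are illustrative assumptions, not the cited work's exact protocol.

```python
import re

# Hypothetical prompt template asking the model to soft-score one candidate.
PROMPT = (
    "Question: {question}\n"
    "Candidate SQL: {candidate}\n"
    "Think step by step, then output a correctness score between 0 and 1 "
    "inside <answer></answer> tags."
)

def score_candidate(generate, question: str, candidate: str) -> float:
    """Soft-score one candidate; `generate` is any text-in/text-out callable
    wrapping the distilled model (assumed interface)."""
    completion = generate(PROMPT.format(question=question, candidate=candidate))
    match = re.search(r"<answer>\s*([01](?:\.\d+)?)\s*</answer>", completion)
    return float(match.group(1)) if match else 0.0  # unparsable output ranks last

def rank_candidates(generate, question: str, candidates: list[str]) -> list[str]:
    """Rank candidate SQL queries by the model's soft score, best first."""
    return sorted(candidates,
                  key=lambda c: score_candidate(generate, question, c),
                  reverse=True)
```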
To address reasoning efficiency, several RL-based reward shaping techniques (LASER-D, HAPO, AutoThink, Flexible Realignment) have been applied post-distillation to reduce extraneous tokens and over-thinking (a minimal reward-shaping sketch follows below):
- LASER-D: Dynamically adjusts target reasoning lengths by difficulty bucket, penalizing lengthy outputs for easy tasks and rewarding concise correct traces. Achieves up to +6.1 accuracy on AIME2024 while reducing tokens by 63% (Liu et al., 21 May 2025).
- HAPO: Tracks minimal correct response lengths in training history, using a length reward combined with correctness so that new solutions are shorter and errors are not over-penalized. Delivers 49% response compression at <2% accuracy loss (Huang et al., 16 May 2025).
- AutoThink: Employs a minimal ellipsis prompt ("...") to stochastically invoke explicit reasoning only as needed, with multi-stage RL to learn difficulty-aware switching between fast (no CoT) and slow (CoT) modes, yielding a 6.4% relative accuracy gain and 52% reduction in token usage (Tu et al., 16 May 2025).
- Flexible Realignment: Combines training-time distillation and inference-time adapters to control alignment between the reference and efficiently-aligned models via logit fusion, achieving up to 54.63% token reduction (Zhu et al., 15 Jun 2025).
These frameworks support adaptation across deployment scenarios—concise reasoning for lightweight inference or verbose step-by-step analysis for complex queries.
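The sketch below illustrates the shared idea behind these length-aware rewards under simplified assumptions. It is a generic composite in the spirit of LASER-D's difficulty-dependent target lengths and HAPO's preference for shorter correct traces, not any single paper's exact formula.

```python
def length_shaped_reward(correct: bool, n_tokens: int, target_len: int,
                         alpha: float = 0.5) -> float:
    """Illustrative length-aware reward: correctness dominates, correct answers
    shorter than a (difficulty-dependent) target length earn a bonus, and overly
    long correct answers are mildly penalized."""
    if not correct:
        return 0.0  # never reward incorrect traces merely for being short
    # Positive when the trace is shorter than the target, negative when longer,
    # clipped so the length term never outweighs the correctness term.
    length_bonus = max(-1.0, min(1.0, (target_len - n_tokens) / target_len))
    return 1.0 + alpha * length_bonus

# Example: in an "easy" bucket with a 512-token target, a 200-token correct
# solution earns a higher reward than a 900-token one.
assert length_shaped_reward(True, 200, 512) > length_shaped_reward(True, 900, 512)
```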
5. Safety Considerations and Alignment in Multilingual Contexts
Distillation from DeepSeek-R1 into smaller backbones can degrade safety alignment, particularly in non-English (e.g., Chinese) settings, as measured by risk content identification and refusal tasks. Comparative evaluation using the CHiSafetyBench indicates accuracy declines up to 30% in discrimination-related metrics after distillation (Zhang et al., 18 Mar 2025).
To remediate this, post-distillation safety tuning on a dedicated 50,000-sample corpus of safety-critical instructions and chain-of-thought examples appreciably recovers the model's risk-content identification and refusal accuracy and reduces harm rates, while introducing minimal (<3%) reasoning performance loss. Safety-enhanced DeepSeek-R1-Distill-Qwen variants are open-sourced to encourage best practices in safe reasoning LLM deployment.
6. Evaluation Protocols and Reproducibility Challenges
Benchmarking DeepSeek-R1-Distill-Qwen-1.5B (and R1-style LLMs) is highly sensitive to seemingly minor changes in experimental setup: random seed, data versioning (e.g., image inclusion), prompt construction, and parallelism settings can each sway scores by several percentage points (Sun et al., 5 Jun 2025).
Best practices advocated include:
- Fixing seeds and reporting results over sufficient (N ≥ 64) repeated runs.
- Reporting confidence intervals for mean scores, e.g., $\bar{x} \pm z_{\alpha/2}\, s/\sqrt{N}$, to quantify error margins (a small computation sketch follows below).
- Full documentation of dataset versions, prompt ordering, and implementation choices.
This rigor is paramount given that performance spreads due to noise can sometimes rival true model differences, particularly for distillation-optimized models with higher evaluation variance.
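A minimal sketch of the recommended reporting is shown below, assuming per-run accuracies from N ≥ 64 seeded repetitions and a normal-approximation interval; the cited work's exact statistical procedure may differ.

```python
import math
import statistics

def mean_with_ci(run_scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Mean accuracy with a ~95% normal-approximation confidence interval:
    mean ± z * s / sqrt(N), computed over N repeated evaluation runs."""
    n = len(run_scores)
    mean = statistics.mean(run_scores)
    half_width = z * statistics.stdev(run_scores) / math.sqrt(n)
    return mean, half_width

# Example: 64 repeated runs of the same benchmark with different seeds
# (placeholder per-run accuracies, not measured values).
scores = [0.70, 0.72, 0.69, 0.71] * 16
mean, hw = mean_with_ci(scores)
print(f"Benchmark accuracy: {mean:.3f} ± {hw:.3f} (N={len(scores)})")
```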
7. Practical Deployment and Comparative Limitations
The core strengths of DeepSeek-R1-Distill-Qwen-1.5B are its parameter efficiency, cost-effective reasoning performance, and adaptability to domain-specific discrimination roles. However, its relative ranking versus baseline instruction-tuned models is task-dependent:
- Logical reasoning and discrimination: Outscores larger non-reasoning LLMs (e.g., +87% F1 and +3.7% execution accuracy over CodeLlama-13B in text-to-SQL) (Anjum, 30 Apr 2025).
- Generation tasks: May underperform, especially if brevity or open-ended creativity is paramount; further fine-tuning leveraging more length-diverse, higher-quality reasoning traces (AM-Thinking-v1-derived) can close the gap (Tian et al., 20 May 2025).
- Schema-constrained tasks: ThinkJSON RL pipelines combining GRPO with format- and structure-aware rewards enable robust schema adherence, at some accuracy cost relative to full-scale models (Agarwal et al., 18 Feb 2025); a minimal format-reward sketch follows this list.
- Safety: Post-distillation safety tuning is essential, especially for regulated or multilingual deployments.
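The format-reward idea behind such schema-constrained pipelines can be sketched as follows; the required keys and partial-credit scheme are illustrative assumptions in the spirit of ThinkJSON-style rewards, not the cited paper's exact reward.

```python
import json

REQUIRED_KEYS = {"name", "quantity", "unit"}  # hypothetical target schema

def schema_reward(completion: str) -> float:
    """Grant partial credit for parseable JSON and full credit only when all
    required keys are present (simplified format/structure reward)."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0        # not valid JSON at all
    if not isinstance(obj, dict):
        return 0.25       # parseable, but wrong top-level structure
    if REQUIRED_KEYS.issubset(obj.keys()):
        return 1.0        # structurally valid against the schema
    return 0.5            # valid JSON, but missing required fields

assert schema_reward('{"name": "bolt", "quantity": 4, "unit": "pcs"}') == 1.0
assert schema_reward("not json") == 0.0
```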
Empirical results show that with advanced reward shaping and careful data curation, DeepSeek-R1-Distill-Qwen-1.5B can serve both as an efficient solution for large-scale deployment and as a modular component in more complex planning or agentic frameworks.
References
- (DeepSeek-AI et al., 22 Jan 2025) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- (Zhao et al., 16 Feb 2025) Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis
- (Chen et al., 6 Mar 2025) An Empirical Study on Eliciting and Improving R1-like Reasoning Models
- (Zhao et al., 25 Mar 2025) 1.4 Million Open-Source Distilled Reasoning Dataset to Empower LLM Training
- (Zhang et al., 18 Mar 2025) Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts
- (Anjum, 30 Apr 2025) When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator
- (Tian et al., 20 May 2025) Not All Correct Answers Are Equal: Why Your Distillation Source Matters
- (Liu et al., 21 May 2025) Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
- (Huang et al., 16 May 2025) HAPO: Training LLMs to Reason Concisely via History-Aware Policy Optimization
- (Tu et al., 16 May 2025) Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
- (Wang et al., 14 Apr 2025) M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
- (Sun et al., 5 Jun 2025) Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design
- (Zhu et al., 15 Jun 2025) Flexible Realignment of LLMs
- (Agarwal et al., 18 Feb 2025) Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence
This suite of works reflects the central position of DeepSeek-R1-Distill-Qwen-1.5B as a baseline for efficient open-source reasoning-oriented LLMs and a substrate for further research in reward design, safety, and application-specific customization.