DeepSeek-R1-Distill-Qwen-1.5B Model

Updated 22 September 2025
  • The paper presents a 1.5B dense language model distilled via supervised fine-tuning from DeepSeek-R1, inheriting advanced chain-of-thought reasoning capabilities.
  • It demonstrates significant performance gains on benchmarks like AIME2024 and MATH-500, outperforming larger instruction-tuned models in logical and mathematical tasks.
  • The model leverages 800,000 high-quality reasoning traces and innovative reward shaping techniques to achieve efficient, parameter-optimized reasoning for diverse applications.

DeepSeek-R1-Distill-Qwen-1.5B is a 1.5-billion-parameter dense LLM constructed via supervised distillation from the DeepSeek-R1 teacher, itself a reinforcement-learning-optimized reasoning LLM. This model, built on the Qwen2.5 backbone, is engineered to inherit advanced chain-of-thought reasoning abilities suitable for competitive mathematical, logical, and code tasks, while being sufficiently compact for resource-limited deployment. Its training, evaluation, and subsequent optimization illustrate core trends in the RL-for-reasoning LLM paradigm, efficiency-oriented distillation, and fine-grained reward shaping.

1. Origin and Distillation Methodology

DeepSeek-R1-Distill-Qwen-1.5B is produced by distilling the reasoning behaviors and chain-of-thought outputs of the DeepSeek-R1 model into the Qwen2.5-1.5B backbone through supervised fine-tuning (SFT) (DeepSeek-AI et al., 22 Jan 2025). DeepSeek-R1 itself emerges from a pipeline consisting of a cold-start phase (high-quality, long CoT SFT), RL (predominantly GRPO-based), and further SFT, producing reasoning chains delimited by tags such as <think> and <answer>.

The distillation leverages approximately 800,000 high-quality reasoning traces generated from DeepSeek-R1, sourced from challenging mathematical, logical, and programmatic tasks. No additional RL is performed during distillation at the 1.5B scale; SFT alone suffices to transfer chain-of-thought paradigms and reasoning structure. The Qwen series is favored as the student backbone for its alignment properties and, specifically in Qwen2.5, robust base reasoning skills.
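
A minimal sketch of what this SFT stage could look like, using the Hugging Face TRL SFTTrainer; the dataset file, prompt template, and hyperparameters are illustrative assumptions rather than the original training recipe:

```python
# Sketch of SFT-based distillation: fine-tune a Qwen2.5-1.5B student on
# teacher-generated reasoning traces. Dataset path, prompt template, and
# hyperparameters are illustrative assumptions, not the published recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

student_id = "Qwen/Qwen2.5-1.5B"
model = AutoModelForCausalLM.from_pretrained(student_id)

# Hypothetical JSONL file of R1-generated traces with "problem" and "solution"
# fields, where "solution" already contains <think>...</think><answer>...</answer>.
traces = load_dataset("json", data_files="r1_reasoning_traces.jsonl", split="train")

def to_text(example):
    # Collapse prompt and teacher trace into the single "text" field SFTTrainer expects.
    return {"text": f"Problem: {example['problem']}\n{example['solution']}"}

trainer = SFTTrainer(
    model=model,
    train_dataset=traces.map(to_text),
    args=SFTConfig(
        output_dir="deepseek-r1-distill-qwen-1.5b-sft",
        num_train_epochs=2,
        per_device_train_batch_size=4,
        max_seq_length=4096,  # long enough to retain full chain-of-thought traces
    ),
)
trainer.train()
```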

2. Reasoning Capability and Performance Scaling

The primary contribution of DeepSeek-R1-Distill-Qwen-1.5B is the reliable transfer of complex multi-step reasoning to a parameter-efficient model. On reasoning benchmarks such as AIME2024 and MATH-500, the distilled 1.5B variant demonstrates substantial gains over math-focused or instruction-tuned baselines and even surpasses larger non-reasoning LLMs in discriminator-oriented tasks (Zhao et al., 16 Feb 2025, Anjum, 30 Apr 2025). For instance, distillation boosted Qwen2.5-Math-1.5B's scores on general tasks by 178.74%.

Empirical results in several works show that while scaling the model parameter count (e.g., from 1.5B up to 32B/70B) yields monotonic accuracy improvements per established scaling laws, the benefits of reasoning-driven SFT disproportionately elevate smaller models, particularly when the training data is curated for difficulty and diversity (Zhao et al., 16 Feb 2025). Notably, the DeepSeek-R1-Distill-Qwen-1.5B model can achieve better discrimination accuracy than much larger models, thus inverting scale-driven performance expectations in evaluation-heavy or discriminator roles.

Performance across tasks:

Model                            AIME2024   MATH-500   LiveCodeBench   GPQA-Diamond
DeepSeek-R1-Distill-Qwen-1.5B        70.9       95.8            57.0              —
Qwen3-235B-A22B-distill              79.4       93.9            59.6              —
AM-Thinking-v1-distill               84.3       98.4            65.9              —

Relative to its scale, DeepSeek-R1-Distill-Qwen-1.5B is competitive in logical reasoning and mathematical correctness.

3. Training Data Curation and Quality

The dataset for distillation includes high-quality, diversity-enhanced reasoning traces with rigorous verification: mathematical answers are checked via exact or epsilon-bound matching, code responses are validated using sandboxed test cases, and open QA traces are filtered through scoring reward models (Zhao et al., 25 Mar 2025). This data is structured with explicit <think> and <answer> blocks, and often includes metadata and LaTeX-marked derivations, crucial for inducing stepwise, verifiable reasoning.
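
A hedged sketch of the exact/epsilon-bound answer checking described above; the tag format follows the <think>/<answer> convention, while the function names and tolerance are illustrative assumptions:

```python
import math
import re

def extract_answer(trace: str):
    """Pull the final answer out of an R1-style trace wrapped in <answer> tags."""
    match = re.search(r"<answer>(.*?)</answer>", trace, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def verify_math(predicted, reference: str, eps: float = 1e-6) -> bool:
    """Accept an exact string match, or a numeric match within an epsilon bound."""
    if predicted is None:
        return False
    if predicted == reference:
        return True
    try:
        return math.isclose(float(predicted), float(reference), abs_tol=eps)
    except ValueError:
        return False

# Example: keep only traces whose extracted answer agrees with the gold label.
trace = "<think>2 + 2 = 4, so the result is 4.</think><answer>4.0</answer>"
print(verify_math(extract_answer(trace), "4"))  # True
```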

Dataset diversity across domains (math, code, science, general chat) and semantic deduplication further mitigate data contamination and n-gram redundancy. However, comparative studies show that, although DeepSeek-R1-derived traces are valuable, datasets distilled from more length-diverse and lower-perplexity teachers (e.g., AM-Thinking-v1) yield superior adaptation and benchmark performance (Tian et al., 20 May 2025).

4. Task Specialization, Efficiency, and Reward Strategies

DeepSeek-R1-Distill-Qwen-1.5B is particularly effective when used as a discriminator in planning frameworks: its structured chain-of-thought traces can be parsed for soft-scoring candidate outputs (e.g., in text-to-SQL ranking), giving it pronounced advantages even over upscaled general LLMs (Anjum, 30 Apr 2025). For generation tasks, especially those requiring fluent or concise open-ended outputs, model performance is more modest, and further refinement may be necessary.
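
As an illustration of this discriminator usage, the sketch below parses a generated trace and converts its verdict into a soft score for candidate ranking; the answer format and scoring rule are assumptions for illustration, not the protocol of the cited text-to-SQL work:

```python
import re

def soft_score(trace: str) -> float:
    """Turn a discriminator trace into a score in [0, 1].

    Illustrative rule: look for an 'x/10' rating inside the <answer> block,
    falling back to a yes/no verdict.
    """
    answer = re.search(r"<answer>(.*?)</answer>", trace, flags=re.DOTALL)
    if not answer:
        return 0.0
    verdict = answer.group(1).strip().lower()
    rating = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", verdict)
    if rating:
        return min(float(rating.group(1)) / 10.0, 1.0)
    return 1.0 if verdict.startswith("yes") else 0.0

# Rank hypothetical SQL candidates by the discriminator's soft score.
candidate_traces = {
    "SELECT name FROM users WHERE age > 30": "<think>filters correctly</think><answer>8/10</answer>",
    "SELECT * FROM users": "<think>misses the filter</think><answer>2/10</answer>",
}
best_sql = max(candidate_traces, key=lambda sql: soft_score(candidate_traces[sql]))
print(best_sql)
```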

To address reasoning efficiency, several RL-based reward-shaping techniques (LASER-D, HAPO, AutoThink, Flexible Realignment) have been applied post-distillation to reduce extraneous tokens and over-thinking (a schematic length-aware reward is sketched after the list):

  • LASER-D: Dynamically adjusts target reasoning lengths by difficulty bucket, penalizing lengthy outputs for easy tasks and rewarding concise correct traces. Achieves up to a +6.1-point accuracy gain on AIME2024 while reducing token usage by 63% (Liu et al., 21 May 2025).
  • HAPO: Tracks minimal correct response lengths in training history, using a length reward combined with correctness so that new solutions are shorter and errors are not over-penalized. Delivers 49% response compression at <2% accuracy loss (Huang et al., 16 May 2025).
  • AutoThink: Employs a simple “…” (ellipsis) prompt to stochastically invoke explicit reasoning only as needed, with multi-stage RL to learn difficulty-aware switching between fast (no CoT) and slow (CoT) modes, yielding a 6.4% relative accuracy gain and a 52% reduction in token usage (Tu et al., 16 May 2025).
  • Flexible Realignment: Combines training-time distillation and inference-time adapters to control alignment between the reference and efficiently-aligned models via logit fusion, achieving up to 54.63% token reduction (Zhu et al., 15 Jun 2025).
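
The sketch below illustrates the common idea behind these methods: couple correctness with a length term so that concise correct traces are preferred. The functional form and constants are illustrative assumptions, not the exact LASER-D or HAPO formulations:

```python
def length_aware_reward(is_correct: bool, length: int, target_length: int,
                        brevity_weight: float = 0.5) -> float:
    """Schematic length-aware reward.

    Correctness dominates; correct responses earn an extra bonus that decays
    linearly as they exceed a (difficulty-dependent) target length. This is a
    generic illustration, not the published LASER-D/HAPO reward.
    """
    if not is_correct:
        return 0.0  # never reward short-but-wrong answers
    overshoot = max(0, length - target_length)
    brevity_bonus = max(0.0, 1.0 - overshoot / target_length)
    return 1.0 + brevity_weight * brevity_bonus

# Easy problems would get a tighter target length than hard ones.
print(length_aware_reward(True, length=300, target_length=512))    # 1.5 (concise, correct)
print(length_aware_reward(True, length=2048, target_length=512))   # 1.0 (verbose, correct)
print(length_aware_reward(False, length=100, target_length=512))   # 0.0 (incorrect)
```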

These frameworks support adaptation across deployment scenarios—concise reasoning for lightweight inference or verbose step-by-step analysis for complex queries.

5. Safety Considerations and Alignment in Multilingual Contexts

Distillation from DeepSeek-R1 into smaller backbones can degrade safety alignment, particularly in non-English (e.g., Chinese) settings, as measured by risk-content identification and refusal tasks. Comparative evaluation on CHiSafetyBench indicates accuracy declines of up to 30% on discrimination-related metrics after distillation (Zhang et al., 18 Mar 2025).

To remediate this, post-distillation safety tuning on a dedicated 50,000-sample corpus of safety-critical instructions and chain-of-thought examples appreciably restores the model’s risk aversion and refusal accuracy and reduces harm rates, while introducing minimal (<3%) reasoning performance loss. Safety-enhanced DeepSeek-R1-Distill-Qwen variants are open-sourced to encourage best practices in safe reasoning-LLM deployment.

6. Evaluation Protocols and Reproducibility Challenges

Benchmarking DeepSeek-R1-Distill-Qwen-1.5B (and R1-style LLMs) is highly sensitive to seemingly minor changes in experimental setup: random seed, data versioning (e.g., image inclusion), prompt construction, and parallelism settings can each sway scores by several percentage points (Sun et al., 5 Jun 2025).

Best practices advocated include:

  • Fixing seeds and reporting results over a sufficient number (N ≥ 64) of repeated runs.
  • Reporting confidence intervals for mean scores, using \mu \pm z_{\alpha/2} \cdot (s/\sqrt{N}) to quantify error margins (a short computation sketch follows this list).
  • Full documentation of dataset versions, prompt ordering, and implementation choices.
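
A short sketch of that interval computation over repeated benchmark runs; the scores below are synthetic placeholders for illustration only:

```python
# Mean benchmark score over N repeated runs with a normal-approximation
# confidence interval, as recommended above. Scores are synthetic.
import math
import random
from statistics import mean, stdev

random.seed(0)
scores = [random.gauss(83.0, 2.0) for _ in range(64)]  # e.g., N = 64 repeated runs

n = len(scores)
mu = mean(scores)
s = stdev(scores)
z = 1.96                                 # z_{alpha/2} for a 95% interval
margin = z * s / math.sqrt(n)

print(f"{mu:.2f} +/- {margin:.2f} (95% CI over {n} runs)")
```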

This rigor is paramount given that performance spreads due to noise can sometimes rival true model differences, particularly for distillation-optimized models with higher evaluation variance.

7. Practical Deployment and Comparative Limitations

The core strengths of DeepSeek-R1-Distill-Qwen-1.5B are its parameter efficiency, cost-effective reasoning performance, and adaptability to domain-specific discrimination roles. However, its relative ranking versus baseline instruction-tuned models is task-dependent:

  • Logical reasoning and discrimination: Outscores larger non-reasoning LLMs (e.g., +87% F1 and +3.7% execution accuracy over CodeLlama-13B in text-to-SQL) (Anjum, 30 Apr 2025).
  • Generation tasks: May underperform, especially if brevity or open-ended creativity is paramount; further fine-tuning leveraging more length-diverse, higher-quality reasoning traces (AM-Thinking-v1-derived) can close the gap (Tian et al., 20 May 2025).
  • Schema-constrained tasks: ThinkJSON RL pipelines combining GRPO with format and structure-aware rewards enable robust schema adherence, at some accuracy cost relative to full-scale models (Agarwal et al., 18 Feb 2025).
  • Safety: Post-distillation safety tuning is essential, especially for regulated or multilingual deployments.

Empirical results show that, with advanced reward shaping and careful data curation, DeepSeek-R1-Distill-Qwen-1.5B can serve both as an efficient solution for large-scale deployment and as a modular component in more complex planning or agentic frameworks.
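
For reference, a minimal inference sketch using the openly released checkpoint via Hugging Face Transformers; the prompt and sampling settings are illustrative rather than tuned recommendations:

```python
# Load the public distilled checkpoint and generate a reasoning trace.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             device_map="auto")

messages = [{"role": "user",
             "content": "Solve: if 3x + 7 = 22, what is x? Reason step by step."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)

# The model emits a <think> ... </think> block followed by the final answer.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True,
                         temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```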


This suite of works reflects the central position of DeepSeek-R1-Distill-Qwen-1.5B as a baseline for efficient open-source reasoning-oriented LLMs and a substrate for further research in reward design, safety, and application-specific customization.
