
DeepSeek-R1-Distill-Qwen-1.5B Reasoning Model

Updated 1 January 2026
  • The paper presents a 1.5B dense model distilled from the RL-optimized DeepSeek-R1 teacher (a mixture-of-experts model with 37B activated parameters) using chain-of-thought supervision to internalize advanced reasoning strategies.
  • It employs history-aware optimizations (HAPO) and adaptive reasoning (AutoThink) to significantly reduce token usage while preserving competitive accuracy.
  • Advanced tool integration (CoRT) and additional RL fine-tuning boost both computational efficiency and benchmark performance on complex math and logic tasks.

DeepSeek-R1-Distill-Qwen-1.5B is a 1.5-billion-parameter dense LLM obtained by distillation, combining deep reinforcement learning in the teacher with advanced chain-of-thought (CoT) supervision, and purpose-built for efficient, high-fidelity mathematical and logical reasoning. It is built on the Qwen-1.5B Transformer architecture and distilled from the RL-optimized DeepSeek-R1 teacher, transferring reasoning strategies to a compact, cost-effective student model. This article details its architecture, training pipeline, efficiency adaptations, empirical performance, evaluation methodologies, and key technical insights.

1. Model Architecture and Distillation Origins

DeepSeek-R1-Distill-Qwen-1.5B ("R1-Distill-1.5B") is based on the Qwen-1.5B decoder-only Transformer. The Qwen base model comprises 24 blocks, 2048 hidden dimension, and 16 attention heads, with maximum context up to 4,096 tokens. No architectural modifications are introduced in the distillation step; gains arise purely from distillation and alignment protocols (DeepSeek-AI et al., 22 Jan 2025, Chen et al., 6 Mar 2025, Sun et al., 5 Jun 2025).

The teacher, DeepSeek-R1, is a large-scale mixture-of-experts model (671B total parameters, with roughly 37B activated per token), itself trained with multi-stage reinforcement learning (RL) and substantial supervised fine-tuning (SFT) on reasoning-oriented data. Chain-of-thought traces are generated by DeepSeek-R1 via rejection sampling, yielding diverse math, code, logic, and science examples that serve as the distillation corpus for the Qwen student (DeepSeek-AI et al., 22 Jan 2025).

2. Distillation and Reinforcement Learning Methodology

Sequence-Level Supervised Distillation

R1-Distill-1.5B is produced via supervised fine-tuning (SFT) on approximately 800,000 prompt–response pairs: roughly 600k reasoning chains (math, coding, logic) and 200k "non-reasoning" samples (writing, factual QA, translation) sourced from DeepSeek-V3 (DeepSeek-AI et al., 22 Jan 2025, Xu et al., 30 May 2025). The loss is standard cross-entropy over the token sequence produced by the teacher:

L_{CE}(\theta) = -\sum_{(x,y)\in D}\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})

The model tags outputs as |<reasoning>|...|</reasoning>| and |<summary>|...|</summary>|.
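
Concretely, the distillation step is ordinary next-token cross-entropy on teacher-written traces. The following PyTorch sketch assumes prompt tokens have been masked out of the labels with the conventional -100 index; it illustrates $L_{CE}$ above and is not the authors' training code.

```python
import torch.nn.functional as F

def sft_distillation_loss(student, input_ids, labels):
    """One sequence-level distillation step: cross-entropy on teacher-written tokens.

    `labels` equals `input_ids` with prompt positions set to -100, so only the
    teacher's reasoning and summary tokens contribute to the loss (mirrors L_CE).
    """
    logits = student(input_ids=input_ids).logits          # (batch, seq, vocab)
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```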

RL-Driven Teacher Training

DeepSeek-R1 (the teacher) is trained with group relative policy optimization (GRPO). The reward function combines correctness, trace formatting, and language consistency. The objective is:

J_{GRPO}(\theta) = E_{q,\, \{o_i\} \sim \pi_{\theta_{\text{old}}}} \bigg[ \frac{1}{G}\sum_{i=1}^G \min\big(r_i A_i,\; \operatorname{clip}(r_i, 1-\varepsilon, 1+\varepsilon)\, A_i\big) - \beta\, D_{KL}\big(\pi_\theta(\cdot \mid q)\,\|\,\pi_{\text{ref}}(\cdot \mid q)\big) \bigg],

with $r_i$ the policy ratio, $A_i$ the normalized advantage, and $\beta$ the KL penalty coefficient (DeepSeek-AI et al., 22 Jan 2025, Tu et al., 16 May 2025).
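
A minimal sketch of this objective for a single prompt with $G$ sampled responses is given below; the group-normalized advantage, clipped ratio, and KL penalty follow the formula, while the sequence-level KL estimator and all hyperparameter values are simplifying assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Group Relative Policy Optimization for one prompt (sketch).

    logp_new / logp_old / logp_ref: (G,) summed log-probabilities of each sampled
    response under the current, behavior, and reference policies; rewards: (G,).
    eps and beta are illustrative values, not the published settings.
    """
    # Group-normalized advantage A_i = (r_i - mean) / std over the G rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Policy ratio r_i and its clipped counterpart, as in the objective above.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv).mean()

    # Simple (sequence-level) KL penalty toward the reference policy.
    kl = (logp_new - logp_ref).mean()

    # The objective is maximized, so return its negative for gradient descent.
    return -(surrogate - beta * kl)
```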

No RL is applied to the 1.5B student during standard distillation, but further RL-based fine-tuning can refine reasoning (see Section 4).

3. Efficiency-Optimized Reasoning: History-Aware, Adaptive, and Tool-Aided Approaches

HAPO: History-Aware Policy Optimization

HAPO introduces a per-problem "history state" $h_i$, the minimum length among previously found correct traces for problem $i$. During RL, it shapes a reward that combines formal correctness with a history-conditioned length term:

r_{\mathrm{length}}(\ell_i^{(j)}, h_i) = \begin{cases} \max\left(f(\ell_i^{(j)}, h_i),\, c\right) & \text{if } a_i^{(j)} = a_i^* \\ \min\left(f(\ell_i^{(j)}, h_i),\, 0\right) & \text{if } a_i^{(j)} \neq a_i^* \\ 0 & \text{if } h_i \text{ is Null} \end{cases}

where $f(\ell, h) = \cos\left(\min\left(\frac{\pi}{2}\,\ell/h,\ \pi\right)\right)$ and $c = -0.7$ (Huang et al., 16 May 2025). HAPO cuts reasoning length by 33–59% at only a 2–5% accuracy drop.
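
The length reward translates almost directly into code. The sketch below assumes per-trace token counts and a per-problem history value maintained across rollouts; it mirrors the formula above but omits the correctness reward and the rest of the RL loop.

```python
import math

def hapo_length_reward(length, history, correct, c=-0.7):
    """History-aware length reward (sketch of the formula above).

    length: token count of the current trace; history: shortest previously
    correct trace for this problem, or None if no correct trace exists yet;
    correct: whether the current answer matches the reference answer.
    """
    if history is None:                 # h_i is Null: no length shaping yet
        return 0.0
    f = math.cos(min(math.pi / 2 * length / history, math.pi))
    return max(f, c) if correct else min(f, 0.0)

def update_history(history, length, correct):
    """Maintain h_i as the minimum length among correct traces seen so far."""
    if correct:
        return length if history is None else min(history, length)
    return history
```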

AutoThink: Adaptive Thinking Mode Selection

AutoThink enables R1-Distill-1.5B to dynamically switch between explicit reasoning and direct answers. Seeding the prompt with an ellipsis ("...") immediately after the opening reasoning tag induces stochastic mode selection, with the model itself choosing whether to elaborate. Multi-stage RL then rewards correct no-think responses on easy queries and explicit CoT on complex ones, using batch balancing and length-aware pruning in reward shaping. This reduces average token usage by 52% while yielding a relative accuracy gain of 6.4% (Tu et al., 16 May 2025).
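
The following sketch illustrates the general idea under stated assumptions: the tag name, the prompt template, the easy/hard split, and all reward weights are hypothetical, and the multi-stage RL schedule is not shown.

```python
def seed_prompt(question, open_tag="<reasoning>"):
    """Ellipsis seeding: append '...' right after the opening reasoning tag so
    the model itself decides whether to keep thinking or answer directly.
    The tag name and template are assumptions, not the released prompt."""
    return f"{question}\n{open_tag}..."

def autothink_reward(correct, used_thinking, is_easy, length, max_len=4096):
    """Illustrative reward shaping: favor correct direct answers on easy queries
    and correct explicit CoT on hard ones, with a mild length penalty.
    Thresholds and weights are hypothetical."""
    if not correct:
        return -1.0
    reward = 1.0
    if is_easy and not used_thinking:
        reward += 0.5        # bonus for skipping the trace when it is unnecessary
    if not is_easy and used_thinking:
        reward += 0.5        # bonus for elaborating on genuinely hard queries
    return reward - 0.1 * (length / max_len)   # length-aware pruning term
```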

Integrated Code Interpreter Use (CoRT)

CoRT (Code-Optimized Reasoning Training) extends the model's capability to leverage external code interpreters (CIs). A three-stage process of Hint-Engineering, RFT, and RL teaches the model to delegate complex computations to the CI where appropriate, while penalizing invalid or redundant tool usage. Across five math datasets, R1-Distill-1.5B achieves up to +8% absolute accuracy and a 50% token reduction versus pure natural-language reasoning (Li et al., 23 Oct 2025).
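
The interpreter hand-off can be pictured as a simple generate, execute, continue loop. In the sketch below, `model_generate` and `run_in_sandbox` are hypothetical helpers standing in for the LLM call and a sandboxed Python interpreter, the tool-call delimiters are assumed, and the stopping rule is a simplification of the trained behavior.

```python
import re

# Hypothetical tool-call delimiters; the actual CoRT prompt format may differ.
CODE_CALL = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def reason_with_interpreter(model_generate, run_in_sandbox, prompt, max_rounds=4):
    """Generate-execute-continue loop for code-integrated reasoning (sketch)."""
    transcript = prompt
    for _ in range(max_rounds):
        completion = model_generate(transcript)
        transcript += completion
        call = CODE_CALL.search(completion)
        if call is None:                        # no tool call: model answered in NL
            break
        result = run_in_sandbox(call.group(1))  # delegate the computation to the CI
        transcript += f"\n<interpreter>{result}</interpreter>\n"
    return transcript
```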

4. Empirical Performance and Benchmarking

Math, Reasoning, and Agentic Discrimination

On standard math benchmarks (AIME’24, MATH-500):

| Model | AIME'24 (pass@1) | MATH-500 (pass@1) | Codeforces rating |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-1.5B | 28.9% | 83.9% | 954 |
| QwQ-32B-Preview | 50.0% | 90.6% | 1316 |
| GPT-4o-0513 | 9.3% | 74.6% | 759 |

Table: Representative reasoning and code benchmarks (DeepSeek-AI et al., 22 Jan 2025).

On A-Eval-2.0 (Zhao et al., 16 Feb 2025):

  • Text understanding, extraction, and generation: tier D (<60)
  • Logical reasoning: tier C (60–70), outperforming non-distilled Qwen2.5-1.5B.

As a discriminator in LLM planning pipelines, R1-Distill-1.5B outperforms CodeLlama-13B on accuracy and F1 by up to 87%, despite its parameter disadvantage (Anjum, 30 Apr 2025).

Impact of RL Fine-Tuning

Further RL on R1-Distill-1.5B (STILL-3-1.5B) demonstrates substantial uplift, e.g., AIME 2024 pass@1 increases from 28.67% to 39.33% (+10.66 pp, +37%) (Chen et al., 6 Mar 2025). Key levers include on-policy PPO, high rollout temperature, and dynamic KL annealing.
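
These levers can be collected into a configuration sketch; every value below is illustrative rather than the published STILL-3-1.5B recipe.

```python
from dataclasses import dataclass

@dataclass
class RLFinetuneConfig:
    # On-policy PPO: every update uses fresh rollouts from the current policy.
    algorithm: str = "ppo"
    rollout_temperature: float = 1.0   # higher temperature keeps exploration broad
    rollouts_per_prompt: int = 8
    clip_range: float = 0.2
    kl_coef_start: float = 1e-2        # dynamic KL annealing: start tight to SFT...
    kl_coef_end: float = 1e-3          # ...then relax so the policy can improve further
    total_steps: int = 2000

def kl_coef(cfg: RLFinetuneConfig, step: int) -> float:
    """Linearly anneal the KL coefficient over training (one possible schedule)."""
    frac = min(step / cfg.total_steps, 1.0)
    return cfg.kl_coef_start + frac * (cfg.kl_coef_end - cfg.kl_coef_start)
```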

Efficiency-Accuracy Tradeoffs

Under HAPO, token usage drops by 33–59% across GSM8K, MATH500, and AIME2024 at only 2–5% average accuracy loss (Huang et al., 16 May 2025). Flexible realignment strategies (TrRa, InRa) allow users to interpolate alignment/efficiency tradeoffs at both training and inference time (up to 55% fewer tokens at no accuracy loss) (Zhu et al., 15 Jun 2025). CoRT and AutoThink further boost efficiency by integrating tool usage and adaptive reasoning, respectively (Li et al., 23 Oct 2025, Tu et al., 16 May 2025).
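
One way to picture inference-time realignment is as a weighted fusion of two checkpoints' next-token logits; the sketch below captures that general mechanism under assumption and is not the exact TrRa/InRa formulation.

```python
import torch

@torch.no_grad()
def interpolated_next_token_logits(model_a, model_b, input_ids, lam=0.5):
    """Blend the next-token logits of two checkpoints, e.g. a full-reasoning
    model (model_a) and an efficiency-realigned one (model_b). lam = 0 recovers
    model_a, lam = 1 recovers model_b; intermediate values trade reasoning
    length against answer fidelity. The fusion rule itself is an assumption."""
    logits_a = model_a(input_ids=input_ids).logits[:, -1, :]
    logits_b = model_b(input_ids=input_ids).logits[:, -1, :]
    return (1.0 - lam) * logits_a + lam * logits_b
```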

5. Evaluation Protocols and Sources of Variance

Evaluation results for R1-Distill-1.5B are highly sensitive to protocol, sampling, and dataset version:

  • Pass@1 accuracy fluctuations >10 percentage points can result from prompt placement, seed, and sample count (Sun et al., 5 Jun 2025).
  • Image rendering (figures in math datasets), MCQ option ordering, and tensor parallelism all affect reproducibility.
  • Reproducible reporting requires full disclosure of temperature, top_p, context length, N, dataset version, and seed settings, with confidence intervals, following the best-practice recommendations in (Sun et al., 5 Jun 2025); a minimal reporting sketch follows this list.
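
The sketch below shows what such reporting can look like: fixed decoding settings, N samples per problem, pass@1 averaged over problems, and a percentile-bootstrap confidence interval. All field values are placeholders.

```python
import random
import statistics

def mean_pass_at_1(correct_counts, n_samples):
    """Average pass@1 when each problem was sampled n_samples times."""
    return statistics.mean(c / n_samples for c in correct_counts)

def bootstrap_ci(correct_counts, n_samples, iters=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap over problems for the pass@1 estimate."""
    rng = random.Random(seed)
    rates = [c / n_samples for c in correct_counts]
    means = sorted(
        statistics.mean(rng.choice(rates) for _ in rates) for _ in range(iters)
    )
    return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]

# Disclose the full protocol alongside the numbers (placeholder values).
protocol = {
    "temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096,
    "n_samples_per_problem": 16, "dataset_version": "AIME-2024", "seeds": [0, 1, 2],
}
```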

6. Technical and Methodological Insights

  • History-aware and adaptive reasoning (HAPO, AutoThink) deliver strong length–accuracy tradeoffs not attainable by global constraints or per-query optimization (Huang et al., 16 May 2025, Tu et al., 16 May 2025).
  • Tool-use protocols (CoRT) resolve the inference-vs-computation tension, yielding both higher correctness and drastic efficiency improvements (Li et al., 23 Oct 2025).
  • As an agentic discriminator, R1-Distill-1.5B exploits fine-grained scoring from its CoT to outperform larger but non-reasoning models, a key leverage point for agentic planning architectures (Anjum, 30 Apr 2025).
  • Recent research affirms advanced distillation protocols (e.g., REDI) can squeeze equivalent reasoning power from open data with just 1/6th the training set size, challenging the closed-data status of DeepSeek-R1-Distill-Qwen-1.5B (Xu et al., 30 May 2025).
  • The model scaling law holds: at fixed data and architecture, larger models outperform smaller ones, but advanced fine-tuning and clustering can close part of the gap for small models (Zhao et al., 16 Feb 2025, Xu et al., 9 Nov 2025).

7. Limitations and Future Directions

  • R1-Distill-1.5B’s gains are sharply concentrated in logical reasoning; it is notably weak in text extraction, generation, and multi-domain generalization (Zhao et al., 16 Feb 2025).
  • Naive length rewards degrade performance through reward hacking; correct reward shaping and careful RL stabilization (annealed KL, multi-stage curricula) are essential (Chen et al., 6 Mar 2025, Tu et al., 16 May 2025).
  • Evaluation instability remains a core concern. Performance should be reported with statistical rigor across seeds, generations, and dataset versions (Sun et al., 5 Jun 2025).
  • Advances such as diversity-driven distillation, MaxEnt-guided RL, and granular subdomain tracking (SSP, MGPO) may allow further improvements at tiny scale (Xu et al., 9 Nov 2025).

In summary, DeepSeek-R1-Distill-Qwen-1.5B represents the current apex of distilled 1.5B-parameter reasoning models: it efficiently internalizes chain-of-thought skills from much larger RL-optimized teachers, supports sophisticated adaptive and tool-assisted reasoning, and achieves near–state-of-the-art performance for its size across a diverse set of reasoning tasks, albeit with significant limitations outside those domains (DeepSeek-AI et al., 22 Jan 2025, Huang et al., 16 May 2025, Anjum, 30 Apr 2025, Li et al., 23 Oct 2025, Xu et al., 30 May 2025, Chen et al., 6 Mar 2025, Zhu et al., 15 Jun 2025, Zhao et al., 16 Feb 2025, Xu et al., 9 Nov 2025, Sun et al., 5 Jun 2025).
