DeepSeek-Distill-Qwen-1.5B Overview
- DeepSeek-Distill-Qwen-1.5B is a dense 1.5B transformer model that distills complex chain-of-thought reasoning from larger DeepSeek-R1 models.
- It employs a multi-stage pipeline—with supervised fine-tuning, reinforcement learning, and rejection sampling—to optimize logical and mathematical reasoning.
- Empirical evaluations show substantial improvements on benchmarks like AIME 2024 and text-to-SQL tasks, achieving higher accuracy with lower computational costs.
DeepSeek-Distill-Qwen-1.5B is a dense, 1.5-billion-parameter transformer-based LLM produced via a multi-stage pipeline for distilling advanced chain-of-thought (CoT) reasoning capabilities from larger DeepSeek-R1 models into a compact, efficient architecture. It is derived from the Qwen2.5-Math-1.5B backbone and designed to maintain strong logical and mathematical reasoning performance with significantly lower inference costs and memory requirements than its larger teacher models.
1. Model Architecture and Distillation Pipeline
DeepSeek-Distill-Qwen-1.5B inherits the standard transformer architecture from Qwen2.5-Math-1.5B, preserving dense-attention stacks without introducing sparsity or mixture-of-experts (MoE) topologies (Zhao et al., 16 Feb 2025). The output probability distribution is formally given by

$$P_\theta(y \mid x) = \prod_{t=1}^{|y|} P_\theta(y_t \mid y_{<t}, x),$$

where $x$ denotes the input sequence and $\theta$ the model parameters.
The training pipeline to produce DeepSeek-Distill-Qwen-1.5B comprises:
- Initial Instruction Tuning: Supervised fine-tuning (SFT) on curated instruction datasets with a focus on mathematical reasoning, aligning the base Qwen2.5-Math-1.5B to complex, multi-step problem-solving tasks.
- Multi-Stage Distillation:
- Cold-Start SFT: The model is first exposed to thousands of highly-structured, human-aligned chain-of-thought traces, emphasizing readability and formatting consistency.
- Reinforcement Learning (GRPO):
DeepSeek-R1 trains with Group Relative Policy Optimization (GRPO), where the objective for a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ with associated rewards $\{r_i\}_{i=1}^{G}$ is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right) - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right)\right], \qquad A_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)},$$

where $\varepsilon$, $\beta$ are hyperparameters and $\pi_{\mathrm{ref}}$ is a reference policy (DeepSeek-AI et al., 22 Jan 2025). A minimal code sketch of the group-relative update appears after this list.
- Rejection Sampling & SFT: Outputs are filtered to collect clean, high-quality reasoning traces (around 800k samples), and a further SFT round aligns the distilled model on these filtered sequences.
- Distillation: The teacher data (the 800k DeepSeek-R1-generated traces) guides supervised distillation, transferring complex reasoning behaviors to the student Qwen-1.5B without further RL.
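To make the GRPO stage concrete, the following is a minimal sketch of the group-relative advantage and clipped surrogate, assuming per-sequence log-probabilities from the current, old, and reference policies are already available; the tensor shapes, the KL estimator, and the default `eps`/`beta` values are illustrative assumptions rather than the exact DeepSeek-R1 implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """logp_*: (G,) summed log-probs of G sampled outputs for one prompt;
    rewards: (G,) scalar rewards for those outputs."""
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current policy and the sampling (old) policy.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.min(ratio * adv, clipped * adv)

    # KL penalty toward the reference policy (an estimator choice, assumed here).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1

    # GRPO maximizes the surrogate minus the KL term, so minimize the negative.
    return -(surrogate - beta * kl).mean()

# Toy usage with random numbers standing in for real model log-probabilities.
G = 8
logp_old = torch.randn(G)
logp_new = logp_old + 0.05 * torch.randn(G)
logp_ref = logp_old.clone()
rewards = torch.tensor([1., 0., 0., 1., 1., 0., 0., 1.])
print(grpo_loss(logp_new, logp_old, logp_ref, rewards))
```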
A summary of the architecture and pipeline is:
| Stage | Data/Objective | Method |
|---|---|---|
| Instruction-Tuning | Math/logic-focused tasks | SFT on curated data |
| Teacher Training | Chain-of-thought, correct formatting | RL (GRPO) + SFT on outputs |
| Distillation | 800k teacher traces | SFT on distilled dataset |
2. Reasoning Capabilities and Efficiency
The distilled model exhibits significant gains on mathematical and logical reasoning benchmarks relative to base models of similar size (Zhao et al., 16 Feb 2025, Anjum, 30 Apr 2025). For example, the distillation process can yield up to a 178.74% improvement on complex math tasks (in coverage-normalized benchmark scores) over the original Qwen2.5-Math-1.5B (Zhao et al., 16 Feb 2025). On the AIME 2024 benchmark, targeted RL fine-tuning after distillation further improves accuracy from ~29% to 39.33% (Chen et al., 6 Mar 2025). The model supports detailed step-by-step CoT traces, with output-format and language-mixing issues largely corrected by the cold-start data and training curriculum.
Notably, DeepSeek-Distill-Qwen-1.5B can outperform non-reasoning models 5–10 times its size when used as a discriminator within LLM planning frameworks. For instance, it achieves up to 87% higher macro F1 and a 3.7% higher discrimination accuracy than CodeLlama-7B in text-to-SQL candidate evaluation, and 3.7% higher execution accuracy than CodeLlama-13B (Anjum, 30 Apr 2025). The model generates structured, introspective chains-of-thought and supports JSON-formatted answer extraction, facilitating soft-score computation for candidate ranking.
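The candidate-ranking workflow can be illustrated with a small, hypothetical sketch: the discriminator's generation is assumed to end in a JSON object with a `score` field (the schema, field name, and helper functions here are illustrative, not the paper's exact format), and SQL candidates are ranked by that soft score.

```python
import json
import re

def extract_score(model_output: str) -> float:
    """Parse the last JSON object in the generation and read its 'score' field;
    the {'score': ...} schema is an assumption for illustration."""
    blobs = re.findall(r"\{.*?\}", model_output, flags=re.DOTALL)
    for blob in reversed(blobs):
        try:
            return float(json.loads(blob).get("score", 0.0))
        except (json.JSONDecodeError, TypeError, ValueError):
            continue
    return 0.0

def rank_candidates(outputs: dict) -> list:
    """outputs maps candidate SQL -> discriminator reasoning text; returns
    candidates sorted by descending soft score."""
    scored = [(sql, extract_score(text)) for sql, text in outputs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

# Toy example with fabricated generations.
outputs = {
    "SELECT name FROM users;": 'step-by-step reasoning ... {"score": 0.91}',
    "SELECT * FROM user;": 'step-by-step reasoning ... {"score": 0.34}',
}
print(rank_candidates(outputs))
```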
3. Practical Training Strategies and Enhancement Techniques
Several recent methods further refine DeepSeek-Distill-Qwen-1.5B's reasoning efficiency and application profile.
- ShorterBetter (Efficient RL for Output Length): By introducing a Sample Optimal Length (SOL) reward, the model is discouraged from producing unnecessarily long (or "over-thought") reasoning traces. The RL reward function is:

$$r(y) = \alpha\,\mathbb{1}[y\ \text{correct}] - \beta\,\big|\ell(y) - \ell_{\mathrm{SOL}}\big|,$$

where $\ell_{\mathrm{SOL}}$ is the SOL and $\alpha$, $\beta$ are tuning parameters. This yields a 50–80% reduction in average output length with maintained or improved accuracy (Yi et al., 30 Apr 2025); a minimal sketch of such a length-aware reward appears after this list.
- LASER-D (Adaptive, Difficulty-aware Reward): The model receives correctness bonuses only when correct responses fall within a dynamically computed length threshold that is adjusted per problem-difficulty bucket. This results in a 63% reduction in token usage and a +6.1 score improvement on AIME 2024 (Liu et al., 21 May 2025).
- HAPO (History-Aware Policy Optimization): By tracking the minimum length of correct responses per problem through training epochs, the model uses a cosine-based reward function to encourage conciseness compared to historical bests, achieving up to 49% reduction in output length for DeepSeek-Distill-Qwen-1.5B with negligible accuracy loss (Huang et al., 16 May 2025).
- Flexible Realignment (TrRa and InRa): Training-time and inference-time mechanisms interpolate between base and aligned model logits (controlled by an interpolation parameter) to optimize the trade-off between alignment and reasoning depth. TrRa reduces token usage by up to 54.63% without performance degradation (Zhu et al., 15 Jun 2025).
- GRESO (Efficient RL via Selective Rollouts): Avoids rollout computation for prompts that consistently yield zero-variance (uninformative) reward groups, achieving up to 2.4× training speedup without accuracy degradation (Zheng et al., 2 Jun 2025).
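As referenced in the ShorterBetter item above, here is a minimal sketch of a SOL-style length-aware reward; `alpha`, `beta`, and the definition of the SOL as the length of the shortest correct response in a rollout group are illustrative assumptions consistent with the formula above, not the paper's exact recipe.

```python
def sol_reward(is_correct: bool, length: int, sol_length: int,
               alpha: float = 1.0, beta: float = 0.001) -> float:
    """Reward correctness while penalizing deviation from the SOL,
    here taken as the length of the shortest correct sample in the group."""
    correctness = alpha if is_correct else 0.0
    length_penalty = beta * abs(length - sol_length)
    return correctness - length_penalty

# Toy rollout group: the SOL is the shortest correct response's length.
lengths = [812, 455, 1290]
correct = [True, True, False]
sol = min(l for l, c in zip(lengths, correct) if c)  # 455
print([round(sol_reward(c, l, sol), 3) for l, c in zip(lengths, correct)])
```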
4. Applications and Empirical Performance
DeepSeek-Distill-Qwen-1.5B is primarily targeted at reasoning-intensive domains:
- Mathematical Problem Solving: Outperforms prior 1.5B models on MATH500, AIME 2024, and OlympiadBench, reaching up to 39.33% on AIME 2024 with RL fine-tuning (Chen et al., 6 Mar 2025). When further distilled with reinforcement objectives that exploit both positive and negative reasoning traces (e.g., REDI), it reaches 83.1% on MATH500 using significantly less data (Xu et al., 30 May 2025).
- Agentic and Planning Frameworks: Serves as a highly effective discriminator (not generator) in text-to-SQL LLM planning, providing fine-grained candidate rankings via introspective chain-of-thought and confidence-calibrated JSON scores (Anjum, 30 Apr 2025).
- Schema-Constrained Generation: RL+SFT pipelines (e.g., ThinkJSON) enable the model to enforce strict adherence to output schemas (e.g., JSON) for regulated industrial tasks, achieving syntactic validity as well as semantic matching (Agarwal et al., 18 Feb 2025).
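A minimal, hypothetical sketch of schema-constrained output checking in the spirit of such pipelines: the model's JSON answer is validated against a schema using the `jsonschema` package; the schema itself and the `check_output` helper are illustrative assumptions, not an API from the cited work.

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

ANSWER_SCHEMA = {  # illustrative schema, not a schema from the cited work
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["answer", "confidence"],
}

def check_output(generation: str) -> bool:
    """Return True only if the generation is valid JSON and conforms to the schema."""
    try:
        validate(instance=json.loads(generation), schema=ANSWER_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(check_output('{"answer": "42", "confidence": 0.87}'))  # True
print(check_output('{"answer": "42"}'))                      # False: missing field
```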
The model, however, generally underperforms on broader natural language tasks (e.g., text understanding, open-ended generation), maintaining "D" or "C" tier ratings in A-Eval's diverse task suite after distillation, compared to larger models (Zhao et al., 16 Feb 2025).
5. Limitations, Evaluation Methodology, and Alignment
Evaluation Sensitivity and Reproducibility
Benchmark results for DeepSeek-Distill-Qwen-1.5B are sensitive to seed initialization, sample count, dataset variants, and prompt formatting, with fluctuations of several percentage points common across repeated trials. For reliable reporting, a rigorous evaluation regime is prescribed: choosing sample counts large enough to meet the desired confidence intervals and error margins (e.g., $N \ge z^2\,\hat{p}(1-\hat{p})/E^2$ under a normal approximation), and explicitly documenting all parameters (seed, $N$, dataset variant, instruction position, tensor-parallel settings) (Sun et al., 5 Jun 2025).
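For illustration, the standard normal-approximation binomial bound gives the sample count needed for a target margin of error; this is a generic statistics sketch, not necessarily the cited paper's exact prescription.

```python
import math

def required_samples(p_hat: float, margin: float, z: float = 1.96) -> int:
    """N >= z^2 * p(1 - p) / E^2 under a normal-approximation binomial CI."""
    return math.ceil(z * z * p_hat * (1 - p_hat) / (margin ** 2))

# e.g. ~39% AIME-style accuracy, +/- 3 percentage points at 95% confidence
print(required_samples(0.39, 0.03))  # about 1016 evaluation samples
```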
Scaling Behavior and Role Assignment
While the distilled model closes much of the reasoning gap to much larger non-reasoning models, there remain strict upper limits to its chain-of-thought depth; gains plateau beyond 1024 tokens in context, and excessive input sometimes yields redundant or degraded outputs (Anjum, 30 Apr 2025). The model's optimal deployment role is as a discriminator/ranker rather than a generator of domain solutions.
Alignment and Transparency
Mechanistic probing reveals that deception or misalignment signals are not easily separable in 1.5B models but become more detectable at larger scales. For DeepSeek-Distill-Qwen-1.5B, linear probe accuracy for deception detection is at chance, whereas larger reasoning-focused models show >90% accuracy (Boxo et al., 27 Aug 2025). This indicates challenges for direct interpretability and instrumentation at smaller scales.
6. Comparative Models and Design Trade-offs
- M1 Mamba RNNs: M1 (an RNN-based, Mamba-architecture model) achieves similar reasoning accuracy to DeepSeek-Distill-Qwen-1.5B but with >3× test-time throughput, allowing increased sample generation for self-consistency voting under fixed compute budgets (a minimal voting sketch appears after this list); this is an alternative path toward efficient reasoning LLMs (Wang et al., 14 Apr 2025).
- DistilQwen2.5 Series: Extensive multi-agent data augmentation, model fusion, and top-K teacher logit alignment produce models competitive in instruction following and coding, especially where compute constraints are primary (Wang et al., 21 Apr 2025).
- SATURN and Curriculum RL: Using SAT-based curriculum RL, DeepSeek-Distill-Qwen-1.5B can be continually pushed to improve logical reasoning, with measured +14.0 average pass@3 gains on SAT and +4.9 on math/programming benchmarks (Liu et al., 22 May 2025).
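As noted in the M1 item above, higher throughput mainly pays off through more samples per compute budget for self-consistency voting; the following is a minimal, generic sketch of majority voting over extracted final answers (the answer strings are toy placeholders).

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers extracted from N sampled CoT traces."""
    return Counter(answers).most_common(1)[0][0]

# With ~3x throughput, the same compute budget yields ~3x the votes.
print(self_consistency(["42", "41", "42", "42", "39", "42"]))  # "42"
```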
7. Broader Research Directions
DeepSeek-Distill-Qwen-1.5B demonstrates that distilled reasoning models can achieve strong practical performance for targeted domains, especially as discriminators and evaluators in agentic workflows. Ongoing research focuses on:
- Offline reinforcement distillation that leverages negative as well as positive teacher traces (e.g., REDI) to maximize the efficiency of knowledge transfer (Xu et al., 30 May 2025).
- Adaptive reward shaping (ShorterBetter, LASER-D, HAPO) for efficient, non-redundant reasoning without manual intervention (Yi et al., 30 Apr 2025, Huang et al., 16 May 2025, Liu et al., 21 May 2025).
- Mechanistic alignment tools for real-time monitoring of model behavior in high-stakes deployments (Boxo et al., 27 Aug 2025).
- Flexible, on-demand realignment at inference or training to permit both "fast" and "slow" thinking as task demands vary (Zhu et al., 15 Jun 2025).
In sum, DeepSeek-Distill-Qwen-1.5B embodies the current synthesis of efficiency, scaled chain-of-thought reasoning, and systematic model distillation in open-source LLM research, with a rapidly evolving suite of methods to address reasoning efficiency, task-specific adaptation, and rigorous empirical validation.