
DistilQwen2.5: Efficient Instruction-Tuned LLMs

Updated 20 February 2026
  • DistilQwen2.5 is a family of instruction-tuned, lightweight models derived from Qwen2.5 using multi-agent knowledge distillation and white-box model fusion.
  • The models, ranging from 0.5B to 7B parameters, achieve enhanced computational efficiency, reduced deployment costs, and improved instruction-following performance.
  • They demonstrate practical gains in real-world use cases such as SQL completion and enterprise pipelines, while maintaining strong scores across standard evaluation benchmarks.

DistilQwen2.5 denotes a family of instruction-tuned lightweight LLMs derived through industrially motivated distillation from the public Qwen2.5 series. The models are architected for enhanced computational efficiency, reduced deployment costs, and superior instruction-following ability relative to original checkpoints of comparable size. DistilQwen2.5 integrates black-box multi-agent knowledge distillation for data augmentation with a computationally optimized white-box model fusion stage, resulting in open-source models ranging from 0.5B to 7B parameters that outperform their non-distilled counterparts on a wide spectrum of evaluation tasks (Wang et al., 21 Apr 2025).

1. Model Architecture and Distillation Sources

The DistilQwen2.5 family begins with Qwen2.5–Instruct models at four scales: 0.5B (24 layers, hidden size 1024), 1.5B (36, 2048), 3B (48, 3072), and 7B (56, 4096) parameters. The distillation process strategically utilizes larger Qwen2.5–Instruct models (14B, 32B, and 72B parameters) as white-box teachers to infuse smaller models with fine-grained hidden knowledge via model fusion. All resulting models, code, and datasets are made available through open-source channels, facilitating broad accessibility and research (Wang et al., 21 Apr 2025).

2. Multi-Agent Black-Box Knowledge Distillation Pipeline

Central to DistilQwen2.5 is a sophisticated instruction–response (I-R) data construction protocol employing multiple proprietary and public LLMs as black-box teachers:

  • Seed Formation: ~2M I-R pairs are aggregated from OpenHermes-2.5, Cleaned Alpaca, LCCD, and in-house sources, processed by deduplication, normalization, and length filtering.
  • Multi-Agent Expansion:
    • Expansion agent generates 3–5 paraphrases per instruction while preserving task categories.
    • Rewriting agent crafts tight paraphrases with Chain-of-Thought (CoT) styles for reasoning and code.
    • Selection agent filters by informativeness and task-balance, retaining the top 60% per category.
    • Verification agent conducts factuality checks via API calls or knowledge bases.
  • Outcome: Final black-box KD dataset of ~4M rigorously verified I-R pairs (Wang et al., 21 Apr 2025).

This protocol supports robust generalization and instruction-following through carefully curated and diversified training data.
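The selection-agent step above (retain the top 60% of instructions per task category by informativeness) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `category` and `score` fields, and the function name, are hypothetical, and the informativeness score is assumed to be precomputed by an upstream agent.

```python
from collections import defaultdict

def select_top_fraction(pairs, fraction=0.6):
    """Keep the top `fraction` of instruction-response pairs per task
    category, ranked by a precomputed informativeness score
    (a sketch of the selection-agent step of the black-box KD pipeline)."""
    by_category = defaultdict(list)
    for pair in pairs:
        by_category[pair["category"]].append(pair)
    kept = []
    for category, items in by_category.items():
        # Rank within the category so that filtering preserves task balance.
        items.sort(key=lambda p: p["score"], reverse=True)
        cutoff = max(1, int(len(items) * fraction))
        kept.extend(items[:cutoff])
    return kept
```

Grouping by category before ranking is what preserves task balance: a category with many low-scoring pairs still contributes its best 60%, rather than being crowded out by globally higher-scoring categories.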

3. Distillation Objectives and Model Fusion

DistilQwen2.5 employs a two-stage training objective: supervised cross-entropy loss during black-box KD and a KL-divergence-based white-box fusion loss.

  • Black-Box KD (Supervised Learning):

L_{CE}(\theta) = -\mathbb{E}_{(x, y)} \sum_{n=1}^{L} \log p_S^\theta(y_n \mid y_{<n}, x)

  • White-Box Model Fusion (Teacher-Student KL):

    • At each token position, the student’s prediction is encouraged to match the teacher’s distribution over the top-K logits (K = 10, temperature T = 2):

    D_\theta(x, y) = \frac{1}{L} \sum_{n=1}^{L} \mathrm{KL}\big(p_T(\cdot \mid y_{<n}, x) \,\|\, p_S(\cdot \mid y_{<n}, x)\big)

    L_{KD}(\theta) = \mathbb{E}_{(x, y)}\, D_\theta(x, y)

    L_{\text{total}}(\theta) = \alpha\, L_{CE}(\theta) + \beta\, L_{KD}(\theta)

    with α = 1.0 and β = 1.0.

The model fusion pipeline decouples expensive teacher forward passes via offline storage of top-K logits, achieves 3×–5× acceleration in KD efficiency for models ≥32B, and maintains ≥99% of distillation effectiveness (Wang et al., 21 Apr 2025).
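A pure-Python sketch of the per-position fusion term, under the stated setup: the teacher's top-K logits are cached offline, temperature T = 2 is applied to both sides, and (an assumption on our part) both distributions are renormalized over the shared top-K support before computing KL. Function names and the data layout are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def topk_kl(teacher_topk, student_logits, temperature=2.0):
    """KL(p_T || p_S) at one token position, restricted to the teacher's
    cached top-K entries, given as (vocab_index, logit) pairs."""
    indices = [i for i, _ in teacher_topk]
    p_t = softmax([z for _, z in teacher_topk], temperature)
    p_s = softmax([student_logits[i] for i in indices], temperature)
    return sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)

def fusion_loss(teacher_cache, student_logits_seq, temperature=2.0):
    """Sequence-level D_theta(x, y): mean of the per-position KL terms."""
    terms = [topk_kl(t, s, temperature)
             for t, s in zip(teacher_cache, student_logits_seq)]
    return sum(terms) / len(terms)
```

The total objective would then weight this against the cross-entropy term as alpha * L_CE + beta * L_KD with alpha = beta = 1.0. Because `teacher_cache` holds only (index, logit) pairs, the expensive teacher forward passes can run once offline, which is the source of the reported KD speedup.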

4. Training Strategy and Optimization

Fine-tuning is performed on 8× NVIDIA A800 (80GB) GPUs using the AdamW optimizer (weight decay 0.01, learning rate 1 × 10⁻⁵ with linear warmup and cosine decay). Each stage (black-box KD and fusion) lasts 3 epochs with batch sizes of 128 sequences (dynamic length bucketing), employing FP16 mixed precision and NVIDIA’s APEX library.
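The warmup-then-cosine schedule above can be sketched as a step-indexed function. The warmup length and the decay-to-zero floor are illustrative assumptions; only the peak learning rate of 1e-5 and the linear-warmup/cosine-decay shape come from the source.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Learning rate at a given optimizer step: linear warmup to
    peak_lr over warmup_steps, then cosine decay toward zero."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At `step == warmup_steps` the schedule sits exactly at the peak, and it reaches zero at `total_steps`, matching the standard warmup-cosine recipe.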

5. Benchmarking and Experimental Results

DistilQwen2.5 advances instruction-following, dialog, and dynamic interaction performance across several academic benchmarks. Table 1 summarizes improvements over the baseline Qwen2.5 models:

| Model | AlpacaEval | MT-Bench (Full) | MT-Bench (Single) | IFEval (Loose) | IFEval (Strict) |
|---|---|---|---|---|---|
| Qwen2.5-0.5B | 2.46 | 5.49 | 6.26 | 42.81 | 30.31 |
| DistilQwen2.5-0.5B | 4.89 | 5.78 | 6.83 | 52.61 | 37.82 |
| Qwen2.5-1.5B | 6.69 | 7.09 | 7.66 | 55.40 | 40.11 |
| DistilQwen2.5-1.5B | 13.69 | 7.35 | 7.99 | 61.10 | 74.49 |
| Qwen2.5-3B | 17.98 | 7.92 | 8.40 | 61.18 | 74.58 |
| DistilQwen2.5-3B | 20.91 | 8.37 | 8.97 | 67.03 | 77.36 |
| Qwen2.5-7B | 31.43 | 8.52 | 8.83 | 81.53 | 72.10 |
| DistilQwen2.5-7B | 34.86 | 8.76 | 9.22 | 83.48 | 73.27 |

DistilQwen2.5 achieves notable task-specific gains (e.g., in writing, reasoning, and extraction on MT-Bench) for the smaller student sizes. Top-K teacher-logit caching additionally accelerates the distillation process by 3×–5× (Wang et al., 21 Apr 2025).

6. Practical Use Cases and Deployment

DistilQwen2.5 models have demonstrated effectiveness in real-world domains:

  • Big Data Platform SQL Completion: DistilQwen2.5-3B achieves 2.6× latency reduction (148 ms) versus Qwen2.5-7B (384 ms) with near-identical Pass@1 and adoption rates.
  • Enterprise Pipelines: The Knowledge Production Pipeline (KPP) streamlines instruction data ingestion and curation, while the Distillation Training Pipeline (DTP) orchestrates distributed KD and evaluation across cloud infrastructure (Wang et al., 21 Apr 2025).

These capabilities enable deployment in latency-sensitive, resource-constrained, or cost-sensitive settings where larger LLMs are impractical.
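The 2.6× figure in the SQL-completion case follows directly from the two reported latencies; a trivial check (the helper name is ours):

```python
def speedup(baseline_ms, optimized_ms):
    """Latency speedup factor: baseline latency over optimized latency."""
    return baseline_ms / optimized_ms

# Qwen2.5-7B (384 ms) vs. DistilQwen2.5-3B (148 ms): 384 / 148 ≈ 2.6
ratio = speedup(384, 148)
```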

7. Position Relative to Model Compression and Quantization Research

DistilQwen2.5 complements other efficient LLM paradigms such as iterative layer-wise distillation—where explicit per-layer pruning and recovery allows for further parameter reduction with minimal quality loss (Kovalev et al., 7 Nov 2025)—and extreme quantization approaches (e.g., BitNet Distillation to 1.58-bit models (Wu et al., 15 Oct 2025)). An implication is that DistilQwen2.5’s methodology, especially its rigorous task-driven data augmentation and model fusion, may serve as a best-practice template for broader instruction-tuning and resource-aware LLM development in both academic and industrial environments.

