- The paper introduces a novel two-stage KD framework combining multi-agent data augmentation and efficient top-K white-box fusion to create lightweight LLMs.
- It demonstrates that distilled models from 0.5B to 7B parameters achieve improved instruction-following performance and significant speedups.
- Industrial cases, such as SQL completion and AI platform integration, validate the framework’s practical applicability and computational efficiency.
This paper introduces DistilQwen2.5, a family of open-source, lightweight LLMs derived from the Qwen2.5 series through knowledge distillation (KD) techniques aimed at industrial applications (2504.15027). The core problem addressed is the high computational cost and deployment difficulty of large LLMs, particularly in resource-constrained settings. The DistilQwen2.5 models (available in 0.5B, 1.5B, 3B, and 7B parameter sizes) demonstrate improved instruction-following capabilities compared to their original Qwen2.5 counterparts.
The methodology combines two stages of KD:
- Multi-Agent Data Augmentation (Black-Box KD): This stage uses powerful proprietary LLMs (like Qwen-max for Chinese and GPT-4/GPT-4o for other languages) acting as specialized agents coordinated by a controller. These agents perform:
  - Expansion: Generating diverse instruction variations while preserving the original task category.
  - Rewriting: Refining instructions and responses, crucially employing Chain-of-Thought (CoT) for complex reasoning tasks to enhance student model capabilities.
  - Selection: Choosing high-value, informative, and task-balanced instruction-response pairs for training.
  - Verification: Checking the factual correctness of generated data.
This augmented dataset, enriched with knowledge from teacher models, is used for initial supervised fine-tuning of the student models.
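The paper does not publish the agent prompts or controller code, but the pipeline can be pictured with a minimal Python sketch. Here `call_llm` is a hypothetical wrapper around a proprietary teacher API (e.g., Qwen-max or GPT-4o); all prompts, field names, and the simplified selection step are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a controller coordinating the augmentation agents.
# `call_llm` is a hypothetical helper that sends a prompt to a teacher LLM
# and returns its text output; prompts and data fields are illustrative only.
from typing import Callable

def augment_instruction_pool(
    seed_pairs: list[dict],          # [{"instruction": ..., "response": ..., "task": ...}, ...]
    call_llm: Callable[[str], str],  # assumption: returns the teacher model's text output
    n_expansions: int = 3,
) -> list[dict]:
    augmented = []
    for pair in seed_pairs:
        # Expansion agent: generate diverse variations within the same task category.
        variants = [
            call_llm(f"Rewrite this instruction differently, keeping the task type "
                     f"'{pair['task']}':\n{pair['instruction']}")
            for _ in range(n_expansions)
        ]
        for instr in [pair["instruction"], *variants]:
            # Rewriting agent: refine the response, using CoT for reasoning-heavy tasks.
            response = call_llm(
                f"Answer step by step (Chain-of-Thought) if reasoning is required:\n{instr}"
            )
            # Verification agent: keep only pairs judged factually correct.
            verdict = call_llm(
                f"Is this answer factually correct? Reply YES or NO.\n"
                f"Q: {instr}\nA: {response}"
            )
            if verdict.strip().upper().startswith("YES"):
                augmented.append({"instruction": instr, "response": response,
                                  "task": pair["task"]})
    # Selection agent (simplified): a real controller would also score informativeness
    # and re-balance task categories before returning the fine-tuning set.
    return augmented
```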
- Efficient Model Fusion (White-Box KD): Following black-box KD, this stage further refines the student models by transferring knowledge from the internal representations (logits) of larger white-box teacher models (Qwen2.5-Instruct 14B/32B/72B). To overcome industrial challenges like high GPU memory usage and vocabulary mismatches, the authors implement an efficient approach:
  - Offline Logits Generation: Pre-compute and store only the top-K (default K=10) logits and their probabilities from the teacher model for the training dataset. This leverages the observation that most probability mass is concentrated in the top few tokens.
  - Token Alignment: Handle potential vocabulary differences between teacher and student models.
  - Modified Divergence Loss: Calculate the KD loss (e.g., KLD) using only the top-K logits from the teacher and the corresponding logits from the student, significantly reducing computational and storage overhead. The probability calculation for the loss is normalized over the top-K logits:

    $$p_T = \frac{\exp(z_T/\tau)}{\sum_{k=1}^{K}\exp\big(z_T^{(k)}/\tau\big)}, \qquad p_S = \frac{\exp(z_S/\tau)}{\sum_{k=1}^{K}\exp\big(z_S^{(k)}/\tau\big)}$$

    where $z_T^{(k)}$ and $z_S^{(k)}$ are the teacher and student logits of the $k$-th retained (top-K) token, $z_T$ and $z_S$ are the logits of the token whose probability is being computed, and $\tau$ is the distillation temperature.
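A minimal PyTorch sketch of this top-K divergence loss follows, under the assumption that the teacher's top-K logits and their token indices were pre-computed offline and are already aligned to the student's vocabulary; function and tensor names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def topk_kd_loss(student_logits: torch.Tensor,      # [batch, seq, vocab]
                 teacher_topk_logits: torch.Tensor,  # [batch, seq, K], precomputed offline
                 teacher_topk_indices: torch.Tensor, # [batch, seq, K], token ids
                 temperature: float = 1.0) -> torch.Tensor:
    # Gather the student logits at the teacher's top-K token positions
    # (assumes teacher and student token ids are already aligned).
    student_topk = torch.gather(student_logits, dim=-1, index=teacher_topk_indices)

    # Normalize both distributions over the K retained logits only.
    p_teacher = F.softmax(teacher_topk_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_topk / temperature, dim=-1)

    # KL(p_T || p_S) over the top-K support, with the usual temperature scaling.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```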
Implementation Details:
- Student Models: Qwen2.5-Instruct 0.5B, 1.5B, 3B, 7B.
- Teacher Models (Black-Box): Qwen-max, GPT-4/GPT-4o.
- Teacher Models (White-Box): Qwen2.5-Instruct 14B, 32B, 72B.
- Datasets: Initial data from OpenHermes 2.5, Cleaned Alpaca, LCCD, and in-house datasets, augmented via the multi-agent pipeline.
- Training: Adam optimizer, learning rate 1×10⁻⁵, 3 epochs, on 8x A800 GPUs.
- Logits Generation: The optimized top-K approach yielded a 3x-5x speedup compared to vanilla full-logits generation without performance degradation.
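The offline top-K generation step can be sketched as follows, assuming a Hugging Face-style causal LM interface for the teacher; this is an illustrative reconstruction, not the authors' code.

```python
import torch

@torch.no_grad()
def precompute_topk_logits(teacher_model, input_ids: torch.Tensor, k: int = 10):
    """Run the teacher once over the training data and keep only the top-K
    logits per token position, so the full vocabulary distribution (and the
    teacher itself) never needs to be held in memory during student training."""
    outputs = teacher_model(input_ids=input_ids)   # assumption: HF-style causal LM
    logits = outputs.logits                        # [batch, seq, vocab]
    topk_values, topk_indices = torch.topk(logits, k, dim=-1)
    # These two tensors are stored to disk; retaining only K=10 entries per token
    # is what enables the reported speedup over full-logits generation.
    return topk_values, topk_indices
```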
Evaluation and Results:
- DistilQwen2.5 models consistently outperformed the original Qwen2.5 models across benchmarks like AlpacaEval 2.0 (length-controlled), MT-Bench, and IFEval.
- The efficient model fusion (white-box KD) stage provided additional performance gains over just using black-box KD.
- Smaller models (e.g., 0.5B) showed proportionally larger improvements from distillation compared to larger ones (e.g., 7B).
- Analysis revealed diminishing returns for white-box KD when using excessively large teacher models or very large datasets (beyond ~100K samples).
Industrial Use Cases:
- SQL Completion: The KD framework was used to distill a 7B Qwen2.5 SQL completion model into a 3B version. The 3B student model achieved nearly identical performance (Pass@1, Adoption Rate) in online A/B tests on a big data platform but with a 1.4x inference speedup.
- AI Platform Integration: The KD process (data augmentation and distillation training pipelines) was integrated into a cloud AI platform, enabling users to perform continual KD on DistilQwen2.5 models for their specific domains.
The paper concludes that the proposed industrial practices, combining multi-agent black-box KD and efficient top-K white-box KD, effectively create high-performing lightweight LLMs suitable for real-world deployment. The DistilQwen2.5 models and the KD framework are presented as valuable resources for practitioners.