Compact LLMs via Pruning and Knowledge Distillation
The paper "Compact LLMs via Pruning and Knowledge Distillation" by Muralidharan et al. addresses the computational challenges associated with training LLMs by proposing an alternative approach to producing variants of an LLM through structured pruning and subsequent knowledge distillation. The authors establish a set of best practices for effectively compressing large pre-trained models like Nemotron-4 15B into smaller, yet more efficient models, named Minitron.
The key contributions of the paper include:
- Empirical Exploration of Pruning Strategies: The authors extensively explore structured pruning strategies to understand how models respond to pruning along different axes, such as depth (number of layers), width (numbers of neurons, attention heads, and embedding channels), and combinations thereof. Their empirical study revealed non-trivial insights, notably that width pruning was generally more effective than depth pruning, and that combining width and depth pruning yielded superior results after a retraining step (see the sketch after this list).
- Lightweight Retraining through Knowledge Distillation: The paper highlights the effectiveness of knowledge distillation for lightweight retraining of pruned models. By using a combination of the teacher model's logits and intermediate states, the authors demonstrate significant improvements in accuracy recovery after pruning. This approach proved more compute- and data-efficient than conventional retraining.
- Cost-Efficiency in Model Compression: One of the standout results is the significant reduction in computational cost and data requirements for training smaller models. The Minitron models—derived from the 15B parameter Nemotron-4—achieved competitive performance with up to 40 times fewer training tokens compared to models trained from scratch. Notably, the Minitron 8B model outperformed several similarly-sized models, such as Mistral 7B and LLaMa-2 7B, demonstrating the practical benefits of the proposed methodology.
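To make the two pruning axes above concrete, here is a minimal PyTorch sketch, not the paper's implementation: width pruning trims neurons inside a block, while depth pruning drops whole blocks. `TinyBlock`, the layer names, and the kept indices are all illustrative.

```python
# Illustrative sketch of width vs. depth structured pruning (toy model, not Nemotron-4).
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """A toy MLP block used to illustrate width pruning."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim)
        self.down = nn.Linear(ffn_dim, hidden_dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def width_prune_mlp(block: TinyBlock, keep_idx: torch.Tensor) -> TinyBlock:
    """Width pruning: keep only the MLP neurons listed in `keep_idx`."""
    hidden_dim = block.up.in_features
    pruned = TinyBlock(hidden_dim, len(keep_idx))
    pruned.up.weight.data = block.up.weight.data[keep_idx].clone()
    pruned.up.bias.data = block.up.bias.data[keep_idx].clone()
    pruned.down.weight.data = block.down.weight.data[:, keep_idx].clone()
    pruned.down.bias.data = block.down.bias.data.clone()
    return pruned

def depth_prune(blocks: nn.ModuleList, keep_layers: list[int]) -> nn.ModuleList:
    """Depth pruning: drop entire blocks, keeping only the listed layer indices."""
    return nn.ModuleList([blocks[i] for i in keep_layers])

# Usage: keep 512 of 4096 neurons in one block, and half of the layers.
blocks = nn.ModuleList([TinyBlock(1024, 4096) for _ in range(8)])
pruned_block = width_prune_mlp(blocks[0], torch.arange(512))
pruned_stack = depth_prune(blocks, keep_layers=[0, 2, 4, 6])
```

In practice the kept indices and layers would be chosen by the importance estimation described in the next section, not fixed by hand as here.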
Pruning Methodology
The authors' pruning approach begins by computing the importance of each component (neurons, attention heads, embedding channels, and layers) within the model from a small batch of calibration data, using activation-based metrics. For instance, the importance of each attention head is estimated from its output activations. Depth pruning is informed by metrics such as perplexity and block importance.
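As a rough illustration of activation-based scoring on a calibration batch, the sketch below accumulates per-neuron activation norms via a forward hook. The L2-norm aggregation, the hook target, and the loader interface are assumptions for this example, not the paper's exact formulation.

```python
# Hedged sketch: score neurons of one layer by activation magnitude on calibration data.
import torch

@torch.no_grad()
def neuron_importance(model, calib_loader, layer, num_neurons, device="cpu"):
    """Score each output neuron of `layer` by the mean L2 norm of its
    activations over a small calibration set; higher score = more important."""
    scores = torch.zeros(num_neurons, device=device)
    seen = 0

    def hook(_module, _inp, out):
        nonlocal seen
        # out: (batch, seq, num_neurons) -> per-neuron L2 norm over all tokens
        flat = out.reshape(-1, out.shape[-1])
        scores.add_(flat.norm(dim=0))
        seen += 1

    handle = layer.register_forward_hook(hook)
    for batch in calib_loader:          # assumes each batch is an input tensor
        model(batch.to(device))
    handle.remove()
    return scores / max(seen, 1)

# Rank neurons and keep the top half for width pruning (names reuse the toy
# TinyBlock sketch above and are hypothetical):
# scores = neuron_importance(model, calib_loader, model.blocks[0].up, 4096)
# keep_idx = scores.topk(k=2048).indices.sort().values
```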
Retraining with Knowledge Distillation
Retraining a pruned model relies on knowledge distillation, where the pruned model (the student) learns from the original model (the teacher) by mimicking its output distribution and intermediate states. The total loss combines the cross-entropy loss against the ground-truth tokens with KD losses on the logits and intermediate states.
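A minimal sketch of such a combined loss is shown below. The weighting coefficients, the temperature, and the assumption that student and teacher share a hidden size (so no projection layer is needed) are illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch of a combined retraining loss: CE on ground truth +
# KL on teacher logits + MSE on matched intermediate hidden states.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, alpha=1.0, beta=1.0, tau=1.0):
    s = student_logits.view(-1, student_logits.size(-1))
    t = teacher_logits.view(-1, teacher_logits.size(-1))

    # Standard language-modeling loss against the ground-truth tokens.
    ce = F.cross_entropy(s, labels.view(-1))

    # Logit distillation: KL between softened teacher and student distributions.
    kd_logits = F.kl_div(
        F.log_softmax(s / tau, dim=-1),
        F.softmax(t / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    # Intermediate-state distillation (assumes matching hidden sizes;
    # otherwise a learned projection on the student side would be needed).
    kd_hidden = sum(F.mse_loss(sh, th) for sh, th in zip(student_hidden, teacher_hidden))

    return ce + alpha * kd_logits + beta * kd_hidden
```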
Experimental Results
The paper provides comprehensive experimental results:
- Minitron 8B: This model achieved higher accuracy than Nemotron-3 8B and was on par with several state-of-the-art community models, while requiring significantly fewer training tokens, underscoring the efficiency of the proposed method.
- Minitron 4B: This smaller model retained competitive capabilities and outperformed some existing models despite its reduced size.
Best Practices for Structured Compression
The authors derive a set of best practices from their experiments:
- Train the largest model first, then prune and distill to obtain smaller models.
- Apply specific importance estimation techniques, with preferences for width over depth pruning.
- Retrain exclusively with the distillation loss rather than with conventional (ground-truth) training.
- Perform iterative pruning and lightweight retraining to stabilize rankings of pruned candidates.
- Prune models at the final stage of multi-phase training strategies to retain model capabilities effectively.
Implications and Future Work
The practical implications of this research are significant:
- Cost Reduction: The efficient training of a family of models using fewer resources.
- Performance: Competitive performance with reduced computational and data costs.
- Scalability: Applicability to various architectures and scales of models.
Theoretically, this research opens avenues for further exploration into the nuances of pruning strategies and the role of knowledge distillation in different model architectures.
The authors hint at potential future directions, such as the use of parameter-efficient fine-tuning techniques like LoRA during the retraining stage and extending their approach to instruction-tuned models.
Overall, this paper systematically addresses the inefficiencies in training LLMs by combining structured pruning with effective retraining through knowledge distillation, leading to significant advancements in the cost-efficiency and performance of smaller LLMs.