Compact LLMs via Pruning and Knowledge Distillation
The paper "Compact LLMs via Pruning and Knowledge Distillation" by Muralidharan et al. addresses the computational challenges associated with training LLMs by proposing an alternative approach to producing variants of an LLM through structured pruning and subsequent knowledge distillation. The authors establish a set of best practices for effectively compressing large pre-trained models like Nemotron-4 15B into smaller, yet more efficient models, named Minitron.
The key contributions of the paper include:
- Empirical Exploration of Pruning Strategies: The authors extensively explore structured pruning strategies to understand how models respond to pruning along different axes, such as depth (number of layers), width (numbers of neurons, attention heads, and embedding channels), and combinations thereof. Their empirical study revealed non-trivial insights, notably that width pruning was generally more effective than depth pruning, and that combining width and depth pruning yielded superior results after a retraining step (see the sketch after this list).
- Lightweight Retraining through Knowledge Distillation: The paper highlights the effectiveness of knowledge distillation for lightweight retraining of pruned models. By using a combination of the teacher model's logits and intermediate states, the authors demonstrate significant improvements in accuracy recovery after pruning. This approach proved more compute- and data-efficient than conventional retraining.
- Cost-Efficiency in Model Compression: One of the standout results is the significant reduction in computational cost and data requirements for training smaller models. The Minitron models—derived from the 15B parameter Nemotron-4—achieved competitive performance with up to 40 times fewer training tokens compared to models trained from scratch. Notably, the Minitron 8B model outperformed several similarly-sized models, such as Mistral 7B and LLaMa-2 7B, demonstrating the practical benefits of the proposed methodology.
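To make the two pruning axes above concrete, here is a minimal PyTorch sketch, not the paper's implementation: width pruning trims neurons inside a block, while depth pruning drops whole blocks. `TinyBlock`, the layer names, and the kept indices are all illustrative.

```python
# Illustrative sketch of width vs. depth structured pruning (toy model, not Nemotron-4).
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """A toy MLP block used to illustrate width pruning."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim)
        self.down = nn.Linear(ffn_dim, hidden_dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def width_prune_mlp(block: TinyBlock, keep_idx: torch.Tensor) -> TinyBlock:
    """Width pruning: keep only the MLP neurons listed in `keep_idx`."""
    hidden_dim = block.up.in_features
    pruned = TinyBlock(hidden_dim, len(keep_idx))
    pruned.up.weight.data = block.up.weight.data[keep_idx].clone()
    pruned.up.bias.data = block.up.bias.data[keep_idx].clone()
    pruned.down.weight.data = block.down.weight.data[:, keep_idx].clone()
    pruned.down.bias.data = block.down.bias.data.clone()
    return pruned

def depth_prune(blocks: nn.ModuleList, keep_layers: list[int]) -> nn.ModuleList:
    """Depth pruning: drop entire blocks, keeping only the listed layer indices."""
    return nn.ModuleList([blocks[i] for i in keep_layers])

# Usage: keep 512 of 4096 neurons in one block, and half of the layers.
blocks = nn.ModuleList([TinyBlock(1024, 4096) for _ in range(8)])
pruned_block = width_prune_mlp(blocks[0], torch.arange(512))
pruned_stack = depth_prune(blocks, keep_layers=[0, 2, 4, 6])
```

In practice the kept indices and layers would be chosen by the importance estimation described in the next section, not fixed by hand as here.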
Pruning Methodology
The authors' pruning approach begins by computing the importance of each component (neurons, attention heads, embedding channels, and layers) within the model from a small batch of calibration data, using activation-based metrics. For instance, the importance of each attention head is estimated from its output activations. Depth pruning is informed by metrics such as perplexity and block importance.
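As a rough illustration of activation-based scoring on a calibration batch, the sketch below accumulates per-neuron activation norms via a forward hook. The L2-norm aggregation, the hook target, and the loader interface are assumptions for this example, not the paper's exact formulation.

```python
# Hedged sketch: score neurons of one layer by activation magnitude on calibration data.
import torch

@torch.no_grad()
def neuron_importance(model, calib_loader, layer, num_neurons, device="cpu"):
    """Score each output neuron of `layer` by the mean L2 norm of its
    activations over a small calibration set; higher score = more important."""
    scores = torch.zeros(num_neurons, device=device)
    seen = 0

    def hook(_module, _inp, out):
        nonlocal seen
        # out: (batch, seq, num_neurons) -> per-neuron L2 norm over all tokens
        flat = out.reshape(-1, out.shape[-1])
        scores.add_(flat.norm(dim=0))
        seen += 1

    handle = layer.register_forward_hook(hook)
    for batch in calib_loader:          # assumes each batch is an input tensor
        model(batch.to(device))
    handle.remove()
    return scores / max(seen, 1)

# Rank neurons and keep the top half for width pruning (names reuse the toy
# TinyBlock sketch above and are hypothetical):
# scores = neuron_importance(model, calib_loader, model.blocks[0].up, 4096)
# keep_idx = scores.topk(k=2048).indices.sort().values
```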
Retraining with Knowledge Distillation
Retraining a pruned model relies on knowledge distillation, where the pruned model (the student) learns from the original model (the teacher) by mimicking its output distribution and intermediate states. The total loss combines the cross-entropy loss against the ground-truth tokens with KD losses on the logits and intermediate states.
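A minimal sketch of such a combined loss is shown below. The weighting coefficients, the temperature, and the assumption that student and teacher share a hidden size (so no projection layer is needed) are illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch of a combined retraining loss: CE on ground truth +
# KL on teacher logits + MSE on matched intermediate hidden states.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, alpha=1.0, beta=1.0, tau=1.0):
    s = student_logits.view(-1, student_logits.size(-1))
    t = teacher_logits.view(-1, teacher_logits.size(-1))

    # Standard language-modeling loss against the ground-truth tokens.
    ce = F.cross_entropy(s, labels.view(-1))

    # Logit distillation: KL between softened teacher and student distributions.
    kd_logits = F.kl_div(
        F.log_softmax(s / tau, dim=-1),
        F.softmax(t / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    # Intermediate-state distillation (assumes matching hidden sizes;
    # otherwise a learned projection on the student side would be needed).
    kd_hidden = sum(F.mse_loss(sh, th) for sh, th in zip(student_hidden, teacher_hidden))

    return ce + alpha * kd_logits + beta * kd_hidden
```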
Experimental Results
The paper provides comprehensive experimental results:
- Minitron 8B: This model achieved higher accuracy than Nemotron-3 8B and was on par with several state-of-the-art community models, while requiring significantly fewer training tokens, underscoring the efficiency of the proposed method.
- Minitron 4B: This smaller model retained competitive capabilities and outperformed some existing models despite its reduced size.
Best Practices for Structured Compression
The authors derive a set of best practices from their experiments:
- Train the largest model first, then prune and distill to obtain smaller models.
- Apply specific importance estimation techniques, with preferences for width over depth pruning.
- Retrain exclusively with the distillation loss rather than with conventional (ground-truth) training.
- Perform iterative pruning and lightweight retraining to stabilize rankings of pruned candidates.
- Prune models at the final stage of multi-phase training strategies to retain model capabilities effectively.
Implications and Future Work
The practical implications of this research are significant:
- Cost Reduction: The efficient training of a family of models using fewer resources.
- Performance: Competitive performance with reduced computational and data costs.
- Scalability: Applicability to various architectures and scales of models.
Theoretically, this research opens avenues for further exploration into the nuances of pruning strategies and the role of knowledge distillation in different model architectures.
The authors hint at potential future directions, such as the use of parameter-efficient fine-tuning techniques like LoRA during the retraining stage and extending their approach to instruction-tuned models.
Overall, this paper systematically addresses the inefficiencies in training LLMs by combining structured pruning with effective retraining through knowledge distillation, leading to significant advancements in the cost-efficiency and performance of smaller LLMs.