Self-Data Distillation for Recovering Quality in Pruned Large Language Models (2410.09982v3)

Published 13 Oct 2024 in cs.LG and cs.CL

Abstract: LLMs have driven significant progress in natural language processing, but their deployment requires substantial compute and memory resources. As models scale, compression techniques become essential for balancing model quality with computational efficiency. Structured pruning, which removes less critical components of the model, is a promising strategy for reducing complexity. However, one-shot pruning often results in significant quality degradation, particularly in tasks requiring multi-step reasoning. To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting by shifting the model's learned data distribution. Therefore, addressing the degradation from both pruning and SFT is essential to preserve the original model's quality. In this work, we utilize self-data distilled fine-tuning to address these challenges. Our approach leverages the original, unpruned model to generate a distilled dataset that preserves semantic richness and mitigates catastrophic forgetting by maintaining alignment with the base model's knowledge. Empirically, we demonstrate that self-data distillation consistently outperforms standard SFT, improving average accuracy by up to 8% on the HuggingFace OpenLLM Leaderboard v1. Specifically, when pruning six decoder blocks on Llama3.1-8B Instruct (i.e., 32 to 26 layers, reducing the model size from 8.03B to 6.72B parameters), our method retains 91.2% of the original model's accuracy compared to 81.7% with SFT, while reducing real-world FLOPs by 16.3%. Furthermore, combining self-data distilled models through model merging yields enhanced quality retention. Additionally, leveraging these pruned models in speculative decoding increases token acceptance rates, thereby improving inference efficiency in applied settings.

Self-Data Distillation for Recovering Quality in Pruned LLMs

The paper "Self-Data Distillation for Recovering Quality in Pruned LLMs" addresses a critical aspect of deploying LLMs efficiently: model pruning. LLMs have transformed NLP, yet their deployment demands substantial computational resources, a challenge exacerbated as these models grow in size. The paper primarily focuses on structured pruning, an approach that removes less critical components of the model, but typically at the cost of reduced accuracy.

Problem Definition and Motivation

The inherent problem with existing pruning methods, notably one-shot pruning, is significant degradation in model quality, particularly on tasks that require multi-step reasoning. To mitigate this quality loss, the paper examines fine-tuning strategies, specifically supervised fine-tuning (SFT). However, SFT can itself cause catastrophic forgetting by shifting the model away from its learned data distribution. This work introduces self-data distilled fine-tuning to counter both sources of degradation, leveraging the original, unpruned model to generate fine-tuning data that preserves semantic richness and stays aligned with the base model's knowledge.

Methodology

The methodology centers on two key processes:

  1. Structured Layer Pruning: Layer importance is measured with an angular cosine distance metric over activations, and contiguous blocks of layers whose inputs and outputs are most similar are pruned. This targets redundant layers whose removal has minimal impact on model capacity. The paper provides a detailed algorithm for this pruning strategy, with empirical validation suggesting that carefully chosen pruning depths incur minimal accuracy loss (see the block-selection sketch after this list).
  2. Self-Data Distilled Fine-Tuning: The original, unpruned model regenerates the fine-tuning targets, producing a distilled dataset that stays aligned with the model's learned distribution. This alignment mitigates catastrophic forgetting and improves post-pruning accuracy, outperforming standard SFT by up to 8% in average accuracy, particularly at aggressive pruning levels (see the distillation sketch below).
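
A minimal sketch of the block-selection step under the angular cosine distance criterion, assuming per-layer hidden states have already been collected from a small calibration pass (e.g. with output_hidden_states=True); the exact distance definition, averaging, and calibration data used in the paper may differ.

```python
import torch


def angular_distance(h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
    """Mean angular distance (scaled to [0, 1]) between two sets of hidden states.

    h_a, h_b: (num_tokens, hidden_dim) activations taken at two different depths.
    """
    cos = torch.nn.functional.cosine_similarity(h_a, h_b, dim=-1).clamp(-1.0, 1.0)
    return torch.arccos(cos).mean() / torch.pi


def select_block_to_prune(hidden_states: list[torch.Tensor], n_prune: int) -> int:
    """Start index of the contiguous block of `n_prune` decoder layers whose
    input and output activations are most similar, i.e. the block that is
    cheapest to remove under this similarity criterion.

    hidden_states[l]: residual-stream activations entering layer l (with one
    extra entry for the final layer's output), shape (num_tokens, hidden_dim).
    """
    num_layers = len(hidden_states) - 1
    best_start, best_dist = 0, float("inf")
    for start in range(num_layers - n_prune + 1):
        d = angular_distance(hidden_states[start], hidden_states[start + n_prune]).item()
        if d < best_dist:
            best_start, best_dist = start, d
    return best_start
```

The selected decoder blocks (e.g. six of Llama3.1-8B Instruct's 32) are then deleted, and the resulting quality gap is recovered through fine-tuning.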

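A minimal sketch of self-data distillation for a single training example, assuming a Hugging Face causal LM and an instruction/response-style dataset; the checkpoint name, rewrite prompt, consistency check, and decoding settings here are illustrative assumptions rather than the paper's exact recipe.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is illustrative; any instruction-tuned causal LM would do.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
teacher = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical rewrite prompt: the unpruned model restates the ground-truth
# response in its own words so the labels stay on its output distribution.
REWRITE_TEMPLATE = (
    "Rewrite the following response to the instruction in your own words, "
    "keeping the final answer unchanged.\n\n"
    "Instruction: {instruction}\n\nOriginal response: {response}\n\nRewritten response:"
)


def final_number(text: str) -> str | None:
    """Last number in the text, used as a crude consistency check (e.g. for GSM8k)."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None


@torch.no_grad()
def self_distill(instruction: str, response: str, max_new_tokens: int = 512) -> str:
    """Return the teacher-rewritten target, falling back to the original label
    if the rewrite no longer agrees with the ground-truth answer."""
    prompt = REWRITE_TEMPLATE.format(instruction=instruction, response=response)
    inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    rewrite = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return rewrite if final_number(rewrite) == final_number(response) else response
```

The pruned model is then fine-tuned on the rewritten targets in the usual SFT loop; only the labels change, not the training procedure.
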
Experimental Results

The empirical evaluation is conducted on Llama3.1-8B Instruct, with fine-tuning datasets such as GSM8k and OpenMathInstruct. Notably, self-data distillation enables the pruned model to retain 91.2% of the original model's accuracy, compared with 81.7% under standard SFT. The approach is robust across dataset sizes, with improvements becoming more pronounced as the fine-tuning dataset grows.

Discussion of Contributions

This research contributes significantly to the broader discourse on model efficiency. By introducing self-data distillation, the authors present a novel approach to preserving model quality post-pruning. The method’s ability to reduce FLOPs substantially while maintaining high accuracy across a range of benchmark tasks positions it as an efficient solution for LLM deployment challenges.

Implications and Future Directions

The practical implications of this research are substantial: reduced computational costs could democratize access to AI capabilities, facilitating wider use in resource-constrained environments. Theoretically, this work lays a foundation for further exploration of hybrid strategies that combine pruning with other compression techniques such as quantization or knowledge distillation.

Speculatively, as AI continues its trajectory towards increasing scale and complexity, self-data distillation could integrate with emerging paradigms in dynamic model scaling or elastic computing infrastructures, offering even greater flexibility and efficiency in model deployment. Future research might also explore the integration of self-data distillation with more advanced continual learning techniques to further mitigate catastrophic forgetting across diverse task domains.

Authors (5)
  1. Vithursan Thangarasa (13 papers)
  2. Ganesh Venkatesh (14 papers)
  3. Nish Sinnadurai (3 papers)
  4. Sean Lie (7 papers)
  5. Mike Lasby (6 papers)