- The paper introduces APT, whose key contribution is an outlier-aware salience score used to dynamically prune redundant parameters and reduce computational cost.
- The paper details an adaptive tuning procedure that selectively enhances influential layers to speed up convergence and optimize memory utilization.
- The paper demonstrates that APT retains up to 98% of task performance with only 40% of parameters remaining and speeds up fine-tuning by up to 8× compared to existing methods.
Overview of APT: Adaptive Pruning and Tuning
Adaptive Pruning and Tuning (APT) is a novel approach that addresses two critical challenges in fine-tuning and serving large language models (LMs): high memory cost and limited computational efficiency. The method adaptively prunes and tunes parameters within the LM, aiming to retain task performance while substantially improving training and inference efficiency.
Adaptive Pruning Strategies
APT's adaptive pruning dynamically adjusts which parameters are kept and tuned during the early stages of fine-tuning. Using an outlier-aware salience scoring function, APT identifies and discards unimportant parameters, improving both training and inference efficiency without substantially compromising accuracy. Unlike previous techniques that either tune a fixed set of parameters or require a fully trained teacher model for distillation, APT prunes proactively during fine-tuning itself. This leads to a drastic reduction in training and inference time, which is particularly noteworthy when training large LMs such as LLaMA.
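The sketch below is a minimal illustration of how an outlier-aware salience score could be computed, assuming a first-order (weight-times-gradient) importance term boosted by the fraction of outlier activations seen per input channel. The function names, the outlier threshold, and the exact way the two terms are combined are assumptions for illustration, not the paper's actual formula.

```python
import torch

def outlier_aware_salience(weight: torch.Tensor,
                           grad: torch.Tensor,
                           activations: torch.Tensor,
                           outlier_thresh: float = 6.0) -> torch.Tensor:
    """Illustrative per-input-channel salience (assumed form, not APT's exact scoring).

    weight, grad : (out_features, in_features) -- a linear layer's weight and its gradient
    activations  : (num_tokens, in_features)   -- inputs recently seen by that layer
    """
    # First-order (Taylor-style) importance |w * dL/dw|, aggregated per input channel.
    taylor = (weight * grad).abs().sum(dim=0)

    # Fraction of activation entries per channel whose magnitude is an "outlier".
    outlier_frac = (activations.abs() > outlier_thresh).float().mean(dim=0)

    # Channels carrying outlier features get boosted salience so they survive pruning
    # (assumed combination for illustration).
    return taylor * (1.0 + outlier_frac)


def prune_mask(salience: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Boolean mask keeping the top `keep_ratio` fraction of channels by salience."""
    k = max(1, int(keep_ratio * salience.numel()))
    idx = torch.topk(salience, k).indices
    mask = torch.zeros_like(salience, dtype=torch.bool)
    mask[idx] = True
    return mask
```

In this sketch, channels below the salience cutoff would simply be dropped from the layer; the key point is that salience is recomputed as fine-tuning progresses, so pruning decisions adapt to the task rather than being fixed up front.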
Adaptive Tuning Procedures
APT is not only about pruning: the retained parameters are tuned adaptively throughout the fine-tuning phase. The method dynamically adds tuning parameters to layers according to their importance, as determined by the computed salience, which significantly accelerates convergence and recovers model performance lost to pruning. Unlike static tuning configurations, APT's adaptivity ensures that capacity is added only to the most influential layers, keeping fine-tuning memory usage efficient without requiring additional computational resources at inference time.
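To make the adaptive-tuning idea concrete, here is a minimal sketch that grows the rank of a LoRA-style adapter on the layers whose salience is highest. The GrowableLoRALinear class, the growth rule, and the layer-selection helper are hypothetical illustrations of "add tuning parameters where they matter", not APT's actual parameter-addition mechanism.

```python
import torch
import torch.nn as nn

class GrowableLoRALinear(nn.Module):
    """LoRA-style adapter whose rank can be grown during fine-tuning (sketch only)."""

    def __init__(self, base: nn.Linear, init_rank: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # pretrained weight stays frozen
        out_f, in_f = base.weight.shape
        self.lora_a = nn.Parameter(torch.randn(init_rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, init_rank))

    def grow(self, extra_rank: int) -> None:
        """Append `extra_rank` new LoRA dimensions. The new columns of B are zero,
        so the layer's output is unchanged at the moment of growth."""
        out_f, in_f = self.base.weight.shape
        new_a = torch.randn(extra_rank, in_f, device=self.lora_a.device) * 0.01
        new_b = torch.zeros(out_f, extra_rank, device=self.lora_b.device)
        self.lora_a = nn.Parameter(torch.cat([self.lora_a.detach(), new_a], dim=0))
        self.lora_b = nn.Parameter(torch.cat([self.lora_b.detach(), new_b], dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the low-rank update (x A^T) B^T.
        return self.base(x) + (x @ self.lora_a.T) @ self.lora_b.T


def grow_most_salient(adapters: dict, layer_salience: dict,
                      top_k: int = 4, extra_rank: int = 8) -> None:
    """Hypothetical selection step: grow adapters only in the top-k salient layers."""
    ranked = sorted(layer_salience, key=layer_salience.get, reverse=True)
    for name in ranked[:top_k]:
        adapters[name].grow(extra_rank)
```

Because the new B factors start at zero, growing an adapter leaves the layer's function unchanged at the moment of growth, so training continues smoothly; in a real training loop the optimizer state would also need to be refreshed for the newly created parameters.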
Analysis and Comparison
Experiments with APT demonstrate its compelling capabilities. It maintains up to 98% of task performance with only 40% of parameters remaining in RoBERTa and T5 models, and 86.4% of performance with 70% of parameters retained in LLaMA models. When contrasted with benchmark methods such as LoRA and structured pruning, APT shows superior training efficiency, fine-tuning smaller models up to 8 times faster and reducing the training memory footprint of large models like LLaMA by up to 70%.
Conclusion
APT represents a significant step toward more efficient training and inference of LMs. Its adaptive pruning and tuning framework plays a foundational role in maintaining high model performance with considerably fewer parameters. Moreover, it accelerates convergence and significantly reduces the memory and compute burden, facilitating the practical application of LMs even on hardware with strict limitations. Future research could explore extending APT's paradigm to more sophisticated PEFT architectures, potentially achieving even greater performance recovery in large-scale LLMs.