Prune Once for All: Sparse Pre-Trained Language Models (2111.05754v1)

Published 10 Nov 2021 in cs.CL, cs.AI, and cs.LG

Abstract: Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used to transfer learning for a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8bit we achieve a compression ratio of $40$X for the encoder with less than $1\%$ accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.

Overview of "Prune Once for All: Sparse Pre-Trained Language Models"

The paper "Prune Once for All: Sparse Pre-Trained Language Models" presents a methodology for enhancing the efficiency of Transformer-based language models, focusing on resource constraints and deployment challenges. The proposed approach, termed Prune OFA (Prune Once for All), combines weight pruning with model distillation to create sparse pre-trained models that can be applied across various NLP tasks without significant performance loss.

Key Contributions

The Prune OFA method offers several contributions:

  1. Architecture-Agnostic Sparse Pre-Training: The authors introduce a generalized technique to train sparse versions of BERT-Base, BERT-Large, and DistilBERT, specifically during the pre-training phase. This approach does not necessitate task-specific alterations, reducing the tuning overhead frequently associated with deploying sparse models for particular NLP tasks.
  2. Transfer Learning Efficiency: By maintaining the sparsity pattern inherent in the pre-trained models, these models can subsequently be fine-tuned on downstream tasks like SQuAD, MNLI, QNLI, SST-2, and QQP with minimal accuracy degradation. This reduces computational demands during the transfer learning phase.
  3. Advanced Compression Techniques: The paper demonstrates the integration of quantization-aware training to further compress model weights to 8-bit precision, achieving a substantial reduction in memory footprint (see the sketch after this list). For instance, the sparse BERT-Large model, fine-tuned on SQuADv1.1 and quantized to 8-bit precision, reaches a compression ratio of 40X for the encoder with less than 1% accuracy loss.
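
Below is a minimal sketch of quantization-aware training using simulated ("fake") 8-bit quantization in PyTorch. The `fake_quantize` helper and `FakeQuantLinear` layer are illustrative names chosen for this example, not the authors' implementation; they simply show the straight-through-estimator pattern that lets a model learn weights that survive conversion to real int8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric int8 quantization in the forward pass while
    letting gradients pass through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1                        # 127 for 8 bits
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.round(x / scale).clamp(-qmax, qmax) * scale
    # Forward uses the quantized values; backward sees the identity.
    return x + (x_q - x).detach()


class FakeQuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized during training,
    so the fine-tuned model stays accurate after real int8 conversion."""

    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        return F.linear(inp, fake_quantize(self.weight), self.bias)
```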

Methodological Insights

Weight Pruning

The approach combines gradual magnitude pruning with learning rate rewinding to prune and refine the models during the pre-training phase. Together, these schedules allow the models to reach high sparsity ratios while preserving task performance.
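
As a concrete illustration, the sketch below shows a cubic gradual-magnitude-pruning schedule of the kind popularized by Zhu & Gupta, applied per layer in PyTorch. The function names, the per-layer masking, and the schedule constants are assumptions made for this example, not the paper's exact recipe.

```python
import torch
import torch.nn as nn


def sparsity_at_step(step: int, total_steps: int,
                     final_sparsity: float, initial_sparsity: float = 0.0) -> float:
    """Cubic schedule: sparsity ramps smoothly from initial to final value."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - frac) ** 3


def apply_magnitude_pruning(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights of a layer; returns the binary mask."""
    weight = layer.weight.data
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    layer.weight.data.mul_(mask)
    return mask
```

In such a setup, the target sparsity would be recomputed every few hundred optimizer steps during pre-training and the mask re-applied, so the network ramps gradually toward, for example, 85% or 90% sparsity.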

Knowledge Distillation

Knowledge distillation plays a pivotal role in the pruning process: a dense teacher model guides the training of the sparse student model while it is being pruned. Applying distillation during both pre-training and fine-tuning ensures that the knowledge captured by the dense model is effectively transferred to its sparse counterpart.
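
The sketch below shows a standard distillation objective of this kind: a soft KL term against the teacher's logits blended with the usual hard-label loss. The temperature and weighting values are illustrative defaults, not the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a temperature-softened KL term against the teacher with the
    hard-label cross-entropy loss (alpha and temperature are illustrative)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```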

Pattern-Lock

The paper introduces Pattern-lock, a mechanism that keeps weights pruned to zero during pre-training at zero throughout fine-tuning. This technique is crucial for carrying the efficiency gains achieved through pruning over to the downstream tasks.
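
One common way to realize such a lock is to record the zero pattern once and register gradient hooks that block updates to those entries; the hook-based sketch below is an assumption about how this could be implemented, not necessarily the authors' code.

```python
import torch
import torch.nn as nn


def lock_sparsity_pattern(model: nn.Module) -> None:
    """Freeze the zero pattern of already-sparse weight matrices.

    For every weight tensor, record which entries are currently zero and
    register a hook that zeroes the corresponding gradients, so those
    entries can never move away from zero during fine-tuning."""
    for name, param in model.named_parameters():
        if param.dim() < 2:          # skip biases and LayerNorm parameters
            continue
        mask = (param.data != 0).float()
        param.register_hook(lambda grad, mask=mask: grad * mask)
```

After calling `lock_sparsity_pattern(model)`, fine-tuning proceeds as usual and the encoder's sparsity ratio is preserved exactly.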

Experimental Results

The experiments affirm the effectiveness of the Prune OFA method across the three architectures. On the NLP benchmarks considered, the sparse models trained with Prune OFA showed negligible performance decline compared to their dense counterparts. Specifically, the BERT-Base model kept its accuracy drop below 1% at an 85% sparsity level across several tasks, demonstrating the viability of the approach for deploying language models in resource-constrained environments.

Furthermore, the combination of Prune OFA and quantization yields highly compressed models that preserve accuracy. The results highlight the interplay between sparsity and quantization: combining high sparsity with 8-bit precision produces compact, efficient models better suited for edge deployments.

Implications and Future Research

The Prune OFA methodology holds significant implications for the deployment of large NLP models, making them more accessible without compromising accuracy. Practically, reducing model size can contribute to lessening the environmental impact of large-scale machine learning applications by lowering energy consumption and hardware requirements.

The research opens avenues for exploring the capabilities of sparse models further, such as investigating the transfer potential of sparsity patterns across non-standard NLP tasks or extending the approach to models with even larger parameter counts. Additionally, there is potential to explore whether adopting sparse pre-training strategies could influence the architecture design of future LLMs, encouraging the development of inherently sparse architectures.

Overall, "Prune Once for All: Sparse Pre-Trained Language Models" provides a comprehensive framework for reducing model size and computational requirements, paving the way for more sustainable AI applications in various commercial and research domains.

Authors (5)
  1. Ofir Zafrir (5 papers)
  2. Ariel Larey (8 papers)
  3. Guy Boudoukh (5 papers)
  4. Haihao Shen (11 papers)
  5. Moshe Wasserblat (22 papers)
Citations (73)