Sparse Fine-tuning for Accelerated LLM Inference
The paper "Sparse Fine-tuning for Inference Acceleration of LLMs" by Eldar Kurtic et al. explores the challenge of efficiently fine-tuning LLMs while imposing sparsity on their weights. This research endeavor addresses two primary objectives: achieving high accuracy in sparse LLMs and realizing practical efficiency during inference through sparsity-based acceleration.
Key Contributions and Findings
- Accuracy Recovery with High Sparsity: The paper shows that conventional fine-tuning often fails to maintain accuracy at high sparsity levels. To address this, the authors propose a distillation-based loss named SquareHead, which adds an L2 (MSE) distillation term on intermediate representations. Experiments demonstrate that SquareHead recovers accuracy across different LLM types, even at high sparsities.
- Efficiency Gains: On the efficiency side, the authors show that sparsity can significantly speed up LLM inference on both CPUs and GPUs. They demonstrate this by running sparse LLMs end to end: T5 for language translation, Whisper for speech transcription, and open GPT-type models such as MPT for text generation. For instance, the MPT model reaches up to 75% sparsity without an accuracy decline. They further show that sparsity combines well with quantization, compounding the efficiency gains.
- Memory-Bound Inference: The research highlights a distinctive benefit of sparsity in easing memory bandwidth constraints, which dominate memory-bound LLM inference such as autoregressive decoding. The paper provides empirical evidence that storing weights in a compressed sparse format and decompressing them on the fly during computation yields substantial speedups, since fewer bytes must be read from memory per step; a rough back-of-the-envelope illustration follows below.
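To make the bandwidth argument concrete, here is a minimal sketch, not taken from the paper: it assumes a naive bitmask-plus-nonzero-values storage layout and an illustrative 7B-parameter decoder, and simply compares how many bytes must be read for one full pass over the weights (roughly what each generated token costs in the memory-bound regime). Real runtimes use their own compressed formats, so the exact numbers will differ.

```python
# Back-of-the-envelope weight-traffic comparison (illustrative assumptions only).

def weight_bytes(num_params, sparsity, bits_per_value=16, bitmask=True):
    """Bytes to store a weight tensor with the given fraction of zeros.

    Assumes a simple layout: one bit per position to mark nonzeros (the
    bitmask), plus the surviving values at `bits_per_value` bits each.
    """
    nonzeros = int(num_params * (1.0 - sparsity))
    value_bytes = nonzeros * bits_per_value / 8
    mask_bytes = num_params / 8 if bitmask else 0
    return value_bytes + mask_bytes


params = 7e9  # hypothetical 7B-parameter model

dense_fp16 = weight_bytes(params, sparsity=0.0, bits_per_value=16, bitmask=False)
sparse80_int8 = weight_bytes(params, sparsity=0.8, bits_per_value=8)

print(f"dense FP16:        {dense_fp16 / 1e9:.2f} GB per full weight read")
print(f"80% sparse INT8:   {sparse80_int8 / 1e9:.2f} GB per full weight read")
print(f"traffic reduction: {dense_fp16 / sparse80_int8:.1f}x")
```

Under these assumptions, an 80%-sparse INT8 model moves roughly 6x less weight data per decoding step than a dense FP16 one, which is the kind of reduction that shows up directly as a speedup when inference is memory-bound.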
Methodology
The methodology consists of several components designed to achieve sparse fine-tuning while maintaining high accuracy:
- Sparsification:
Sparse LLMs are obtained by imposing progressively higher sparsity levels during fine-tuning. This gradual approach lets the model adapt to sparsity step by step, reducing the risk of instability and divergence (see the sketch after this list).
- Distillation Strategies:
The research compares distillation strategies: standard knowledge distillation (KD) on the output distribution using a cross-entropy loss, and the proposed SquareHead KD, which adds intermediate-layer distillation with a normalized mean squared error (MSE) loss. SquareHead consistently outperforms the alternatives, recovering accuracy even at high sparsities (it is also covered in the sketch after this list).
- Runtime Acceleration:
The runtime benefits of sparsity are realized through sparsity-aware kernels and algorithmic optimizations for sparse matrix computation. On GPUs, custom CUDA kernels execute N:M sparsity patterns (such as 2:4) efficiently, achieving notable speedups over dense execution.
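To make these components concrete, the sketch below combines them into a single illustrative fine-tuning step in PyTorch. It is a minimal sketch under several assumptions: a magnitude-based mask (unstructured or 2:4) stands in for whatever pruning criterion the paper actually uses, the SquareHead-style loss mixes the task loss, logit distillation, and normalized-MSE distillation of intermediate hidden states with equal weights, and model outputs are assumed to follow the Hugging Face convention of exposing `logits` and `hidden_states`. The authors' exact pruner, layer mapping, and loss weighting may differ.

```python
import torch
import torch.nn.functional as F


def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Unstructured mask that keeps the largest-magnitude fraction of entries."""
    num_keep = max(1, int(weight.numel() * (1.0 - sparsity)))
    idx = torch.topk(weight.abs().flatten(), num_keep).indices
    mask = torch.zeros(weight.numel(), dtype=torch.bool, device=weight.device)
    mask[idx] = True
    return mask.view_as(weight)


def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """2:4 semi-structured mask: keep the 2 largest entries in every group of 4."""
    groups = weight.reshape(-1, 4)  # assumes weight.numel() is divisible by 4
    idx = torch.topk(groups.abs(), k=2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, idx, torch.ones_like(idx, dtype=torch.bool))
    return mask.reshape(weight.shape)


def squarehead_style_loss(student_out, teacher_out, labels, temperature=2.0):
    """Task loss + logit distillation + normalized MSE over hidden states."""
    s_logits, s_hidden = student_out.logits, student_out.hidden_states
    t_logits, t_hidden = teacher_out.logits, teacher_out.hidden_states

    # Standard task loss on ground-truth labels (causal-LM label shifting omitted).
    task = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

    # Logit distillation with softened distributions.
    T = temperature
    logit_kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)

    # Per-layer MSE normalized by the teacher feature magnitude, then averaged.
    feat = 0.0
    for h_s, h_t in zip(s_hidden, t_hidden):  # assumes matching depths
        feat = feat + F.mse_loss(h_s, h_t) / h_t.pow(2).mean().clamp_min(1e-8)
    feat = feat / len(s_hidden)

    return task + logit_kd + feat  # equal weighting assumed for illustration


def sparse_finetune_step(student, teacher, optimizer, masks, batch):
    """One distillation step; masks are re-applied so pruned weights stay zero."""
    with torch.no_grad():
        teacher_out = teacher(**batch, output_hidden_states=True)
    student_out = student(**batch, output_hidden_states=True)

    loss = squarehead_style_loss(student_out, teacher_out, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        for name, param in student.named_parameters():
            if name in masks:
                param.mul_(masks[name])  # keep pruned positions at zero
    return loss.item()


# Example usage: build masks once per sparsity target, then fine-tune with them fixed.
# masks = {name: magnitude_mask(p.detach(), sparsity=0.7)
#          for name, p in student.named_parameters() if p.dim() == 2}
```

In a gradual schedule, this step is simply repeated while the sparsity target used to build the masks is raised, for example from 50% to 60% to 70%, with a period of fine-tuning at each level before pruning further.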
Numerical Results and Practical Implications
The empirical evaluation spans various models and tasks, yielding robust findings:
- For T5 fine-tuned on the WMT14 English-German translation task, sparsity levels of up to 75% are achieved with minimal BLEU degradation, translating to over a 2x inference speedup on CPUs.
- The Whisper model for Hindi ASR shows competitive Word Error Rates (WER) at 70-80% sparsity, with speedups reaching 2.5x on CPUs.
- MPT-7B generative models fine-tuned on the GSM8K dataset maintain performance at 70% sparsity, with a substantial increase in generated tokens per second. Combining sparsity with INT8 quantization yields further gains in decoding speed, exemplified by a 9.08x speedup at 80% sparsity.
Future Directions
The findings of this paper open avenues for further work on efficient AI. The proposed techniques can be extended to larger LLMs and more complex tasks, enabling the deployment of resource-intensive models on devices with limited computational capacity. Future research could explore advanced quantization techniques and pretraining strategies that incorporate sparsity from the outset. Integrating these methods into real-world applications could significantly reduce operational costs and energy consumption, promoting the sustainable deployment of AI technologies.
Conclusion
This paper presents a comprehensive study of sparse fine-tuning for LLMs, showcasing the dual benefits of accuracy preservation and inference acceleration. By employing the novel SquareHead distillation loss and demonstrating sparsity speedups in practical settings, the research delineates a clear path toward efficient LLM deployment. The implications of this work are far-reaching, potentially transforming AI deployment in computationally constrained environments.