Lillama: Large Language Models Compression via Low-Rank Feature Distillation (2412.16719v2)

Published 21 Dec 2024 in cs.LG and cs.AI

Abstract: Current LLM structured pruning methods typically involve two steps: (1) compression with calibration data and (2) costly continued pretraining on billions of tokens to recover lost performance. This second step is necessary as the first significantly impacts model accuracy. Prior research suggests pretrained Transformer weights aren't inherently low-rank, unlike their activations, which may explain this drop. Based on this observation, we propose Lillama, a compression method that locally distills activations with low-rank weights. Using SVD for initialization and a joint loss combining teacher and student activations, we accelerate convergence and reduce memory use with local gradient updates. Lillama compresses Mixtral-8x7B within minutes on a single A100 GPU, removing 10 billion parameters while retaining over 95% of its original performance. Phi-2 3B can be compressed by 40% with just 13 million calibration tokens, resulting in a small model that competes with recent models of similar size. The method generalizes well to non-transformer architectures, compressing Mamba-3B by 20% while maintaining 99% performance.

Summary

  • The paper introduces a one-shot compression strategy using low-rank feature distillation to reduce model complexity.
  • It leverages SVD initialization and a local distillation loss to maintain up to 99% performance and improve inference speed by 20%.
  • The method is model-independent and validated across various architectures, offering practical benefits over resource-intensive alternatives.

LLM Compression via Low-Rank Feature Distillation

Introduction

The paper addresses the challenge of compressing LLMs without the high computational cost and performance degradation typical of existing pipelines. Current approaches pair structured pruning with extensive continued pretraining to recover lost accuracy, which is resource-intensive. Lillama instead performs one-shot compression via low-rank feature distillation, preserving model performance while reducing memory and compute requirements.

Proposed Method

The proposed methodology involves a three-step process:

  1. Layer Selection and SVD Initialization: The method first selects which layers to compress in order to reach a target compression ratio, then initializes each compressed layer's low-rank weights with a truncated Singular Value Decomposition (SVD) of the original weights, which gives the optimal low-rank approximation. A minimal initialization sketch follows this list.

    Figure 1: Our compression approach: STEP 1 selects layers to compress for a target compression ratio (e.g., N%) using various strategies.

  2. Local Distillation Loss: A joint loss combining teacher and student activations accelerates convergence and improves performance, while local gradient updates (one layer at a time) keep memory requirements low. A hedged sketch of one possible form of this loss also follows the list.

    Figure 2: Convergence of the three candidate distillation losses; the joint loss generally converges better.

  3. Layer-Wise Implementation: The compression is layer-wise and model-independent, so it applies to both linear and non-linear layers and generalizes across architectures, including non-Transformer models such as Mamba.
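
As a rough illustration of the SVD initialization in step 1, the sketch below factors a single weight matrix into two low-rank factors using a truncated SVD, which is the optimal low-rank approximation in the Frobenius norm. The function name svd_init and the even split of singular values between the two factors are illustrative choices, not taken from the paper. Note that a rank-r factorization of an m-by-n weight stores r(m + n) parameters instead of mn, so compression requires r < mn / (m + n).

```python
import torch

def svd_init(weight: torch.Tensor, rank: int):
    """Initialize low-rank factors A, B so that B @ A approximates `weight`.

    Truncated SVD is the best rank-`rank` approximation in the Frobenius
    norm (Eckart-Young), which is why it is a natural starting point
    before distillation. Function name and factor split are illustrative.
    """
    # weight has shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    sqrt_S = torch.diag(S[:rank].sqrt())
    B = U[:, :rank] @ sqrt_S       # (out_features, rank)
    A = sqrt_S @ Vh[:rank, :]      # (rank, in_features)
    return A, B                    # x -> B @ (A @ x)
```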
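The paper describes the joint loss of step 2 only at a high level (a combination of teacher- and student-activation terms trained with local gradients), so the following is a hedged sketch of one plausible form rather than the authors' exact formulation; the MSE distance and the function name joint_loss are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(student_layer, teacher_layer, x_teacher, x_student):
    """Assumed form of the joint loss, not the authors' exact code.

    x_teacher: activations entering the original (frozen) layer.
    x_student: the corresponding activations in the compressed model.
    Both terms pull the student's output toward the teacher's output;
    only the student layer receives gradients, so updates stay local.
    """
    with torch.no_grad():
        y_teacher = teacher_layer(x_teacher)  # frozen reference activations
    teacher_term = F.mse_loss(student_layer(x_teacher), y_teacher)
    student_term = F.mse_loss(student_layer(x_student), y_teacher)
    return teacher_term + student_term
```

Dropping one of the two terms recovers a teacher-only or student-only variant, which may correspond to the other losses compared in Figure 2.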

Experimental Results

The paper provides comprehensive results demonstrating the efficiency of the method:

  • Compression Efficiency: Mixtral-8x7B was compressed by roughly 10 billion parameters while retaining over 95% of its original performance, and Phi-2 3B was compressed by 40% using only 13 million calibration tokens. The compressed Mamba-3B maintained 99% of its performance.

(Table 1 & Table 2 illustrations in the original paper context)

  • Inference Speed and Memory Gain: The compression improves inference speed by up to 20% and substantially reduces memory usage. For instance, the compressed Mixtral model fits on a single A100 GPU, which was not feasible before compression.
  • Comparison with Existing Methods: The paper’s method stands out against current methods like SparseGPT and Wanda by being less computationally intensive and independent of specialized GPU kernels.

Implementation Details

The method is implemented with standard PyTorch components and can be plugged into existing models with minimal changes. Because each layer is optimized locally, GPU memory usage stays low, making the approach practical even for very large models.

Figure 3: An example PyTorch implementation of our approach with the teacher loss.
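
To make the workflow concrete, here is a minimal, hypothetical PyTorch sketch in the spirit of Figure 3: a low-rank drop-in replacement for an nn.Linear, initialized by truncated SVD, and a local distillation loop that uses only the teacher loss. Class and function names (LowRankLinear, distill_layer) and hyperparameters are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankLinear(nn.Module):
    """Drop-in replacement for an nn.Linear, factored into two smaller layers."""

    def __init__(self, linear: nn.Linear, rank: int):
        super().__init__()
        # Truncated SVD of the original weight provides the initialization.
        U, S, Vh = torch.linalg.svd(linear.weight.data.float(), full_matrices=False)
        sqrt_S = torch.diag(S[:rank].sqrt())
        self.down = nn.Linear(linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
        self.down.weight.data.copy_(sqrt_S @ Vh[:rank, :])
        self.up.weight.data.copy_(U[:, :rank] @ sqrt_S)
        if linear.bias is not None:
            self.up.bias.data.copy_(linear.bias.data)

    def forward(self, x):
        return self.up(self.down(x))


def distill_layer(teacher, student, calib_batches, steps=3, lr=1e-4):
    """Locally distill one compressed layer on cached activations (teacher loss only)."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_batches:          # cached hidden states entering this layer
            x = x.detach()               # gradients never propagate to other layers
            with torch.no_grad():
                target = teacher(x)      # frozen teacher activations
            loss = F.mse_loss(student(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In practice one would build the replacement, distill it on cached calibration activations, and then swap it into the model, e.g. student = LowRankLinear(block.mlp.fc1, rank=512); distill_layer(block.mlp.fc1, student, cached_inputs); block.mlp.fc1 = student (the module path block.mlp.fc1 is hypothetical).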

Limitations and Future Work

The method currently balances speed against applicability, trading compression complexity for generalizability. Future work envisages complementing the technique with quantization and additional pretraining stages, potentially enhancing practical deployments further.

Conclusion

This low-rank feature distillation approach offers an efficient way to compress LLMs with minimal performance loss. Its adaptability across architectures and modest resource requirements make it a strong candidate for real-world applications with tight compute and memory budgets.
