EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Published 28 Oct 2024 in cs.CL and cs.AI | (2410.21271v4)

Abstract: While post-training compression techniques effectively reduce the memory footprint, latency, and power consumption of LLMs, they often result in noticeable accuracy degradation and remain limited by hardware and kernel constraints that restrict supported compression formats, ultimately reducing flexibility across a wide range of deployment scenarios. In this work, we propose EoRA, a novel fine-tuning-free method that augments compressed LLMs with low-rank matrices, allowing users to rapidly enhance task-specific performance and freely balance the trade-off between accuracy and computational overhead beyond the constraints of compression formats. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs, achieving notable accuracy improvements (e.g., $\mathbf{10.84\%}$ on ARC-Challenge, $\mathbf{6.74\%}$ on MathQA, and $\mathbf{11.45\%}$ on GSM8K) for LLaMA3-8B compressed to 3-bit. We also introduce an optimized CUDA kernel, accelerating inference by up to 1.4x and reducing memory overhead through quantizing EoRA. Overall, EoRA offers a prompt solution for improving the accuracy of compressed models under varying user requirements, enabling more efficient and flexible deployment of LLMs. Code is available at https://github.com/NVlabs/EoRA.


Authors (13)

Summary

The paper introduces a novel training-free compensation method using eigenspace low-rank approximations to mitigate compression errors in LLMs.
It projects errors onto input activation eigenspaces to prioritize reconstructing critical weight components without gradient-based retraining.
Benchmark tests reveal up to 31.31% accuracy improvement on ARC-E and robust performance across diverse tasks and extreme compression settings.

An Overview of EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

The research paper by Shih-Yang Liu et al. presents a novel approach to mitigating the errors introduced during the compression of LLMs. The proposed method, Training-free Eigenspace Low-Rank Approximation (EoRA), re-conceptualizes model compression as a problem of customized compensation. This reframing allows for the integration of low-rank residual paths to compensate for compression errors across different tasks and settings without being constrained by specific compression formats.
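The idea of a low-rank residual path can be illustrated with a minimal NumPy sketch. It assumes a single linear layer with weight `W`, a compressed copy `W_hat`, and a rank-`r` pair `(B, A)` added alongside it. The function names, the crude rounding used as a stand-in for compression, and the use of a plain truncated SVD to build the factors are all assumptions made for illustration; this is the generic baseline construction, not the eigenspace projection that EoRA itself uses.

```python
import numpy as np

def low_rank_residual(W, W_hat, rank):
    """Rank-`rank` factors (B, A) from a plain truncated SVD of the
    compression error, so that W_hat + B @ A approximates W."""
    delta = W - W_hat                               # compression error
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B = U[:, :rank] * S[:rank]                      # (d_out, rank)
    A = Vt[:rank, :]                                # (rank, d_in)
    return B, A

def compensated_forward(W_hat, B, A, x):
    """Compressed main path plus the low-rank residual path."""
    return W_hat @ x + B @ (A @ x)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
W_hat = np.round(W * 2) / 2      # crude stand-in for compression: snap to a 0.5 grid
B, A = low_rank_residual(W, W_hat, rank=16)

x = rng.standard_normal((128, 8))
err_before = np.linalg.norm(W @ x - W_hat @ x)
err_after = np.linalg.norm(W @ x - compensated_forward(W_hat, B, A, x))
print(f"output error without residual path: {err_before:.3f}")
print(f"output error with residual path:    {err_after:.3f}")
```

Because the residual path is independent of how `W_hat` was produced, the same construction applies whether the compressed weight comes from pruning, quantization, or both, and the rank can be chosen to trade accuracy against extra compute.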

EoRA Methodology

Traditional SVD-based methods tend to use low-rank representation capacity suboptimally because they rely on a generic decomposition, singular value decomposition (SVD), which does not account for the varying importance of individual model weights. EoRA addresses this limitation by projecting the compression error into the eigenspace of the input activations and using the eigenvalues to prioritize reconstruction of the higher-importance error components. The procedure requires no gradient-based training and computes the compensation quickly from a small amount of calibration data.
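A rough NumPy sketch of this projection step follows, under stated assumptions: the layer computes `W @ x`, `X` holds calibration activations as columns, and `delta_w` is the compression error `W - W_hat`. The function name `eigenspace_low_rank`, the averaging of the activation covariance, and the eigenvalue floor `eps` (added only to keep the projection invertible) are illustrative choices, not details confirmed by the summary above.

```python
import numpy as np

def eigenspace_low_rank(delta_w, X, rank, eps=1e-6):
    """Rank-`rank` factors (B, A) that approximate delta_w with the error
    weighted by the eigen-decomposition of the activation covariance."""
    cov = X @ X.T / X.shape[1]                  # (d_in, d_in) covariance
    eigvals, Q = np.linalg.eigh(cov)            # cov = Q diag(eigvals) Q^T
    eigvals = np.clip(eigvals, eps * eigvals.max(), None)  # floor for invertibility
    scale = np.sqrt(eigvals)
    Qp = Q * scale                              # Q @ diag(sqrt(eigvals))
    Qp_inv = (Q / scale).T                      # diag(1/sqrt(eigvals)) @ Q^T
    # Project the error into the eigenspace, truncate there, project back.
    U, S, Vt = np.linalg.svd(delta_w @ Qp, full_matrices=False)
    B = U[:, :rank] * S[:rank]                  # (d_out, rank)
    A = Vt[:rank] @ Qp_inv                      # (rank, d_in)
    return B, A

rng = np.random.default_rng(0)
d_out, d_in, n, r = 32, 64, 512, 8
# Anisotropic activations: a few input directions carry most of the energy.
X = rng.standard_normal((d_in, n)) * np.logspace(0, -2, d_in)[:, None]
delta_w = 0.1 * rng.standard_normal((d_out, d_in))

B, A = eigenspace_low_rank(delta_w, X, r)

# Plain truncated SVD of the raw error, for comparison at the same rank.
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
plain = (U[:, :r] * S[:r]) @ Vt[:r]

err_eig = np.linalg.norm((delta_w - B @ A) @ X)
err_svd = np.linalg.norm((delta_w - plain) @ X)
print(f"activation-weighted error, eigenspace: {err_eig:.4f}")
print(f"activation-weighted error, plain SVD:  {err_svd:.4f}")
```

The comparison measures the error on the layer output for the calibration activations rather than on the weights themselves, which is the quantity the eigenvalue weighting is meant to reduce. In this synthetic setup the eigenspace variant should give the smaller value at equal rank.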

Experimental Outcomes

EoRA was benchmarked against previous SVD methods on numerous tasks, including language generation (WikiText2), commonsense reasoning (ARC-Easy/ARC-Challenge), and mathematical reasoning (MathQA), across compressed LLaMA2/3 models. For instance, when LLaMA3-8B is pruned to 2:4 sparsity and quantized to 4-bit, EoRA improved accuracy by 31.31% on ARC-Easy (ARC-E) and 12.88% on ARC-Challenge (ARC-C) compared to baseline models. EoRA's compensation also held up in more aggressive scenarios, showing robustness across model sizes and under extreme compression settings.
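To make the combined setting concrete, the sketch below applies 2:4 sparsity (at most two nonzero weights in every group of four along the input dimension) followed by 4-bit quantization to a toy weight matrix. Magnitude pruning and symmetric per-row round-to-nearest quantization are simple stand-ins chosen for illustration. The summary does not say which pruning or quantization algorithms the paper used, and the function names are hypothetical.

```python
import numpy as np

def prune_2_4(W):
    """Zero the 2 smallest-magnitude weights in each group of 4 input columns."""
    d_out, d_in = W.shape
    assert d_in % 4 == 0, "input dimension must be a multiple of 4"
    groups = W.reshape(d_out, d_in // 4, 4)
    order = np.argsort(np.abs(groups), axis=-1)      # ascending by magnitude
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., :2], False, axis=-1)
    return (groups * mask).reshape(d_out, d_in)

def quantize_rtn(W, bits=4):
    """Symmetric per-row round-to-nearest quantization, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))
W_hat = quantize_rtn(prune_2_4(W), bits=4)

# The compression error that a compensation method would then target.
delta_w = W - W_hat
print("relative weight error:", np.linalg.norm(delta_w) / np.linalg.norm(W))
```

Zeros survive the quantization step because the grid is symmetric around zero, so the 2:4 pattern is preserved in `W_hat`.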

Implications and Future Speculation

This paper's implications extend broadly across both theoretical and practical spectrums of AI deployment. Practically, EoRA's training-free, swift error compensation method facilitates the effective deployment of large-scale models with enhanced efficiency, crucial for settings with varying computational constraints. Theoretically, this contribution stimulates further exploration into eigenspace-based optimization methods in varying domains of deep learning.

The interplay between eigenspace projections and low-rank approximation opens promising avenues for more nuanced adaptive techniques in model compression, inviting future work to explore such mechanisms in more complex architectures while retaining the flexibility offered by EoRA.

Conclusion

EoRA emerges as a significant contribution to the domain of model compression for LLMs, providing scalable, efficient, and adaptable solutions to compensate for errors introduced during compression. Its simplicity, combined with robustness against quantization and various compression methods, makes it a powerful tool for trading off accuracy loss against model capacity, thereby facilitating the practical deployment of efficient, large-scale AI models across diverse computational landscapes. The groundwork laid by this research encourages further exploration into advanced, adaptive techniques for optimizing modern AI infrastructures.
