ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models (2312.05821v4)
Abstract: In this paper, we introduce a new post-training compression paradigm for LLMs to facilitate their wider adoption. We delve into low-rank decomposition of LLM weights and find that the challenges of this task stem from the variance of the activation distribution and the differing sensitivities of the various kinds of layers. To address these issues, we propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD). Specifically, ASVD manages activation outliers by transforming the weight matrix based on the activation distribution: the outliers in the activation matrix are absorbed into the transformed weight matrix, which improves decomposition accuracy. Additionally, we propose an efficient iterative calibration process that optimizes the layer-specific decomposition by accounting for the varying sensitivity of different LLM layers. In this way, ASVD can compress a network by 10%–30%. Building on the successful low-rank decomposition of the projection matrices in the self-attention module, we further apply ASVD to compress the KV cache. By reducing the channel dimension of the KV activations, the memory requirements of the KV cache can be substantially reduced; ASVD achieves a 50% KV cache reduction without a performance drop, all in a training-free manner. Code is available anonymously in the supplementary materials.
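The core transformation described in the abstract (absorbing activation outliers into the weight matrix before decomposing it) can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch illustration, not the paper's implementation: the helper name `asvd_decompose`, the `act_scale` statistic, and the `alpha` exponent are assumptions, and scaling each weight column by a per-channel activation magnitude is only one plausible way to realize the activation-aware transform the abstract describes.

```python
import torch

def asvd_decompose(W: torch.Tensor, act_scale: torch.Tensor, rank: int, alpha: float = 0.5):
    """Hypothetical sketch of an activation-aware low-rank decomposition.

    W         : (out_features, in_features) weight of a linear layer
    act_scale : (in_features,) per-input-channel mean absolute activation,
                collected on a small calibration set (assumed statistic)
    rank      : target rank of the low-rank factors
    alpha     : exponent controlling how strongly activations rescale W
    """
    # Diagonal scaling S built from activation statistics: input channels with
    # large (outlier) activations get proportionally larger weight columns.
    S = act_scale.clamp(min=1e-6).pow(alpha)             # (in_features,)

    # Decompose the transformed weight W·diag(S) instead of W itself.
    U, sigma, Vh = torch.linalg.svd(W * S, full_matrices=False)

    # Keep only the top-`rank` singular triplets.
    U_k, s_k, Vh_k = U[:, :rank], sigma[:rank], Vh[:rank, :]

    # Fold diag(S)^-1 back into the right factor so that A @ B ≈ W.
    A = U_k * s_k            # (out_features, rank) = U_k Σ_k
    B = Vh_k / S             # (rank, in_features)  = V_k^T diag(S)^-1
    return A, B
```

In use, a linear layer `y = x @ W.T` would be replaced by two smaller layers with weights `B` and `A` (i.e., `y = (x @ B.T) @ A.T`), which saves parameters whenever `rank * (out_features + in_features)` is smaller than `out_features * in_features`.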