CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks (2401.14109v2)

Published 25 Jan 2024 in cs.CL, cs.AI, cs.LG, and quant-ph

Abstract: LLMs such as ChatGPT and LlaMA are advancing rapidly in generative AI, but their immense size poses significant challenges, such as huge training and inference costs, substantial energy demands, and limitations for on-site deployment. Traditional compression methods such as pruning, distillation, and low-rank approximation focus on reducing the effective number of neurons in the network, while quantization focuses on reducing the numerical precision of individual weights to reduce the model size while keeping the number of neurons fixed. While these compression methods have been relatively successful in practice, there is no compelling reason to believe that truncating the number of neurons is an optimal strategy. In this context, this paper introduces CompactifAI, an innovative LLM compression approach using quantum-inspired Tensor Networks that focuses on the model's correlation space instead, allowing for a more controlled, refined and interpretable model compression. Our method is versatile and can be implemented with, or on top of, other compression techniques. As a benchmark, we demonstrate that combining CompactifAI with quantization reduces the memory size of LlaMA 7B by 93% and its number of parameters by 70%, while accelerating training by 50% and inference by 25%, with only a small accuracy drop of 2%-3%, going well beyond what is achievable today by other compression techniques. Our methods also allow for a refined layer sensitivity profiling, showing that deeper layers tend to be more suitable for tensor network compression, which is consistent with recent observations on the ineffectiveness of those layers for LLM performance. Our results imply that standard LLMs are, in fact, heavily overparametrized and do not need to be large at all.

Methodology

The focus of the paper is CompactifAI, a compression method for LLMs built on quantum-inspired Tensor Networks (TNs). The technique diverges from traditional methods such as pruning, distillation, quantization, and low-rank approximation, which typically truncate the number of effective neurons or reduce the numerical precision of individual weights.

The CompactifAI approach targets the correlation space within the model, favoring a more nuanced and controlled compression strategy. Versatile by design, it can augment existing compression techniques to drive further model efficiency. The authors demonstrate that even after massive compression, the model retains over 90% of its initial accuracy with a brief period of distributed retraining.
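To make the contrast concrete, the following minimal sketch (not from the paper; the toy matrix, shapes, and budgets are illustrative assumptions) compares pruning neurons of a weight matrix with truncating its singular-value spectrum, the "correlation space" the method acts on, at the same parameter budget.

```python
import numpy as np

# Toy illustration of the contrast drawn above: removing neurons versus
# truncating the correlation (singular-value) spectrum of a weight matrix.
# The matrix and budgets are arbitrary; this is not the paper's experiment.
rng = np.random.default_rng(0)
# Build a matrix with a decaying spectrum, as trained weights typically have.
W = rng.standard_normal((512, 512)) * (1.0 / np.arange(1, 513))

k_neurons = 128   # keep 128 of 512 output neurons -> 128 * 512 parameters
k_rank = 64       # keep rank 64 -> 2 * 512 * 64 parameters (same budget)

# Neuron-style truncation: zero out all but the first k_neurons rows.
W_pruned = np.zeros_like(W)
W_pruned[:k_neurons, :] = W[:k_neurons, :]

# Correlation-style truncation: keep only the k_rank largest singular values.
u, s, vh = np.linalg.svd(W, full_matrices=False)
W_lowrank = (u[:, :k_rank] * s[:k_rank]) @ vh[:k_rank, :]

rel_err = lambda A: np.linalg.norm(W - A) / np.linalg.norm(W)
print(f"neuron truncation error:      {rel_err(W_pruned):.3f}")
print(f"correlation truncation error: {rel_err(W_lowrank):.3f}")
```

For the same number of retained parameters, cutting the correlation spectrum discards far less of the matrix than cutting neurons, which is the intuition behind compressing the correlation space rather than the neuron count.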

Implementation

The authors describe a compression pipeline in which the weight matrices of the network are decomposed into Tensor Networks such as Matrix Product Operators (MPOs). Correlations in the LLM's layers, specifically the self-attention and multi-layer perceptron layers, are truncated by controlling the bond dimension of the TN, which substantially lowers memory and energy requirements. For the LlaMA models considered, the weight matrices are reshaped and decomposed in this way, yielding a large reduction in parameter count. The paper then explains that a brief period of distributed retraining of the tensorized model restores near-original accuracy, which also makes the approach well suited to LLM fine-tuning.
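A minimal sketch of this decomposition step is shown below, assuming a single square weight matrix split into two MPO cores via a truncated SVD; the actual number of cores, the reshaping, and the per-layer bond dimensions are choices made in the paper and are not reproduced here.

```python
import numpy as np

def mpo_decompose(weight, chi):
    """Split a square weight matrix into two MPO cores via a truncated SVD.
    chi is the bond dimension: it controls how much correlation is kept.
    The two-core layout and shapes are illustrative, not the paper's exact scheme."""
    d_out, d_in = weight.shape
    o1 = o2 = int(np.sqrt(d_out))   # factor each physical index, e.g. 1024 -> 32 x 32
    i1 = i2 = int(np.sqrt(d_in))
    tensor = weight.reshape(o1, o2, i1, i2)
    # Group (o1, i1) against (o2, i2) and cut the bond with a truncated SVD.
    matrix = tensor.transpose(0, 2, 1, 3).reshape(o1 * i1, o2 * i2)
    u, s, vh = np.linalg.svd(matrix, full_matrices=False)
    chi = min(chi, s.size)
    core1 = (u[:, :chi] * s[:chi]).reshape(o1, i1, chi)   # left MPO core
    core2 = vh[:chi, :].reshape(chi, o2, i2)              # right MPO core
    return core1, core2

# Example: compress a 1024 x 1024 matrix at bond dimension 32.
W = np.random.randn(1024, 1024).astype(np.float32)
c1, c2 = mpo_decompose(W, chi=32)
# Contracting the cores reconstructs an approximation of the original matrix.
# (A random matrix has no structure to exploit; trained weights compress far better.)
W_approx = np.einsum('ika,ajl->ijkl', c1, c2).reshape(1024, 1024)
print(f"kept parameters: {(c1.size + c2.size) / W.size:.1%}")
print(f"reconstruction error: {np.linalg.norm(W - W_approx) / np.linalg.norm(W):.3f}")
```

In the pipeline described above, such cores replace the dense matrices of the self-attention and MLP layers, and the bond dimension is the knob that trades parameter count against retained correlations.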

Results

The CompactifAI methodology was benchmarked on the LlaMA-2 7B model from Meta's LlaMA series. The authors first used quantization to halve the memory requirement by moving from float32 to float16, then applied a Tensor Network compression that reduced the model to 30% of its float16 size. After additional retraining on text summarization tasks using the XSum and Gigaword datasets, the compressed model recovered nearly 90% of the accuracy of the original model.

Conclusions & Prospects

The CompactifAI method represents a significant advance toward energy-efficient and more accessible LLMs. It achieves profound reductions in model size with minimal accuracy loss, offering a more refined alternative to existing compression techniques. The work potentially paves the way for on-premises deployment of LLMs, extending their use to settings that cannot rely on cloud connectivity. Its compatibility with other compression methods further strengthens the case for CompactifAI as a versatile and potent tool in AI development, potentially helping to democratize AI technologies and to mitigate their environmental footprint.

Authors (18)
  1. Andrei Tomut (1 paper)
  2. Saeed S. Jahromi (28 papers)
  3. Sukhbinder Singh (15 papers)
  4. Faysal Ishtiaq (2 papers)
  5. Cesar Muñoz (14 papers)
  6. Prabdeep Singh Bajaj (1 paper)
  7. Ali Elborady (1 paper)
  8. Gianni del Bimbo (3 papers)
  9. Mehrazin Alizadeh (4 papers)
  10. David Montero (10 papers)
  11. Muhammad Ibrahim (16 papers)
  12. Oussama Tahiri Alaoui (1 paper)
  13. John Malcolm (1 paper)
  14. Samuel Mugel (22 papers)
  15. Roman Orus (77 papers)
  16. Abhijoy Sarkar (1 paper)
  17. Uygar Kurt (2 papers)
  18. Pablo Martin-Ramiro (4 papers)
Citations (8)