Emergent Mind

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

(2404.00456)
Published Mar 30, 2024 in cs.LG

Abstract

We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism and to the KV cache. The result is a quantized model where all matrix multiplications are performed in 4-bits, without any channels identified for retention in higher precision. Our quantized LLaMa2-70B model has losses of at most 0.29 WikiText-2 perplexity and retains 99% of the zero-shot performance. Code is available at: https://github.com/spcl/QuaRot.
[Figure: Distribution of activation responses within a neural network during different phases of learning.]

Overview

  • QuaRot introduces a quantization scheme for LLMs that quantizes weights, activations, and the KV cache to 4 bits end-to-end while retaining high model fidelity.

  • Uses rotations built from randomized Hadamard transformations to remove outlier features, which simplifies quantization while maintaining accuracy.

  • Demonstrates application to a quantized LLaMa2-70B model, which loses at most 0.29 WikiText-2 perplexity and retains 99% of zero-shot performance, alongside significant efficiency gains.

  • Opens new avenues for deploying language models in resource-constrained environments and motivates future work on integration into hardware accelerators.

The paper presents QuaRot, a quantization scheme that reduces the bit-width of model weights, activations, and KV cache to 4 bits end-to-end. QuaRot applies rotations to the model to remove outliers from its hidden states, which enables accurate low-precision inference and avoids the large performance degradation usually associated with conventional 4-bit quantization. The reported losses are small: the quantized LLaMa2-70B model loses at most 0.29 WikiText-2 perplexity and retains 99% of its zero-shot performance.

Key Contributions

QuaRot's methodology relies on computational invariance: it applies randomized Hadamard transformations to eliminate outlier features, which simplifies quantization. The rotations do not alter the model's output, but they make both weights and activations more amenable to quantization, so every matrix multiplication in the model can be performed in 4 bits with no channels kept in higher precision. This offers a practical route to deploying LLMs in resource-constrained environments. The paper articulates the following major contributions:

  • Introduction of the QuaRot quantization strategy that utilizes rotational operations to eliminate outliers in LLMs, facilitating effective 4-bit quantization.

  • Demonstration of its application to a quantized Llama model, showing significant efficiency gains including up to 2.16x prefill speedups and 3.39x memory savings during the decoding stage with minimal impact on performance.

  • Provision of empirical evidence showing that QuaRot maintains high accuracy levels across diverse tasks, underscoring its practicality for real-world applications.

  • Release of the accompanying codebase, offering the community a robust toolset for efficient LLM deployment.
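The computational invariance underlying these contributions can be written out in a few lines. The notation below is an assumption made for illustration and is not taken from this summary: X is a matrix of hidden states with d channels, W is a weight matrix, H is a d-by-d Hadamard matrix with entries ±1, D is a random diagonal matrix of signs, and RMSNorm is taken without a learned scale (assuming any such scale has been folded into the adjacent weight matrix).

```latex
\begin{align*}
Q &= \tfrac{1}{\sqrt{d}}\, H D, \qquad H H^\top = d\, I, \quad D = \operatorname{diag}(\pm 1)
  \;\Longrightarrow\; Q Q^\top = \tfrac{1}{d}\, H D D^\top H^\top = I, \\
(XQ)(Q^\top W) &= X (Q Q^\top) W = X W, \\
\operatorname{RMSNorm}(xQ) &= \frac{xQ}{\lVert xQ \rVert_2 / \sqrt{d}}
  = \frac{x}{\lVert x \rVert_2 / \sqrt{d}}\, Q = \operatorname{RMSNorm}(x)\, Q .
\end{align*}
```

The last line holds because an orthogonal Q preserves the Euclidean norm of each row, which is why a rotated hidden state can pass through normalization and still be undone by the next rotated weight matrix.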

The Insights Behind QuaRot

The key insight behind QuaRot is the computational invariance property of LLMs: the hidden states and activations can be rotated, with the inverse rotation absorbed into the adjacent weights, without affecting the model's output. By choosing randomized Hadamard transformations as the rotations, QuaRot spreads the magnitude of outlier features across all channels, removing the outliers that traditionally complicate low-bit quantization.
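A minimal NumPy sketch can illustrate both halves of this claim. It is not the authors' implementation: the helper names `hadamard`, `randomized_hadamard`, and `peak_to_rms`, the dimension 256, and the synthetic outlier channel are all assumptions chosen for illustration.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def randomized_hadamard(n: int, rng: np.random.Generator) -> np.ndarray:
    """Orthogonal matrix Q = H D / sqrt(n), with D a random diagonal of +/-1 signs."""
    signs = rng.choice([-1.0, 1.0], size=n)
    # Broadcasting multiplies column j of H by signs[j], which equals H @ diag(signs).
    return hadamard(n) * signs / np.sqrt(n)

def peak_to_rms(a: np.ndarray) -> float:
    """Ratio of the largest magnitude to the root-mean-square value; large means outliers."""
    return float(np.max(np.abs(a)) / np.sqrt(np.mean(a ** 2)))

rng = np.random.default_rng(0)
d = 256
x = rng.normal(size=(8, d))           # synthetic hidden states (assumption)
x[:, 3] *= 50.0                       # inject one outlier channel
w = rng.normal(size=(d, d)) / np.sqrt(d)

q = randomized_hadamard(d, rng)
x_rot = x @ q                         # rotated activations
w_rot = q.T @ w                       # inverse rotation folded into the weights

# The product is unchanged up to floating-point error.
invariance_error = float(np.max(np.abs(x @ w - x_rot @ w_rot)))

print("max output difference:", invariance_error)
print("peak/RMS before rotation:", peak_to_rms(x))
print("peak/RMS after rotation: ", peak_to_rms(x_rot))
```

The rotated activations carry the same total energy as the originals, but no single channel dominates, which is the property that makes a shared low-bit quantization scale workable.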

The technique goes beyond existing approaches by quantizing all components, including the KV cache used by the attention mechanism. This end-to-end approach contrasts with previous studies that primarily quantized either weights or activations separately, often leaving the KV cache in higher precision to avoid performance losses.
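To see why rotation helps when a cache-like tensor is stored in 4 bits, the sketch below compares round-to-nearest quantization error with and without a randomized Hadamard rotation. It assumes a simple symmetric per-row scheme with integer codes in [-7, 7]; the paper's actual quantizer settings (grouping, clipping, symmetric versus asymmetric) are not described in this summary and are not reproduced here. The names `quantize_int4`, `dequantize`, and `relative_error`, and the synthetic `values` tensor, are hypothetical.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def quantize_int4(x: np.ndarray):
    """Symmetric per-row round-to-nearest quantization to integer codes in [-7, 7]."""
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)   # avoid division by zero on all-zero rows
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map integer codes back to floating point using the per-row scale."""
    return codes.astype(np.float64) * scale

def relative_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Frobenius-norm error of the reconstruction relative to the original."""
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))

rng = np.random.default_rng(0)
head_dim = 128
values = rng.normal(size=(64, head_dim))   # stand-in for cached vectors (assumption)
values[:, 5] *= 30.0                       # one outlier channel

# Randomized Hadamard rotation: orthogonal, so rot @ rot.T is the identity.
signs = rng.choice([-1.0, 1.0], size=head_dim)
rot = hadamard(head_dim) * signs / np.sqrt(head_dim)

# Quantize directly: the outlier sets the scale and small entries round to zero.
codes_plain, scale_plain = quantize_int4(values)
err_plain = relative_error(values, dequantize(codes_plain, scale_plain))

# Quantize in the rotated basis, then rotate back after dequantization.
codes_rot, scale_rot = quantize_int4(values @ rot)
err_rotated = relative_error(values, dequantize(codes_rot, scale_rot) @ rot.T)

print("relative error without rotation:", err_plain)
print("relative error with rotation:   ", err_rotated)
```

In this toy setting the rotated tensor reconstructs with a noticeably smaller error, because the per-row scale is no longer dictated by a single large channel.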

Implications and Future Directions

QuaRot has both theoretical and practical implications. It advances our understanding of the quantization landscape for LLMs and opens up new avenues for deploying language models on hardware with limited computational resources. This is particularly relevant where deploying full-precision models is infeasible due to power, speed, or memory constraints.

Looking forward, applying QuaRot to a broader range of models and tasks is a promising research direction. Integrating QuaRot into hardware accelerators could also yield further efficiency gains.

In conclusion, QuaRot represents a significant step forward in model quantization, offering a viable solution to the challenge of deploying LLMs efficiently. Its ability to preserve model accuracy while reducing compute and memory cost holds considerable promise for AI applications, particularly in resource-constrained environments.
