
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2208.07339v2)

Published 15 Aug 2022 in cs.LG and cs.AI

Abstract: LLMs have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer LLMs that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while more than 99.9% of values are still multiplied in 8-bit. Using LLM.int8(), we show empirically it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.

Authors (4)
  1. Tim Dettmers (22 papers)
  2. Mike Lewis (78 papers)
  3. Younes Belkada (9 papers)
  4. Luke Zettlemoyer (225 papers)
Citations (549)

Summary

Overview of 8-bit Matrix Multiplication for LLMs

The paper "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" tackles a significant challenge in the domain of large-scale NLP models: the high memory requirements posed by these models during inference. The authors propose an innovative approach to quantize transformer models into 8-bit representations, specifically focusing on feed-forward and attention projection layers, thereby halving the memory footprint while maintaining performance integrity.

Key Contributions

This work introduces LLM.int8(), a two-part quantization strategy that balances precision and memory efficiency (a short code sketch combining both parts follows the list):

  1. Vector-wise Quantization: This improves upon row-wise quantization by assigning separate normalization constants to each inner product in the matrix multiplication, i.e., one scale per row of the hidden states and one per column of the weights, so different parts of the matrices are quantized independently and with higher precision.
  2. Mixed-precision Decomposition: To address the extreme outlier features that emerge in large transformer models, the outlier feature dimensions are isolated into a 16-bit matrix multiplication, while more than 99.9% of values are still multiplied in 8-bit.
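
The sketch below illustrates both steps on a single matrix multiplication in NumPy. It is a minimal illustration under simplifying assumptions, not the paper's bitsandbytes kernels: the outlier criterion (any activation column whose magnitude exceeds a fixed threshold), the threshold value, and all names are stand-ins, and everything runs in float32/float64 rather than fp16.

```python
# Minimal NumPy sketch of LLM.int8()-style vector-wise quantization with
# mixed-precision decomposition (illustrative only; not the bitsandbytes kernels).
import numpy as np

def int8_matmul_mixed(X, W, outlier_threshold=6.0):
    """Approximate X @ W, keeping outlier feature dimensions in higher precision."""
    # 1. Find outlier feature dimensions: columns of X containing large magnitudes.
    outlier_cols = np.abs(X).max(axis=0) > outlier_threshold
    regular_cols = ~outlier_cols

    # 2. Vector-wise quantization of the regular part: one scale per row of X
    #    and one scale per column of W, so each inner product gets its own
    #    pair of normalization constants.
    Xr, Wr = X[:, regular_cols], W[regular_cols, :]
    cx = np.abs(Xr).max(axis=1, keepdims=True) / 127.0  # (n, 1) row scales
    cw = np.abs(Wr).max(axis=0, keepdims=True) / 127.0  # (1, m) column scales
    Xq = np.round(Xr / np.maximum(cx, 1e-8)).astype(np.int8)
    Wq = np.round(Wr / np.maximum(cw, 1e-8)).astype(np.int8)

    # 3. Int8 matmul accumulated in int32, then dequantized by the outer
    #    product of the row and column scales.
    out8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)).astype(np.float32) * (cx * cw)

    # 4. The few outlier dimensions are multiplied in full precision and added back.
    out16 = X[:, outlier_cols] @ W[outlier_cols, :]
    return out8 + out16

# Quick self-check: inject one outlier dimension and compare against exact matmul.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)); X[:, 3] *= 20.0
W = rng.normal(size=(64, 8))
print(np.max(np.abs(int8_matmul_mixed(X, W) - X @ W)))  # small quantization error
```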

These techniques allow large models, such as those with up to 175B parameters, to run effectively on a single server with consumer GPUs, as illustrated by the loading sketch below.
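
Because the released bitsandbytes kernels are integrated into Hugging Face Transformers, loading a checkpoint with Int8 weights is roughly a one-line change in recent library versions. The snippet below is a hedged sketch: the exact keyword arguments vary across versions, and the checkpoint name is only an example.

```python
# Hypothetical usage sketch of the bitsandbytes integration in Hugging Face
# Transformers; the model name is an example and the API may differ by version.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-13b"  # example checkpoint; any supported causal LM

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # convert linear layers to Int8 (LLM.int8()) at load time
    device_map="auto",   # spread layers across available GPUs
)

inputs = tokenizer("Quantization makes large models", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```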

Numerical Results and Implications

The empirical results are compelling:

  • Models with up to 175B parameters show no degradation in perplexity or zero-shot task accuracy relative to their 16-bit baselines.
  • The approach roughly halves the memory required for inference, making it feasible to deploy models such as OPT-175B and BLOOM on a single server with consumer GPUs.
  • The analysis of systematic outlier phenomena across transformer layers offers a nuanced view of how these features affect attention and overall predictive performance.

Implications and Future Directions

Practically, LLM.int8() can democratize access to state-of-the-art large models, enabling more organizations and researchers to experiment and innovate without the prohibitive hardware requirements traditionally associated with such models.

Theoretically, the paper opens avenues for further exploration into the nature of emergent outlier features within transformers. Understanding these features could lead to more robust, scalable, and efficient model designs in the future.

The research also prompts questions about the potential for training efficiencies using mixed-precision approaches. While this paper focuses on inference, the methodologies proposed might spark investigations into low-bit training paradigms, which could lead to even broader implications for model scalability and accessibility.

Conclusion

The paper successfully addresses a crucial limitation in deploying LLMs by introducing methods that maintain model performance while significantly reducing memory requirements. The insights into emergent feature behavior in transformers and their impact on quantization precision are particularly valuable, both for immediate practical applications and for shaping future research trajectories in AI.
