
8-bit Optimizers via Block-wise Quantization (2110.02861v2)

Published 6 Oct 2021 in cs.LG

Abstract: Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear optimization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.

Authors (4)
  1. Tim Dettmers (22 papers)
  2. Mike Lewis (78 papers)
  3. Sam Shleifer (15 papers)
  4. Luke Zettlemoyer (225 papers)
Citations (217)

Summary

  • The paper introduces 8-bit stateful optimizers using block-wise quantization to reduce memory overhead while matching the performance of 32-bit counterparts.
  • Block-wise dynamic quantization divides tensors into smaller segments to isolate outliers and flexibly manage precision in gradient statistics.
  • Experimental results show that 8-bit optimizers achieve comparable outcomes in language modeling and image classification, saving around 2GB of memory and reducing training time.

8-bit Optimizers via Block-wise Quantization

The paper "8-bit Optimizers via Block-wise Quantization" tackles the problem of reducing memory usage in stateful optimizers, essential tools in training large-scale neural models, by using 8-bit quantization techniques. The authors propose a novel approach that achieves memory efficiency while maintaining the performance typically associated with 32-bit states, fundamentally altering the trade-offs between computational resources and model scaling. This approach enables the allocation of more memory resources to increase model size rather than model state during training.

Methodology

Stateful optimizers such as Adam and SGD with momentum maintain gradient statistics in memory. These statistics have conventionally been stored in 32-bit precision, which constitutes a significant memory overhead. This paper introduces 8-bit statistics via block-wise dynamic quantization, a strategy designed to manage numerical precision and stability.
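For concreteness (the update rules below are the standard Adam moments, not reproduced in this summary), Adam keeps two state tensors per parameter tensor: exponential moving averages of the gradient and of its element-wise square. At 32-bit precision this adds 8 bytes of optimizer state per model parameter, versus 2 bytes with 8-bit state plus small per-block quantization constants:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}$$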

Block-wise dynamic quantization divides tensors into smaller blocks and quantizes each block independently. This not only enables parallel processing across cores but also isolates outliers, improving quantization precision. The authors combine this with dynamic quantization, a non-linear scheme that remains precise for both large- and small-magnitude values, and with a stable embedding layer that reduces the gradient variance caused by the highly non-uniform distribution of input tokens.
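The following is a minimal NumPy sketch of the block-wise idea only; the 8-bit codebook is simplified to a uniform grid over [-1, 1] (the paper's dynamic quantization map is non-linear), and the real optimizers run as fused GPU kernels. Function names and the block size here are illustrative.

```python
import numpy as np

# Simplified 8-bit codebook: a uniform grid over [-1, 1].
# The paper's dynamic quantization map is non-linear (denser near zero).
CODEBOOK = np.linspace(-1.0, 1.0, 256, dtype=np.float32)

def blockwise_quantize(x, block_size=2048):
    """Quantize a 1-D float32 array to 8-bit codebook indices, block by block.

    Each block is normalized by its own absolute maximum, so a single
    outlier only degrades precision inside its own block.
    """
    n = x.size
    pad = (-n) % block_size
    blocks = np.concatenate([x, np.zeros(pad, dtype=x.dtype)]).reshape(-1, block_size)

    scales = np.abs(blocks).max(axis=1, keepdims=True)   # per-block absmax
    scales = np.where(scales == 0, 1.0, scales)          # guard against all-zero blocks
    normed = blocks / scales                             # values now lie in [-1, 1]

    # Index of the nearest codebook entry for every element (stored as uint8).
    idx = np.abs(normed[..., None] - CODEBOOK).argmin(axis=-1).astype(np.uint8)
    return idx, scales.astype(np.float32), n

def blockwise_dequantize(idx, scales, n):
    """Look up codebook values and rescale each block by its stored absmax."""
    return (CODEBOOK[idx] * scales).reshape(-1)[:n]

# Round-trip check: the outlier at position 123 only affects its own block.
x = np.random.randn(10_000).astype(np.float32)
x[123] = 50.0
idx, scales, n = blockwise_quantize(x)
x_hat = blockwise_dequantize(idx, scales, n)
print("mean abs error:", float(np.abs(x - x_hat).mean()))
```

Storing one scale per 2048-element block keeps the overhead of the quantization constants negligible relative to the 8-bit state itself.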

Results

The results presented in the paper indicate that 8-bit optimizers match the performance of their 32-bit counterparts across a range of tasks, including language modeling, image classification, and machine translation. For instance, in RoBERTa pretraining, the 8-bit Adam optimizer achieved a median GLUE score comparable to 32-bit Adam while saving approximately 2 GB of memory. The paper also reports reduced training time and the ability to train larger models within constrained memory budgets, improving accessibility for users without high-memory GPUs.
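A back-of-the-envelope check makes the reported saving plausible, assuming RoBERTa-large's roughly 355M parameters (a figure not stated in this summary): Adam stores two state values per parameter, so moving from 4-byte to 1-byte statistics saves about

$$355\text{M} \times 2 \times (4 - 1)\ \text{bytes} \approx 2.1\ \text{GB},$$

ignoring the small overhead of per-block quantization constants.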

Implications and Future Work

The implications of this research are significant for both practical and theoretical aspects of AI development. Practically, it facilitates larger model training within existing hardware constraints, potentially democratizing access to advanced machine learning technologies. Theoretically, it contributes to the understanding of how quantization affects model stability and optimization dynamics.

Future research directions could explore the integration of 8-bit optimization with other model compression techniques such as pruning and knowledge distillation. Additionally, extending this approach to model architectures that heavily rely on large activations, such as deep convolutional networks, could provide further insights into generalized quantization schemes.

In summary, the paper makes a substantial contribution to the landscape of efficient model optimization, providing both a practical tool for immediate application and a foundation for future exploration into low-bit optimization schemes.
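As an illustration of the advertised two-line change, and assuming the open-sourced package is the authors' bitsandbytes library (not named in this summary), the drop-in replacement amounts to swapping the optimizer constructor while keeping hyperparameters unchanged:

```python
import torch
import bitsandbytes as bnb  # assumed to be the authors' open-source package

model = torch.nn.Linear(1024, 1024).cuda()

# Before: 32-bit optimizer state
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# After: 8-bit Adam with block-wise quantized state, same hyperparameters
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
```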
