- The paper introduces QMoE, which compresses trillion-parameter MoE models to sub-1-bit storage without retraining.
- It demonstrates a 20x memory reduction by compressing a 3.2TB model to under 160GB with less than 5% runtime overhead.
- The work leverages data-dependent quantization and custom GPU kernels to maintain model performance while significantly easing hardware demands.
An Expert Overview of "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models"
The paper, "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models," by Elias Frantar and Dan Alistarh, addresses the significant challenges posed by the memory demands of Mixture-of-Experts (MoE) architectures in LLMs. The authors propose a novel framework, QMoE, which efficiently compresses massive models to less than 1 bit per parameter, allowing them to run on commodity hardware.
Background and Motivation
MoE architectures use sparse routing so that each token is processed by only a small subset of experts, promising better accuracy for a given amount of compute at the cost of vastly increased parameter counts. For instance, the SwitchTransformer-c2048 model has 1.6 trillion parameters and requires 3.2TB of memory in 16-bit precision, which makes deployment on standard hardware impractical. Reducing this memory footprint is therefore essential, particularly since typical compression techniques cannot reach the required reduction ratios without significant accuracy degradation.
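To make this trade-off concrete, the following minimal PyTorch snippet sketches Switch-style top-1 routing. It is purely illustrative and not the SwitchTransformer implementation (the module structure, dimensions, and the omission of expert capacity limits and load-balancing losses are simplifications): each token activates a single expert, so per-token compute stays roughly constant while the total parameter count grows with the number of experts.

```python
# Illustrative top-1 (Switch-style) MoE layer; a simplified sketch, not the
# SwitchTransformer implementation (no capacity limits or load-balancing loss).
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: [tokens, d_model]
        probs = self.router(x).softmax(dim=-1)   # routing probabilities
        gate, idx = probs.max(dim=-1)            # each token picks ONE expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # Only the selected expert's weights are used for these tokens,
                # so per-token FLOPs stay flat while parameters scale with num_experts.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```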
Key Contributions
Frantar and Alistarh introduce QMoE, a framework that compresses MoE models to a sub-1-bit-per-parameter format without any retraining and with only minor accuracy loss. Their flagship result concerns the largest publicly available MoE, SwitchTransformer-c2048, which QMoE compresses from 3.2TB to under 160GB, a roughly 20x reduction, while adding less than 5% runtime overhead.
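A quick back-of-envelope check (assuming a 16-bit baseline for the uncompressed weights) shows how these headline figures fit together:

```python
# Sanity-check of the reported sizes, assuming 16-bit (2-byte) uncompressed weights.
params = 1.6e12                        # SwitchTransformer-c2048 parameter count
dense_bytes = params * 2               # ~3.2e12 bytes = 3.2 TB
compressed_bytes = 160e9               # reported compressed size (< 160 GB)

print(dense_bytes / 1e12)              # 3.2  -> TB for the uncompressed model
print(compressed_bytes * 8 / params)   # 0.8  -> bits per parameter after compression
print(dense_bytes / compressed_bytes)  # 20.0 -> overall reduction factor
```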
Technical Approach
The authors detail several innovations that enable efficient compression and decoding:
- Scalable Compression Algorithm: The approach scales data-dependent quantization to trillion-parameter MoEs, compressing weights to ternary values using only a modest amount of calibration data; notably, these very large models turn out to be surprisingly robust to such aggressive quantization (see the first sketch after this list).
- Custom Compression Format: Because the ternary weights are highly sparse, they have low entropy; the authors design a dictionary-based encoding that exploits this structure to store weights at under 1 bit per parameter on average (see the second sketch after this list).
- Bespoke GPU Kernels: Decoding kernels are co-designed with the compression format so that compressed weights can be decompressed on the fly directly on the GPU, keeping the inference overhead of the compressed model low.
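The first bullet refers to data-dependent quantization. The sketch below is a heavily simplified stand-in, not the paper's GPTQ-based procedure: for each weight column it grid-searches a ternary clipping threshold and picks the scale that minimizes the layer's output error on a small calibration batch. The function names, grid-search strategy, and toy dimensions are illustrative choices.

```python
# Simplified data-dependent ternary quantization (illustrative; not the paper's algorithm).
import numpy as np

def ternary_quantize_column(w, X, num_grid=32):
    """Quantize one weight column w (shape [d_in]) to values in {-s, 0, +s}.

    The clipping threshold and scale s are chosen to minimize the output error
    ||X @ w - X @ w_q||^2 on calibration activations X, i.e. the choice is
    data-dependent rather than based on the weights alone.
    """
    ref = X @ w                                      # reference layer output
    best_err, best_wq = np.inf, np.zeros_like(w)
    for frac in np.linspace(0.05, 1.0, num_grid):
        thresh = frac * np.abs(w).max()
        pattern = np.sign(w) * (np.abs(w) > thresh)  # ternary pattern in {-1, 0, +1}
        if not pattern.any():
            continue
        y = X @ pattern
        s = float(y @ ref) / float(y @ y)            # closed-form optimal scale
        err = np.sum((ref - s * y) ** 2)
        if err < best_err:
            best_err, best_wq = err, s * pattern
    return best_wq

def ternary_quantize(W, X):
    """Quantize each column of W independently against calibration inputs X."""
    return np.stack([ternary_quantize_column(W[:, j], X) for j in range(W.shape[1])], axis=1)

# Toy usage: a 512x512 expert weight matrix and 128 calibration tokens.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)
X = rng.normal(size=(128, 512)).astype(np.float32)
Wq = ternary_quantize(W, X)
print("unique values per column:", np.unique(np.sign(Wq[:, 0])))
```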
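To see why sub-1-bit storage is plausible for such weights, the second toy below measures the entropy of a sparse ternary weight vector and compares a naive 2-bit packing against a generic entropy coder. Here zlib is only a stand-in for the paper's custom dictionary-based format (which is co-designed with GPU decoding kernels), and the 90% sparsity level is an illustrative assumption.

```python
# Toy demonstration that highly sparse ternary weights compress far below a
# naive 2-bit packing; zlib stands in for an entropy-aware coder here.
import zlib
import numpy as np

rng = np.random.default_rng(0)
n = 4_000_000
# Ternary weights with ~90% zeros (an assumed, illustrative sparsity level).
weights = rng.choice([-1, 0, 1], size=n, p=[0.05, 0.90, 0.05]).astype(np.int8)

# Theoretical lower bound: Shannon entropy of the empirical symbol distribution.
_, counts = np.unique(weights, return_counts=True)
p = counts / n
entropy = float(-(p * np.log2(p)).sum())

# Pack four 2-bit symbols per byte (map {-1, 0, +1} -> {2, 0, 1}), then deflate.
symbols = np.where(weights < 0, 2, weights).astype(np.uint8)
packed = (symbols[0::4] | (symbols[1::4] << 2) | (symbols[2::4] << 4) | (symbols[3::4] << 6))
compressed = zlib.compress(packed.tobytes(), level=9)

print(f"entropy:     {entropy:.2f} bits/weight")
print(f"naive 2-bit: 2.00 bits/weight")
print(f"deflate:     {8 * len(compressed) / n:.2f} bits/weight")
```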
Numerical Results
The paper reports that QMoE achieves compression rates of up to 20.07x on the MoE modules, which account for the vast majority of the model's parameters, translating into the overall size reduction described above. With ternary quantization, the compressed SwitchTransformer-c2048 incurs a validation loss increase of only about 6.7%, indicating that the method balances aggressive compression against a small loss in model quality.
Practical Implications
With QMoE, institutions and developers can run trillion-parameter models without the previously prohibitive hardware requirements, potentially democratizing access to cutting-edge AI technologies. This holds promise for increased experimentation, fine-tuning, and real-world application of large MoE models across various domains.
Theoretical Implications
The work suggests that compression approaches for very large models deserve rethinking: far higher compression ratios appear feasible than previously anticipated, even for complex, high-performing models.
Future Directions
While the present paper focuses on compressing pre-trained models, future research could explore compression during or after fine-tuning for specific applications, potentially combining QMoE with methods such as QLoRA. Extending QMoE to other emerging MoE architectures could further broaden its applicability.
Conclusion
In summary, Frantar and Alistarh's work on QMoE provides a practical solution to the substantial memory demands of massive MoE models, enabling broader accessibility and setting the stage for ongoing advances in model compression for AI deployments.