- The paper proposes a novel low-rank correction (LRC) method that substantially reduces quantization error in 4-bit quantized Large Language Models (LLMs) when both weights and activations are quantized (W4A4).
- The methodology involves a joint, iterative optimization process that refines quantized weights while simultaneously optimizing full-precision low-rank matrices added to correct activation quantization errors.
- Experimental results show that LRC with a rank equal to 10% of the weight matrix dimension reduces the accuracy gap by over 50%, and a 30% rank closes the gap entirely, matching the performance of the original full-precision models.
An Analysis of Low-Rank Correction for Quantized LLMs
The paper "Low-Rank Correction for Quantized LLMs" addresses the critical challenge of compressing LLMs post-training to reduce their computational and memory requirements during inference. The proposed solution introduces a novel method of low-rank correction (LRC) to alleviate the errors arising from the quantization of both weights and activations in LLMs, focusing specifically on 4-bit quantization schemes for weights and activations (W4A4).
Methodology
The core idea presented in the paper is to mitigate the quantization errors that degrade LLM performance by adding low-rank weight matrices kept in full precision. These matrices act on the unquantized activations. The approach solves a joint optimization problem: the original weights are quantized while the low-rank matrices are simultaneously optimized to compensate for activation quantization errors, so that the corrected layer output stays close to that of the full-precision layer.
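To make the layer structure concrete, the sketch below shows how such a corrected linear layer could look in PyTorch. The names (`fake_quant`, `lrc_linear`) and the symmetric round-to-nearest quantizer are illustrative assumptions rather than the paper's actual kernels; the essential point it captures is that the low-rank product U·Vᵀ stays in full precision and multiplies the unquantized activations.

```python
import torch

def fake_quant(t: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulated symmetric per-tensor round-to-nearest quantization
    (an illustrative stand-in for a real W4A4 quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    return torch.round(t / scale).clamp(-qmax - 1, qmax) * scale

def lrc_linear(x: torch.Tensor, W_q: torch.Tensor,
               U: torch.Tensor, V: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """y = Q(x) @ W_q^T + x @ (U V^T)^T

    W_q is the quantized weight (d_out, d_in); U (d_out, r) and V (d_in, r)
    form the full-precision low-rank correction, applied to the
    *unquantized* activations x (batch, d_in)."""
    x_q = fake_quant(x, bits)              # quantized activations (the A4 path)
    return x_q @ W_q.T + x @ (V @ U.T)     # low-rank term stays in full precision
```

At inference, the first term can run on low-precision matrix-multiply kernels, while the second is a thin full-precision matmul whose cost scales with the chosen rank.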
The low-rank correction framework is cast as an optimization problem that proceeds through the following steps:
- Weight Quantization: The paper builds on existing solvers for weight-only quantization (e.g., GPTQ) and refines them to also account for quantized activations in the joint optimization setup.
- Low-Rank Matrix Addition: Full-precision low-rank matrices are introduced that act on the unquantized activations and absorb the residual activation quantization error.
- Iterative Optimization: The optimization alternates between updating the quantized weights and refitting the low-rank matrices, with each update reducing the overall quantization error (a sketch follows this list).
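The sketch below illustrates one way such an alternating loop could be organized on calibration activations. It is a simplified approximation under stated assumptions: `fake_quant` (from the earlier sketch) replaces the paper's GPTQ-style, data-aware weight solver, and the low-rank term is refit by least squares followed by a truncated SVD; the paper's actual algorithm, update order, and objective may differ.

```python
import torch  # fake_quant is defined in the earlier sketch

def alternate_lrc(W: torch.Tensor, X: torch.Tensor, rank: int,
                  bits: int = 4, iters: int = 3):
    """Alternating optimization sketch for one linear layer.

    W: (d_out, d_in) full-precision weights; X: (n, d_in) calibration
    activations. Approximately minimizes the layer reconstruction error
    || X W^T - ( Q(X) W_q^T + X (U V^T)^T ) ||_F."""
    target = X @ W.T                       # full-precision layer output
    X_q = fake_quant(X, bits)              # quantized calibration activations
    U = torch.zeros(W.shape[0], rank)
    V = torch.zeros(W.shape[1], rank)
    for _ in range(iters):
        # (i) Weight update: quantize W so that Q(X) W_q^T explains the part
        #     of the target not covered by the low-rank term. Plain
        #     round-to-nearest stands in here for a GPTQ-style solve against
        #     (X_q, target - X @ V @ U.T).
        W_q = fake_quant(W, bits)
        # (ii) Low-rank update: fit a dense corrector L on the unquantized
        #      activations by least squares, then truncate it to `rank`.
        residual = target - X_q @ W_q.T            # (n, d_out)
        L = (torch.linalg.pinv(X) @ residual).T    # (d_out, d_in)
        Us, S, Vh = torch.linalg.svd(L, full_matrices=False)
        U = Us[:, :rank] * S[:rank]
        V = Vh[:rank, :].T
    return W_q, U, V
```

With a data-aware weight solver in step (i), each round re-quantizes the weights against the current low-rank term, which is what makes the alternation effective; the round-to-nearest placeholder above keeps the sketch short but does not change across iterations.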
Key Results
The experimental validation in the paper spans several modern LLM architectures, including Llama-2, Llama-3, Phi-3, and Mixtral. Quantifying the efficacy of the LRC method, the authors report that:
- With a rank equal to only 10% of the original weight matrix dimension, LRC reduces the accuracy gap to the original models by more than 50%.
- Increasing the rank to 30% closes the accuracy gap entirely, achieving performance on par with the original full-precision models.
Implications and Future Work
The LRC method represents a significant advance in model compression techniques. Its ability to improve both memory efficiency and inference speed while maintaining accuracy broadens its applicability, particularly in resource-constrained environments such as mobile devices and edge computing.
In practice, LRC can also be integrated with state-of-the-art quantization techniques such as QuaRot and SPQR, with which the framework is inherently compatible. The theoretical contributions further inform the trade-offs between compression and accuracy in AI model deployment.
Future work may investigate alternative low-rank strategies, extend the method to more aggressive bit constraints such as W4A2, or integrate LRC with more sophisticated online quantization techniques. There is also scope for reducing the computational overhead introduced by the low-rank matrices themselves and for examining low-rank correction in architectures beyond transformers.
In summary, the paper presents a robust method for post-training quantization of LLMs, demonstrating significant compression without performance loss. The results underscore the efficacy of low-rank correction in mitigating quantization-induced errors and point to a promising direction for research on efficient AI model deployment.