
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (2306.03078v1)

Published 5 Jun 2023 in cs.CL and cs.LG

Abstract: Recent advances in LLM pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B parameter range, which are well-suited for edge deployments. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique which enables for the first time near-lossless compression of LLMs across model scales, while reaching similar compression levels to previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits, and achieves relative accuracy losses of less than 1% in perplexity for highly accurate LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation at a 15% speedup, thus making powerful LLMs available to consumers without any downsides. SpQR comes with efficient algorithms for both encoding weights into its format, as well as decoding them efficiently at runtime. Specifically, we provide an efficient GPU inference algorithm for SpQR which yields faster inference than 16-bit baselines at similar accuracy, while enabling memory compression gains of more than 4x.

Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

In the exploration of compressing LLMs for practical deployment on consumer-grade hardware, this paper introduces the Sparse-Quantized Representation (SpQR) approach. The purpose of SpQR is to achieve near-lossless compression of LLM weights with 3- to 4-bit quantization, with the ultimate goal of making these models viable for edge devices while preserving their performance. The approach directly tackles the accuracy degradation that low-bit weight quantization typically causes.
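A quick back-of-the-envelope calculation shows why 3-4 bit quantization is the enabling factor here. The 33B-parameter / 24 GB figure comes from the abstract; everything else below is illustrative arithmetic rather than numbers reported in the paper:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weight storage in GiB; ignores activations, KV cache, and runtime buffers."""
    return n_params * bits_per_param / 8 / 2**30

n_params = 33e9  # the 33B-parameter model highlighted in the abstract

for bits in (16, 4.75, 3.4):
    print(f"{bits:>5} bits/param -> {weight_memory_gb(n_params, bits):5.1f} GiB")

# 16 bits/param      -> ~61 GiB: does not fit a 24 GB consumer GPU
# 3.4-4.75 bits/param -> ~13-18 GiB: leaves headroom for activations and the KV cache
```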

Core Innovations and Methodology

The SpQR technique introduces a dual-stage system for managing quantization errors that typically arise during low-bitwidth weight compression:

  1. Outlier Isolation: The approach identifies the weights causing the most significant errors when quantized (outliers) and isolates them to be stored in higher precision. The outlier detection is predicated on assessing the sensitivity of each weight to quantization, calculated through an analysis that considers both direct rounding errors and correlations between weights.
  2. Bilevel Quantization: For weights not classified as outliers, SpQR applies a two-tier quantization strategy. First, small groups of weights (e.g., 16 each) are quantized, preserving fine granularity and capturing local data patterns. Second, the quantization statistics (scales and zero points) of these groups are themselves quantized to a small number of bits, preserving the overall compression ratio. A simplified code sketch of both stages follows this list.
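The real SpQR format derives weight sensitivities from the GPTQ objective and packs outliers into a sparse structure; the sketch below is only a minimal round-trip illustration of the two stages under simplified stand-ins (uniform symmetric quantization, a caller-supplied sensitivity score, illustrative function and parameter names):

```python
import numpy as np

def quantize_group(x, bits, scale):
    """Round-trip uniform symmetric quantization of one group (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def spqr_like_compress(weights, sensitivity, bits=3, group_size=16,
                       stat_bits=3, outlier_fraction=0.01):
    """Toy round-trip of the two SpQR ideas on a 1-D weight vector.

    1) Outlier isolation: the `outlier_fraction` most sensitive weights are
       left untouched, standing in for storage in higher precision.
    2) Bilevel quantization: remaining weights are quantized in small groups,
       and the per-group scales are themselves quantized with `stat_bits`.
    """
    w = weights.astype(np.float32).copy()

    # --- Stage 1: pick outliers by sensitivity (a stand-in criterion) -------
    k = max(1, int(outlier_fraction * w.size))
    threshold = np.partition(sensitivity, -k)[-k]
    outliers = sensitivity >= threshold          # kept in full precision

    # --- Stage 2: bilevel quantization of the non-outlier weights -----------
    dense = w.copy()
    dense[outliers] = 0.0                        # exclude outliers from groups
    n_groups = w.size // group_size
    groups = dense[: n_groups * group_size].reshape(n_groups, group_size)

    scales = np.abs(groups).max(axis=1) / (2 ** (bits - 1) - 1) + 1e-8
    # Second level: quantize the group scales with one coarse shared step.
    step = scales.max() / (2 ** stat_bits - 1)
    q_scales = np.maximum(np.round(scales / step), 1.0) * step

    deq = np.vstack([quantize_group(g, bits, s) for g, s in zip(groups, q_scales)])
    out = dense.copy()
    out[: n_groups * group_size] = deq.ravel()
    out[outliers] = w[outliers]                  # splice full-precision outliers back
    return out
```

In the actual format, the isolated outliers are stored sparsely (value plus column index) and the quantized group statistics are packed into a dedicated layout; the sketch above only reproduces the numerical effect of keeping them in full precision.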

The post-training quantization (PTQ) algorithm used in SpQR builds on GPTQ, which quantizes weights column by column, computing the quantization error at each step and compensating for it by updating the not-yet-quantized weights, thereby distributing the error where it hurts least.
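As a rough illustration of that column-wise quantize-and-compensate loop, the toy function below assumes a precomputed inverse input Hessian and omits the blocking, Cholesky factorization, and lazy batched updates of the real GPTQ implementation; all names are illustrative:

```python
import numpy as np

def gptq_like_quantize(W, H_inv, bits=4):
    """Toy column-by-column quantize-and-compensate loop in the spirit of GPTQ.

    W     : (rows, cols) weight matrix of one linear layer
    H_inv : (cols, cols) inverse input Hessian, assumed precomputed from
            calibration activations (GPTQ obtains it via a Cholesky factor)
    """
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax + 1e-12   # one scale per output row
    Q = np.zeros_like(W)

    for i in range(cols):
        col = W[:, i]
        q = np.clip(np.round(col / scale), -qmax, qmax) * scale
        Q[:, i] = q
        err = (col - q) / H_inv[i, i]
        # Compensate: push the quantization error onto columns not yet quantized.
        if i + 1 < cols:
            W[:, i + 1:] -= np.outer(err, H_inv[i, i + 1:])
    return Q
```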

Experimental Validation

Results from experimental validation demonstrate the efficacy of SpQR across different scales of pre-trained LLMs, including the LLaMA and Falcon model families. The paper provides empirical evidence that SpQR maintains fidelity close to 16-bit baselines while requiring only 3.4-4.75 bits per parameter on average. SpQR not only compresses models without noticeable degradation across various perplexity benchmarks but also improves GPU inference speed by 20-30% compared to full 16-bit operation.
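One way to see how such fractional bit widths arise is to account for the three cost components separately: the base bits per weight, the amortized cost of the quantized group statistics, and the sparse outlier storage. The accounting below is an assumption-laden sketch with illustrative defaults; the paper's exact bookkeeping and hyperparameters may differ:

```python
def avg_bits_per_param(base_bits=3, group_size=16, stat_bits=3,
                       outlier_fraction=0.01, outlier_bits=32):
    """Rough average storage cost per weight (illustrative, not the paper's formula).

    base_bits        : bits for each regular quantized weight
    stat_bits        : bits for each group's quantized scale and zero point
    outlier_fraction : share of weights kept in higher precision
    outlier_bits     : bits per stored outlier (value plus column index)
    """
    stats_overhead = 2 * stat_bits / group_size     # scale + zero point, amortized
    outlier_overhead = outlier_fraction * outlier_bits
    return base_bits + stats_overhead + outlier_overhead

print(avg_bits_per_param())             # ~3.7 bits/param with these defaults
print(avg_bits_per_param(base_bits=4))  # ~4.7 bits/param
```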

Practical Implications and Future Directions

The implications of SpQR are significant for deploying high-performance LLMs in mobile and other environments constrained by memory and processing power. Improved compression enables the use of capable models in contexts that were previously infeasible, opening opportunities for personalized, on-device AI applications that do not rely on cloud infrastructure.

From a theoretical perspective, SpQR exemplifies the potential of advanced quantization techniques to balance compression and accuracy, setting the stage for more sophisticated hybrid quantization schemes. Future work could focus on refining outlier identification, optimizing the integration of sparse and dense matrix operations on parallel computing architectures, and studying how compression affects generation quality, in order to broaden the applicability and robustness of compressed AI systems.

In summary, the SpQR method represents a significant step towards practical and efficient deployment of large-scale AI models in consumer and edge environments. By achieving compression with negligible accuracy losses, this work paves the way for further innovation in model deployment and AI accessibility.

Authors (9)
  1. Tim Dettmers (22 papers)
  2. Ruslan Svirschevski (6 papers)
  3. Vage Egiazarian (16 papers)
  4. Denis Kuznedelev (21 papers)
  5. Elias Frantar (24 papers)
  6. Saleh Ashkboos (20 papers)
  7. Alexander Borzunov (7 papers)
  8. Torsten Hoefler (203 papers)
  9. Dan Alistarh (133 papers)
Citations (177)