SDQ-LLM: Sigma-Delta Quantization for LLMs

Updated 17 October 2025
  • SDQ-LLM is a quantization framework for LLMs that employs sigma-delta methods with dynamic oversampling and Hadamard smoothing to preserve linguistic and reasoning capabilities.
  • It achieves ultra-low-bit representations (1-bit and 1.58-bit) while significantly reducing compute and memory demands, demonstrated on models like OPT and LLaMA.
  • The approach converts multiplication operations to additions, enabling efficient inference and flexible, per-layer precision allocation for resource-constrained environments.

Sigma-Delta Quantization for LLMs (SDQ-LLM) is a quantization framework designed to achieve extremely low-bit (down to 1-bit and 1.58-bit) representations of transformer models with robust preservation of linguistic and reasoning capabilities. It combines sigma-delta quantization, a continuous and adjustable oversampling ratio (OSR), Hadamard-based weight smoothing, and multi-granular OSR allocation, providing efficient inference and flexible adaptation to hardware or memory constraints. SDQ-LLM enables high compression ratios while mitigating quantization-induced accuracy loss, advancing the practicality of deploying massive LLMs on memory-limited and resource-constrained devices (Xia et al., 27 Sep 2025).

1. Sigma-Delta Quantization Methodology

Central to SDQ-LLM is the adaptation of first-order sigma-delta quantization (SDQ), inspired by oversampling and noise-shaping in analog-to-digital conversion, to compress transformer weight matrices:

  • Recursive Quantization Process: For each weight sequence $(x_n)$, the sigma-delta quantizer maintains an integrator variable $i_n$ updated as

$$ i_n = i_{n-1} + x_n - y_{n-1} $$

where $y_n$ is the quantized output, obtained by applying a quantization operator $Q(\cdot)$ (either binarization or ternarization):

$$ y_n = Q(i_n) $$

  • Noise Shaping Principle: In the z-domain, the process yields

$$ Y(z) = X(z) + (1 - z^{-1})E(z) $$

meaning the quantization noise $E(z)$ is shifted to higher frequencies. This high-pass filtering of quantization error reduces its effect on model performance.

  • Low-Bit Representations: SDQ-LLM supports both 1-bit quantization (binarization: $W_b = \alpha \cdot \operatorname{sign}(W)$) and 1.58-bit quantization (ternarization: $W_t \in \{-1, 0, +1\}$), using oversampling and sigma-delta encoding to preserve information.

By converting matrix multiplications in transformer blocks to addition-based operations (as the quantized weights are either $+1$, $-1$, or $0$), SDQ-LLM significantly reduces compute demands for inference.
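
As a concrete illustration of the recursion above, the following is a minimal NumPy sketch of first-order sigma-delta quantization applied to one weight row, with oversampling by simple repetition. Function and parameter names (sigma_delta_quantize, osr, ternary) and the ternary threshold of 0.5 are illustrative assumptions, not taken from the paper's released code, which also supports fractional OSR values.

```python
import numpy as np

def sigma_delta_quantize(w_row, osr=2, ternary=True):
    """First-order sigma-delta quantization of one weight row (illustrative sketch).

    w_row   : 1-D array of real-valued weights.
    osr     : oversampling ratio; each weight is repeated osr times before
              encoding (integer here for simplicity).
    ternary : if True, quantize to {-1, 0, +1} (1.58-bit); otherwise
              binarize to {-1, +1} (1-bit).
    """
    # Oversample by repetition and scale into a unit range.
    alpha = np.abs(w_row).max() + 1e-12
    x = np.repeat(w_row / alpha, osr)

    y = np.zeros_like(x)
    integ = 0.0
    prev_y = 0.0
    for n in range(len(x)):
        # Integrator update: i_n = i_{n-1} + x_n - y_{n-1}
        integ = integ + x[n] - prev_y
        # Quantization operator Q(.); the 0.5 dead-zone is an illustrative choice.
        if ternary:
            y[n] = 0.0 if abs(integ) < 0.5 else np.sign(integ)
        else:
            y[n] = 1.0 if integ >= 0 else -1.0
        prev_y = y[n]
    return y.astype(np.int8), alpha  # low-bit codes plus the per-row scale

def reconstruct(codes, alpha, osr=2):
    """Decode by averaging each group of osr codes (a simple low-pass filter)."""
    return alpha * codes.reshape(-1, osr).mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=64)
    codes, alpha = sigma_delta_quantize(w, osr=4, ternary=True)
    w_hat = reconstruct(codes, alpha, osr=4)
    print("mean abs error:", np.abs(w - w_hat).mean())
```

Because the decoded weights are only averages of ±1/0 codes, increasing osr trades extra storage for a closer match to the original weights, which is exactly the precision/compression trade-off discussed in the next section.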

2. Continuous and Fine-Grained Over-Sampling Ratio (OSR)

A distinctive contribution of SDQ-LLM is the continuous, layer- and module-wise adjustable OSR, determining the upsampling factor during quantization:

  • Precision/Compression Trade-off: Higher OSR (e.g., 2.5×) increases effective information capture, reducing quantization error but incurring greater storage cost. The compression rate is given by

$$ \eta = \frac{N \cdot \mathrm{OSR}}{16} $$

where $N$ is the number of quantization bits per weight.

  • Fractional OSR Values: Unlike prior frameworks (which use fixed OSRs), SDQ-LLM allows dynamic selection of non-integer OSRs, supporting fine-grained, hardware-aware adaptation.
  • MultiOSR Allocation: Recognizing that quantization sensitivity relates to per-layer/per-module weight variance, SDQ-LLM's MultiOSR distributes OSR not only across layers but also across linear submodules within layers, ensuring that modules with low weight variance (higher information density) receive higher precision allocations.

This design achieves optimal balance between model size, inference throughput, and accuracy, with direct configuration to accommodate VRAM or memory budgets.
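
The short Python sketch below works through the compression-rate arithmetic and a toy variance-guided per-module OSR assignment in the spirit of MultiOSR. The allocation rule shown (ranking modules by weight variance and interpolating around a mean OSR budget) is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np

def compression_rate(n_bits, osr):
    """eta = (N * OSR) / 16 : fraction of the original FP16 footprint."""
    return (n_bits * osr) / 16.0

# Example: 1.58-bit ternary codes at OSR = 2 keep ~19.75% of the FP16 size,
# i.e. roughly a 5x reduction by this formula alone.
print(compression_rate(1.58, 2.0))

def allocate_osr(module_weights, mean_osr=2.0, spread=0.5):
    """Toy MultiOSR-style allocation: modules with LOWER weight variance
    (denser information, per the paper's observation) receive HIGHER OSR.
    The linear interpolation between mean_osr +/- spread is illustrative."""
    names = list(module_weights)
    variances = np.array([module_weights[n].var() for n in names])
    order = variances.argsort()[::-1]            # highest variance first
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.linspace(0.0, 1.0, len(names))
    osrs = mean_osr - spread + 2.0 * spread * ranks  # low variance -> high OSR
    return dict(zip(names, osrs.round(2)))

rng = np.random.default_rng(0)
modules = {
    "q_proj": rng.normal(scale=0.02, size=(256, 256)),
    "k_proj": rng.normal(scale=0.05, size=(256, 256)),
    "mlp_up": rng.normal(scale=0.10, size=(256, 256)),
}
print(allocate_osr(modules))  # fractional, per-module OSR values
```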

3. Hadamard-Based Weight Smoothing

Aggressive quantization is susceptible to accuracy loss from outliers and nonuniform weight distributions. SDQ-LLM introduces a Hadamard-based smoothing operation prior to quantization:

  • Smoothing Mechanism: The Hadamard transform is applied to the weight tensor, decorrelating and distributing outlier values and reducing local variance spikes.
  • Effectiveness: By flattening out weight statistics, subsequent sigma-delta quantization operates on more regularized data, effectively reducing quantization-induced instability and further curtailing accuracy loss, particularly in the presence of weight outliers.

This preprocessing step is jointly optimized with the quantization pipeline, and both the Hadamard transform and its inverse are implemented efficiently for deployment.
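
As a rough sketch of the idea (not the paper's implementation), the snippet below rotates a weight matrix with an orthonormal Hadamard transform before quantization and inverts the rotation afterwards. It uses scipy.linalg.hadamard and assumes the hidden dimension is a power of two; padding and the corresponding activation-side handling are omitted.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_smooth(W):
    """Rotate the columns of W by an orthonormal Hadamard matrix.

    The rotation spreads outlier weights across many coordinates and flattens
    per-column statistics, so the subsequent sigma-delta quantizer sees a more
    uniform distribution. Requires W.shape[1] to be a power of two.
    """
    d = W.shape[1]
    H = hadamard(d).astype(W.dtype) / np.sqrt(d)  # orthonormal: H @ H.T = I
    return W @ H, H

def hadamard_unsmooth(W_rot, H):
    """Invert the rotation; for an orthonormal H the inverse is H.T."""
    return W_rot @ H.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.02, size=(128, 256))
    W[0, 0] = 5.0                                 # inject an outlier
    W_rot, H = hadamard_smooth(W)
    print("max |w| before:", np.abs(W).max(), "after:", np.abs(W_rot).max())
    W_back = hadamard_unsmooth(W_rot, H)
    print("round-trip error:", np.abs(W - W_back).max())
```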

4. Experimental Results and Quantization Efficacy

SDQ-LLM is benchmarked on the OPT (1.3B–13B) and LLaMA (LLaMA2-7B, LLaMA3-8B) model families:

  • Perplexity and Downstream Accuracy: SDQ-LLM with OSR=2 (1.58-bit ternary quantization) achieves lower or comparable perplexity on WikiText2 relative to RTN, GPTQ, PB-LLM, and BiLLM at equivalent or lower memory budgets.
  • Zero-shot Task Performance: Downstream evaluation on ARC, BoolQ, PIQA, and similar benchmarks confirms that SDQ-LLM maintains robust zero-shot transfer and reasoning, with negligible accuracy degradation even under highly aggressive quantization.
  • Quantization Speed: The pipeline demonstrates reduced quantization wall-clock time versus conventional post-training quantization methods, as the sigma-delta process is amenable to parallel, hardware-friendly implementation.

A summary table from the paper (adapted):

| Model     | OSR | Quantization Bits | Perplexity (WikiText2) | Memory Reduction |
|-----------|-----|-------------------|------------------------|------------------|
| OPT-6.7B  | 2   | 1.58              | lower than RTN/GPTQ    | >8×              |
| LLaMA2-7B | 2   | 1.58              | comparable or lower    | >8×              |

5. Practical Implications and Deployment

SDQ-LLM's ability to quantize LLMs to 1 or 1.58 bits per weight has several practical consequences:

  • Resource-Aware Deployment: It enables large models to be served on memory-constrained devices (e.g., previous-generation GPUs, edge devices, and mobile hardware) previously incapable of hosting multi-billion-parameter transformers.
  • Inference Efficiency: The conversion of multiplies to adds in forward passes reduces both energy and latency, supporting high-throughput real-time applications.
  • Adaptability: Developers can select and tune OSR globally or schedule per-module adjustments to fit operational constraints, without retraining from scratch or requiring extensive search.
  • Checkpoint Compatibility: Since SDQ-LLM operates post-training, it can be applied directly to existing LLM checkpoints.
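
To make the multiply-to-add point concrete, here is a small illustrative sketch (not the paper's inference kernel) of a matrix-vector product with ternary weight codes: each output element is computed purely by signed additions of activations, followed by a single scale.

```python
import numpy as np

def ternary_matvec(codes, scale, x):
    """y = scale * (codes @ x), computed with additions only.

    codes : int8 matrix with entries in {-1, 0, +1} (decoded weight signs)
    scale : scale alpha recovered at load time
    x     : activation vector
    Each output element adds the activations where the code is +1 and
    subtracts those where it is -1; no weight-activation multiplications.
    """
    pos = codes == 1
    neg = codes == -1
    y = np.empty(codes.shape[0], dtype=x.dtype)
    for r in range(codes.shape[0]):
        y[r] = x[pos[r]].sum() - x[neg[r]].sum()
    return scale * y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = rng.integers(-1, 2, size=(8, 16)).astype(np.int8)
    x = rng.normal(size=16).astype(np.float32)
    ref = 0.05 * (codes.astype(np.float32) @ x)
    out = ternary_matvec(codes, 0.05, x)
    print(np.allclose(out, ref))  # True: the addition-only path matches the reference
```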

6. Code Availability and Reproducibility

All implementations, including the sigma-delta quantizer, Hadamard transform utilities, and MultiOSR allocation routines, are publicly available.

7. Directions for Further Research

Several avenues are highlighted as ongoing or future work:

  • Layer-Aware Outlier Extraction: Improving the metric for outlier selection beyond magnitude or variance—potentially integrating curvature or activation-based metrics.
  • Optimized Quantization Schemes: Exploration of different low-bit formats (including fp8-e4m3 and others) to further minimize error under tight memory budgets.
  • Hardware Validation: Simulation and empirical deployment on emerging hardware such as sparse tensor cores and dedicated LLM accelerators, with hardware-in-the-loop integration for continuous performance benchmarking.
  • Generalization to Other Architectures: Application and refinement for non-transformer models or variants with nonstandard attention/MLP blocks, including those found in multimodal and retrieval-augmented LLMs.

SDQ-LLM establishes a new state of the art for aggressive quantization of large transformer models, demonstrating that with algorithmic innovations in noise shaping, smoothing, and flexible resource allocation, 1-bit and ternary LLM deployments can be realized with competitive accuracy and massive resource savings.

References

  1. Xia et al., 27 Sep 2025.