
Sigma–Delta Quantization (SDQ-LLM)

Updated 17 May 2026
  • SDQ-LLM is a quantization framework that applies classical sigma–delta noise shaping, oversampling, and error-feedback to convert high-precision LLM weights into 1-bit or ternary representations.
  • It incorporates Hadamard-based weight smoothing and MultiOSR allocation to minimize quantization error while balancing model compression and accuracy.
  • Reported results show lower perplexity than 2-bit RTN, GPTQ, and BiLLM, with post-training quantization time below PB-LLM and BiLLM and modestly above GPTQ.

Sigma–Delta Quantization (SDQ-LLM) is a quantization framework that extends classical Sigma–Delta (Σ–Δ) noise-shaping principles from analog-to-digital conversion and compressed sensing to the low-bit quantization of LLMs. SDQ-LLM leverages oversampling, error-feedback noise shaping, and structured resampling to encode high-precision neural network weights into ultra-low-bit formats—typically 1-bit (binarized) or ternary. The method enables a controlled trade-off between model compression and inference accuracy through a tunable “Over-Sampling Ratio” (OSR), and incorporates additional techniques such as Hadamard weight smoothing and variance-aware multi-level OSR allocation. Theoretical underpinnings build on robust recovery guarantees for noise-shaped quantization in random frames and compressed sensing, while recent extensions demonstrate practical scaling and empirical performance in state-of-the-art transformer-based LLMs.

1. Sigma–Delta Quantization: Fundamentals and Contrast with Uniform Approaches

Uniform (memoryless) quantization methods such as Round-To-Nearest (RTN) independently map each weight to the nearest codebook entry, producing uncorrelated quantization errors across model parameters. In contrast, Sigma–Delta quantization introduces memory and feedback, shaping quantization noise to higher frequencies where it can later be suppressed by subsequent averaging or low-pass operations.

For a sequence $x_n$, the core first-order Σ–Δ quantizer maintains an internal state $i_n$, yielding outputs

$$i_n = i_{n-1} + x_n - y_{n-1}, \quad y_n = Q(i_n)$$

with $Q$ a low-bit quantizer. In the $z$-domain, writing the quantizer output as $y_n = i_n + e_n$ with quantization error $e_n$, this can be analyzed as

$$Y(z) = X(z) + (1 - z^{-1})\,E(z)$$

which shows that the quantization noise $E(z)$ is high-pass filtered by $(1 - z^{-1})$, concentrating error power at (discrete) high frequencies. When weights are later applied in neural network computations, and these computations are robust to high-frequency weight perturbations, this results in reduced in-band quantization error relative to uniform quantization (Xia et al., 27 Sep 2025).
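The recursion above can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the SDQ-LLM implementation: the quantizer $Q$ is taken to be the sign function onto $\{-1, +1\}$, the initial state and output are set to zero, the input is a synthetic sequence clipped to $[-1, 1]$, and the helper names (`sigma_delta_1bit`, `round_to_nearest_1bit`, `low_band_fraction`) are hypothetical.

```python
import numpy as np

def sigma_delta_1bit(x):
    """First-order loop: i_n = i_{n-1} + x_n - y_{n-1}, y_n = Q(i_n), Q = sign."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    state, prev_out = 0.0, 0.0  # assumed initial conditions (not from the source)
    for n, xn in enumerate(x):
        state = state + xn - prev_out             # integrate input minus fed-back output
        prev_out = 1.0 if state >= 0.0 else -1.0  # 1-bit quantizer onto {-1, +1}
        y[n] = prev_out
    return y

def round_to_nearest_1bit(x):
    """Memoryless baseline: each sample is mapped to the nearer of {-1, +1}."""
    return np.where(np.asarray(x, dtype=float) >= 0.0, 1.0, -1.0)

def low_band_fraction(err, band=32):
    """Fraction of the error energy that falls in the lowest 1/band of the spectrum."""
    power = np.abs(np.fft.rfft(err)) ** 2
    return power[: len(power) // band].sum() / power.sum()

rng = np.random.default_rng(0)
x = np.clip(0.3 + 0.2 * rng.standard_normal(4096), -1.0, 1.0)  # toy input in [-1, 1]

err_sd = x - sigma_delta_1bit(x)
err_rtn = x - round_to_nearest_1bit(x)

# Running sum of the error: stays bounded with feedback, drifts without it.
max_drift_sd = np.abs(np.cumsum(err_sd)).max()
max_drift_rtn = np.abs(np.cumsum(err_rtn)).max()
print("max |running error sum|:", max_drift_sd, "(sigma-delta) vs", max_drift_rtn, "(RTN)")
print("low-band error fraction:", low_band_fraction(err_sd), "vs", low_band_fraction(err_rtn))
```

For inputs bounded by 1 in magnitude, the running sum of the Σ–Δ error equals $i_n - y_n$ and therefore never exceeds 1 in magnitude, which is the time-domain counterpart of the $(1 - z^{-1})$ factor: the error carries almost no low-frequency energy.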

2. Mathematical Formulation of SDQ-LLM Workflow

SDQ-LLM applies Sigma–Delta quantization in the following structured pipeline:

  1. Weight Block Preprocessing: Each row of a weight matrix is treated as a digital signal.
  2. Resampling (Upsampling): Each row is “resampled” (e.g., FFT zero-padding) from size $d_{\rm col}$ to $n\,d_{\rm col}$ ($n =$ OSR), spreading the information and providing degrees of freedom for noise shaping.
  3. Σ–Δ Loop: The oversampled sequence is quantized via a Σ–Δ loop, producing a binarized or ternarized quantized block.
  4. Downsampling: The quantized block is returned to the original dimension by frequency-domain truncation or averaging.
  5. Error-Feedback Correction: Block-wise error-feedback, similar to that used in GPTQ, is applied: the quantization error for each block is computed and propagated to future blocks using inverse Hessian-based correction (Xia et al., 27 Sep 2025).

During inference, activations are upsampled via FFT or interpolation to match the oversampling factor, ensuring correct linear-compute alignment.
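Steps 2 to 4 of the pipeline can be sketched on a single weight row as follows. This is a hedged illustration only: it assumes FFT zero-padding for upsampling, frequency-domain truncation for downsampling, a 1-bit sign quantizer, and a per-row max-abs scale to keep the loop input in $[-1, 1]$. The GPTQ-style error-feedback of step 5 and the inference-time activation upsampling are omitted, and all function names are hypothetical.

```python
import numpy as np

def fft_resample(row, new_len):
    """Resample a real 1-D signal by zero-padding or truncating its rFFT spectrum."""
    old_len = row.shape[0]
    spec = np.fft.rfft(row)
    new_spec = np.zeros(new_len // 2 + 1, dtype=complex)
    keep = min(new_spec.shape[0], spec.shape[0])
    new_spec[:keep] = spec[:keep]
    # Rescale so that sample amplitudes are preserved across lengths.
    return np.fft.irfft(new_spec, n=new_len) * (new_len / old_len)

def sigma_delta_1bit(x):
    """First-order loop: i_n = i_{n-1} + x_n - y_{n-1}, y_n = sign(i_n)."""
    y = np.empty_like(x)
    state, prev_out = 0.0, 0.0  # assumed zero initial conditions
    for n, xn in enumerate(x):
        state = state + xn - prev_out
        prev_out = 1.0 if state >= 0.0 else -1.0
        y[n] = prev_out
    return y

def sdq_row(row, osr):
    """Upsample by `osr` (may be fractional), quantize to 1 bit, downsample back."""
    d_col = row.shape[0]
    up = fft_resample(row, int(round(osr * d_col)))   # step 2: resampling
    scale = float(np.max(np.abs(up))) or 1.0          # assumed per-row scale
    bits = sigma_delta_1bit(up / scale)               # step 3: sigma-delta loop
    recon = fft_resample(scale * bits, d_col)         # step 4: downsampling
    return bits, recon

rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 256))  # synthetic stand-in for a weight matrix

errors = {}
for osr in (1.5, 2.0, 4.0, 8.0):
    recon = np.stack([sdq_row(row, osr)[1] for row in weights])
    errors[osr] = np.linalg.norm(weights - recon) / np.linalg.norm(weights)
print(errors)  # relative reconstruction error per OSR
```

In this toy setting the relative error falls as the OSR grows, because truncating the spectrum discards a larger share of the high-frequency shaped noise. The absolute numbers say nothing about LLM perplexity.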

3. Over-Sampling Ratio (OSR): Continuous Trade-Off and Compression Metrics

Unlike analog Σ–Δ ADCs, which restrict OSR to integral values, SDQ-LLM permits fractional OSR to finely adjust compression–accuracy trade-offs: […] where […] is the bit-width of the quantizer ([…] for ternary). For example, OSR=1.5 with ternary quantization achieves a compression ratio of approximately […] relative to 16-bit float weights. Empirical results demonstrate a smooth, concave relation between perplexity and OSR, enabling dynamic adaptation to hardware or memory constraints without re-training (Xia et al., 27 Sep 2025).

4. Noise Shaping, Hadamard Weight Smoothing, and MultiOSR

Extremely low-bit quantization can suffer from large errors due to heavy-tailed weight distributions in LLMs. SDQ-LLM incorporates:

  • Hadamard-Based Weight Smoothing: Applying a randomized Hadamard transform to each weight block flattens outlier weights, compresses energy to low/mid frequencies, and makes the overall weight distribution more amenable to high-frequency quantization noise shaping. After quantization, the inverse transform restores original basis alignment. Empirical ablations show that Hadamard smoothing is critical for usable accuracy at low OSR, e.g., reducing PPL from […] to […] in LLaMA3-8B models at OSR=2 (Xia et al., 27 Sep 2025).
  • MultiOSR (Layer- and Linear-Wise OSR Allocation): By analyzing weight variance at both layer and submodule granularity, OSR is distributed preferentially to blocks with lower variance and/or larger parameter size, which are more sensitive to quantization noise. Exact allocation is conducted by:

[…]

and analogous formulas within each submodule, with normalization to maintain a global OSR budget. This fine-grained allocation further reduces perplexity when combined with smoothing.
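The outlier-flattening effect of the Hadamard-based smoothing described above can be sketched with NumPy. This is a minimal illustration under assumptions: the randomized transform is taken to be a Sylvester Hadamard matrix combined with random sign flips and $1/\sqrt{n}$ normalization, the block is synthetic Gaussian data with one injected outlier per row, and the function names are hypothetical. The exact variant used by SDQ-LLM is not specified here.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def randomized_hadamard(block, signs):
    """Apply x -> H (signs * x) / sqrt(n) to each row of `block`."""
    n = block.shape[-1]
    return (block * signs) @ hadamard(n).T / np.sqrt(n)

def inverse_randomized_hadamard(smoothed, signs):
    """Undo randomized_hadamard; H / sqrt(n) is orthogonal and signs square to 1."""
    n = smoothed.shape[-1]
    return (smoothed @ hadamard(n) / np.sqrt(n)) * signs

rng = np.random.default_rng(0)
block = rng.standard_normal((4, 256))      # synthetic weight block
block[:, 7] = 30.0                         # inject one large outlier per row
signs = rng.choice([-1.0, 1.0], size=256)  # random diagonal sign flips

smoothed = randomized_hadamard(block, signs)
restored = inverse_randomized_hadamard(smoothed, signs)

peak_before = np.abs(block).max()
peak_after = np.abs(smoothed).max()
print("peak magnitude before/after:", peak_before, peak_after)
print("energy preserved:", np.isclose(np.linalg.norm(block), np.linalg.norm(smoothed)))
```

Because the transform is orthogonal, the block energy is unchanged while each outlier is spread across all 256 coordinates. A max-abs quantizer scale therefore shrinks, which is one plausible reading of why smoothing helps at low OSR.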

5. Theoretical Guarantees for Sigma–Delta Quantization: Compressed Sensing Foundations

Classical Σ–Δ quantization in random frames and compressed sensing exploits oversampling to achieve error decay rates unattainable by memoryless quantization:

  • For $r$th-order Σ–Δ in an $m \times N$ random Gaussian compressed sensing matrix, Sobolev-dual frame reconstruction achieves:

$$\|x - \hat{x}\|_2 \lesssim \delta \left(\frac{k}{m}\right)^{\alpha (r - 1/2)}$$

for any $\alpha \in (0, 1)$, provided $m \gtrsim k\,(\log N)^{1/(1-\alpha)}$ (Güntürk et al., 2010). Here, $\delta$ is the quantizer step size, $k$ is the signal sparsity, and $m$ is the number of measurements.

  • In the PTQ regime, polynomial and even root-exponential decay of error in the oversampling ratio are achievable, substantially outperforming the […] “flat” error plateau of memoryless schemes (Saab et al., 2015).

A plausible implication is that such noise-shaping guarantees provide the theoretical justification for the observed empirical improvements in SDQ-LLM when compared to uniform low-bit quantization in high-dimensional neural settings.
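As a simple illustration of why oversampling helps, the following worked bound uses only the first-order recursion from Section 1, not the compressed-sensing results cited above. It assumes a 1-bit quantizer $Q = \operatorname{sign}$ with outputs in $\{-1, +1\}$, inputs with $|x_n| \le 1$, and an initial condition satisfying $|i_0 - y_0| \le 1$. The auxiliary variable $u_n$ is introduced here for the derivation.

```latex
\begin{align*}
  u_n &:= i_n - y_n
      && \text{(quantizer residual)} \\
  i_n = i_{n-1} + x_n - y_{n-1}
      &\;\Longrightarrow\; x_n - y_n = u_n - u_{n-1}
      && \text{(error is a first difference)} \\
  |u_{n-1}| \le 1,\ |x_n| \le 1
      &\;\Longrightarrow\; |i_n| = |u_{n-1} + x_n| \le 2
      \;\Longrightarrow\; |u_n| = |i_n - \operatorname{sign}(i_n)| \le 1
      && \text{(stability)} \\
  \left| \frac{1}{N} \sum_{n=1}^{N} (x_n - y_n) \right|
      &= \frac{|u_N - u_0|}{N} \le \frac{2}{N}
      && \text{(telescoping sum)}
\end{align*}
```

The averaged error therefore decays like $1/N$ in the number of samples. A memoryless 1-bit quantizer applied to a constant input $x_n = c$ with $0 < |c| < 1$ outputs $\operatorname{sign}(c)$ at every step, so its averaged error stays at $1 - |c|$ regardless of $N$.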

6. Empirical Performance and Implementation Notes

Experimental results confirm the practical efficacy of SDQ-LLM:

  • On WikiText2, applying SDQ-LLM to OPT-1.3B with OSR=2 and ternary quantization yields PPL […], far outperforming 2-bit RTN (catastrophic) and GPTQ ([…]), and achieving […] lower PPL than BiLLM, despite lower average bitwidth (Xia et al., 27 Sep 2025).
  • On zero-shot downstream tasks using OPT and LLaMA families (1–13B), SDQ-LLM recovers […]–[…] of full-precision accuracy with OSR=2, while outperforming 2-bit GPTQ, PB-LLM, and BiLLM.
  • Quantization time for SDQ-LLM's PTQ on OPT-13B is […] s, faster than PB-LLM or BiLLM and only modestly slower than GPTQ.
  • Multiplications in quantized matmuls reduce to additions and bit-packing operations, affording direct computational savings on suitable hardware.
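The last point can be made concrete with a small sketch of a ternary matrix-vector product that uses only additions and subtractions, together with a simple bit-packed storage layout. The layout (two bit-planes, one for $+1$ entries and one for $-1$ entries) and the function names are assumptions for illustration; it spends 2 bits per weight and is not claimed to be the format used by SDQ-LLM.

```python
import numpy as np

def ternary_matvec_additions(w_ternary, x):
    """Compute w_ternary @ x using only additions and subtractions of entries of x."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for r, row in enumerate(w_ternary):
        # +1 weights add the activation, -1 weights subtract it, 0 weights skip it.
        out[r] = x[row == 1].sum() - x[row == -1].sum()
    return out

def pack_ternary(w_ternary):
    """Pack a {-1, 0, +1} matrix into two uint8 bit-planes (plus mask, minus mask)."""
    plus = np.packbits(w_ternary == 1, axis=1)
    minus = np.packbits(w_ternary == -1, axis=1)
    return plus, minus

def unpack_ternary(plus, minus, n_cols):
    """Recover the {-1, 0, +1} matrix from the two bit-planes."""
    p = np.unpackbits(plus, axis=1, count=n_cols).astype(np.int8)
    m = np.unpackbits(minus, axis=1, count=n_cols).astype(np.int8)
    return p - m

rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=(16, 64)).astype(np.int8)  # synthetic ternary weights
x = rng.standard_normal(64)                             # synthetic activations

y_add = ternary_matvec_additions(w, x)
plus, minus = pack_ternary(w)
w_roundtrip = unpack_ternary(plus, minus, w.shape[1])
print(np.allclose(y_add, w @ x), plus.shape, np.array_equal(w_roundtrip, w))
```

The Python loop is for clarity only. Actual savings depend on hardware kernels that operate directly on the packed masks.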

7. Limitations and Future Directions

SDQ-LLM, as presently constructed, is most efficient for OSR […]; lower OSR values still yield high error as measured by PPL. Potential advancements include higher-order Σ–Δ architectures and learned, task-specific OSR schedules (potentially via QAT). There is significant scope for custom hardware implementations that fully leverage the binary/ternary nature and noise-shaped properties of SDQ-LLM quantized models (Xia et al., 27 Sep 2025).

Table: Summary of Key SDQ-LLM Features and Empirical Results

| Feature | Description / Metric | Source |
|---|---|---|
| Lowest bit-width achieved | 1 bit (binary), 1.58 bit (ternary) | (Xia et al., 27 Sep 2025) |
| OSR control | Continuous, real-valued | (Xia et al., 27 Sep 2025) |
| Weight smoothing method | Hadamard blockwise | (Xia et al., 27 Sep 2025) |
| Sensitivity allocation | MultiOSR (variance + size) | (Xia et al., 27 Sep 2025) |
| Polynomial error decay | Error $\lesssim \delta\,(k/m)^{\alpha(r-1/2)}$ for $r$th-order Σ–Δ ($k$-sparse signal, $m$ measurements, step size $\delta$) | (Güntürk et al., 2010) |
| LLM PPL improvement | […] over BiLLM at same bits | (Xia et al., 27 Sep 2025) |

In conclusion, SDQ-LLM generalizes and adapts Sigma–Delta quantization to the requirements of neural network quantization, combining mathematically principled noise-shaping, efficient blockwise implementation, and variance-aware adaptability to achieve state-of-the-art ultra-low-bit compression for large-scale LLMs.
