Sigma–Delta Quantization (SDQ-LLM)

Updated 17 May 2026

SDQ-LLM is a quantization framework that applies classical sigma–delta noise shaping, oversampling, and error-feedback to convert high-precision LLM weights into 1-bit or ternary representations.
It incorporates Hadamard-based weight smoothing and MultiOSR allocation to minimize quantization error while balancing model compression and accuracy.
Reported results show lower perplexity than 2-bit RTN, GPTQ, and BiLLM, and quantization that is faster than PB-LLM and BiLLM.

Sigma–Delta Quantization (SDQ-LLM) is a quantization framework that extends classical Sigma–Delta (Σ–Δ) noise-shaping principles from analog-to-digital conversion and compressed sensing to the low-bit quantization of LLMs. SDQ-LLM leverages oversampling, error-feedback noise shaping, and structured resampling to encode high-precision neural network weights into ultra-low-bit formats—typically 1-bit (binarized) or ternary. The method enables a controlled trade-off between model compression and inference accuracy through a tunable “Over-Sampling Ratio” (OSR), and incorporates additional techniques such as Hadamard weight smoothing and variance-aware multi-level OSR allocation. Theoretical underpinnings build on robust recovery guarantees for noise-shaped quantization in random frames and compressed sensing, while recent extensions demonstrate practical scaling and empirical performance in state-of-the-art transformer-based LLMs.

1. Sigma–Delta Quantization: Fundamentals and Contrast with Uniform Approaches

Uniform (memoryless) quantization methods such as Round-To-Nearest (RTN) independently map each weight to the nearest codebook entry, producing uncorrelated quantization errors across model parameters. In contrast, Sigma–Delta quantization introduces memory and feedback, shaping quantization noise to higher frequencies where it can later be suppressed by subsequent averaging or low-pass operations.

For a sequence $x_n$, the core first-order Σ–Δ quantizer maintains an internal state $i_n$, yielding outputs

$i_n = i_{n-1} + x_n - y_{n-1}, \quad y_n = Q(i_n)$

with $Q$ a low-bit quantizer. Writing the quantization error as $e_n = y_n - i_n$, the loop can be analyzed in the $z$-domain as

$Y(z) = X(z) + (1 - z^{-1})\,E(z)$

The noise transfer function $1 - z^{-1}$ vanishes at DC ($z = 1$), so the quantization noise $E(z)$ is high-pass filtered and its power is concentrated at high (discrete) frequencies. When the weights are later applied in neural network computations that are robust to high-frequency weight perturbations, the in-band quantization error is lower than with uniform quantization (Xia et al., 27 Sep 2025).
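The recursion above can be sketched in a few lines of plain Python. This is a minimal illustration, not the SDQ-LLM implementation: the 1-bit sign quantizer, the zero initial state, and the function names are assumptions chosen for the example. It compares the accumulated error of the Σ–Δ loop with memoryless rounding on a constant input.

```python
def sign_quantizer(value):
    """Assumed 1-bit quantizer Q: maps the loop state to +1 or -1."""
    return 1.0 if value >= 0 else -1.0


def sigma_delta_first_order(x, quantize=sign_quantizer):
    """Run i_n = i_{n-1} + x_n - y_{n-1}, y_n = Q(i_n), starting from i_0 = y_0 = 0."""
    state, prev_out = 0.0, 0.0
    outputs = []
    for sample in x:
        state = state + sample - prev_out  # integrate input minus fed-back output
        prev_out = quantize(state)
        outputs.append(prev_out)
    return outputs


x = [0.3] * 64                           # constant input with |x_n| <= 1
y_sd = sigma_delta_first_order(x)        # noise-shaped 1-bit sequence
y_rtn = [sign_quantizer(v) for v in x]   # memoryless rounding to {-1, +1}

err_sd = abs(sum(x) - sum(y_sd))         # accumulated error with feedback
err_rtn = abs(sum(x) - sum(y_rtn))       # accumulated error without feedback
print(err_sd, err_rtn)
```

For inputs bounded by 1 in magnitude, the difference $i_n - y_n$ stays within $[-1, 1]$, so the accumulated error of the Σ–Δ output never exceeds 1, whereas the memoryless error grows linearly with the sequence length. This is the time-domain counterpart of the high-pass noise shaping described above.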

2. Mathematical Formulation of SDQ-LLM Workflow

SDQ-LLM applies Sigma–Delta quantization in the following structured pipeline:

Weight Block Preprocessing: Each row of a weight matrix is treated as a digital signal.
Resampling (Upsampling): Each row is resampled (e.g., by FFT zero-padding) from length $d_{\rm col}$ to $n\,d_{\rm col}$, where $n$ is the OSR, spreading the information and providing degrees of freedom for noise shaping.
Σ–Δ Loop: The oversampled sequence is quantized via a Σ–Δ loop, producing a binarized or ternarized quantized block.
Downsampling: The quantized block is returned to the original dimension by frequency-domain truncation or averaging.
Error-Feedback Correction: Block-wise error-feedback, similar to that used in GPTQ, is applied: the quantization error for each block is computed and propagated to future blocks using inverse Hessian-based correction (Xia et al., 27 Sep 2025).

During inference, activations are upsampled via FFT or interpolation to match the oversampling factor, ensuring correct linear-compute alignment.
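The resampling, Σ–Δ loop, and downsampling steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: upsampling uses sample repetition (zero-order hold) instead of FFT zero-padding, the OSR is an integer, downsampling is a plain group average, and the Hessian-based error-feedback step is omitted. All function names are hypothetical.

```python
def sign_quantizer(value):
    """Assumed 1-bit quantizer: +1 or -1."""
    return 1.0 if value >= 0 else -1.0


def sigma_delta(x):
    """First-order loop: i_n = i_{n-1} + x_n - y_{n-1}, y_n = Q(i_n)."""
    state, prev_out, outputs = 0.0, 0.0, []
    for sample in x:
        state = state + sample - prev_out
        prev_out = sign_quantizer(state)
        outputs.append(prev_out)
    return outputs


def quantize_row(row, osr):
    """Upsample a weight row by an integer OSR, binarize it, and average back down."""
    # Assumption: zero-order hold stands in for the FFT-based resampling.
    upsampled = [w for w in row for _ in range(osr)]
    bits = sigma_delta(upsampled)
    # Downsample: each original weight is approximated by the mean of its osr bits.
    recon = [sum(bits[k * osr:(k + 1) * osr]) / osr for k in range(len(row))]
    return bits, recon


row = [0.8, -0.35, 0.1, 0.6, -0.9]       # toy weight row with |w| <= 1
for osr in (2, 4, 8, 16):
    bits, recon = quantize_row(row, osr)
    worst = max(abs(w - r) for w, r in zip(row, recon))
    print(osr, len(bits), worst)
```

In this toy setting the worst-case reconstruction error is bounded by 2/OSR, which illustrates why additional oversampling buys accuracy at the cost of storing more low-bit symbols per original weight.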

3. Over-Sampling Ratio (OSR): Continuous Trade-Off and Compression Metrics

Unlike analog Σ–Δ ADCs, which restrict OSR to integer values, SDQ-LLM permits fractional OSR to finely adjust compression–accuracy trade-offs: […] where $b$ is the bit-width of the quantizer ($b \approx 1.58$ for ternary). For example, OSR=1.5 with ternary quantization achieves a compression ratio of approximately […] relative to 16-bit float weights. Empirical results demonstrate a smooth, concave relation between perplexity and OSR, enabling dynamic adaptation to hardware or memory constraints without re-training (Xia et al., 27 Sep 2025).
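The storage arithmetic behind this trade-off can be illustrated with a short calculation. The sketch below assumes that each original weight is stored as OSR quantized symbols of $\log_2(\text{levels})$ bits each, so the cost is OSR × bit-width bits per weight; this accounting and the function names are assumptions for illustration, and the source's own figures may be computed differently.

```python
import math


def bits_per_weight(osr, levels):
    """Assumed storage cost: osr symbols per weight, log2(levels) bits per symbol."""
    return osr * math.log2(levels)


def compression_ratio(osr, levels, baseline_bits=16):
    """Ratio of a baseline_bits float representation to the assumed quantized cost."""
    return baseline_bits / bits_per_weight(osr, levels)


for osr in (1.0, 1.5, 2.0, 2.5):
    print(osr,
          round(bits_per_weight(osr, 3), 3),    # ternary: log2(3) bits per symbol
          round(compression_ratio(osr, 3), 2))
```

Under this assumption, ternary quantization at OSR=1.5 costs roughly 2.4 bits per original weight, and binary quantization at OSR=2 costs exactly 2 bits, an 8× reduction from 16-bit floats.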

4. Noise Shaping, Hadamard Weight Smoothing, and MultiOSR

Extremely low-bit quantization can suffer from large errors due to heavy-tailed weight distributions in LLMs. SDQ-LLM incorporates:

Hadamard-Based Weight Smoothing: Applying a randomized Hadamard transform to each weight block flattens outlier weights, compresses energy to low/mid frequencies, and makes the overall weight distribution more amenable to high-frequency quantization noise shaping. After quantization, the inverse transform restores original basis alignment. Empirical ablations show that Hadamard smoothing is critical for usable accuracy at low OSR, e.g., reducing PPL from […] to […] in LLaMA3-8B models at OSR=2 (Xia et al., 27 Sep 2025).
MultiOSR (Layer- and Linear-Wise OSR Allocation): By analyzing weight variance at both layer and submodule granularity, OSR is distributed preferentially to blocks with lower variance and/or larger parameter size, which are more sensitive to quantization noise. The allocation is computed by:

[…]

and analogous formulas within each submodule, with normalization to maintain a global OSR budget. This fine-grained allocation further reduces perplexity when combined with smoothing.
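The outlier-flattening effect of the Hadamard smoothing step described above can be sketched with a fast Walsh-Hadamard transform. This is a generic illustration, not the SDQ-LLM code: the random sign flips, the orthonormal scaling, the power-of-two block length, and the function names are assumptions for the example.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def smooth(block, signs):
    """Randomized Hadamard transform: flip signs, then apply the transform."""
    return fwht([s * w for s, w in zip(signs, block)])


def unsmooth(coeffs, signs):
    """Inverse: the orthonormal transform is its own inverse, then undo the sign flips."""
    return [s * v for s, v in zip(signs, fwht(coeffs))]


rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(16)]
block = [8.0] + [0.1] * 15                 # one heavy outlier among small weights
smoothed = smooth(block, signs)
restored = unsmooth(smoothed, signs)
print(max(abs(w) for w in block), max(abs(c) for c in smoothed))
```

Because the transform is orthonormal, the block's energy is preserved while the single outlier is spread across all 16 coefficients, so the peak magnitude drops and a coarse quantizer has a narrower range to cover. The inverse step recovers the original basis after quantization.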

5. Theoretical Guarantees for Sigma–Delta Quantization: Compressed Sensing Foundations

Classical Σ–Δ quantization in random frames and compressed sensing exploits oversampling to achieve error decay rates unattainable by memoryless quantization:

For $r$th-order Σ–Δ quantization of measurements taken with an $m \times N$ random Gaussian compressed sensing matrix, Sobolev-dual frame reconstruction achieves:

$\|x - \hat{x}\|_2 \lesssim \delta \left(\frac{m}{k}\right)^{-\alpha (r - 1/2)}$

for any $0 < \alpha < 1$, provided $m \gtrsim k (\log N)^{1/(1-\alpha)}$ (Güntürk et al., 2010). Here, $\delta$ is the quantizer step size, $k$ is the signal sparsity, and $m$ is the number of measurements.

In the PTQ regime, polynomial and even root-exponential decay of error in the oversampling ratio are achievable, substantially outperforming the $O(\delta)$ “flat” error plateau of memoryless schemes (Saab et al., 2015).

A plausible implication is that such noise-shaping guarantees provide the theoretical justification for the observed empirical improvements in SDQ-LLM when compared to uniform low-bit quantization in high-dimensional neural settings.
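To give a feel for the polynomial decay, the snippet below evaluates the error-reduction factor $(m/k)^{-\alpha(r - 1/2)}$ from the Sobolev-dual result of Güntürk et al. (2010) for a few measurement counts and loop orders. The function name and the sample values of $m$, $k$, $r$, and $\alpha$ are arbitrary choices for illustration, and constants hidden in the bound are ignored.

```python
def sobolev_dual_error_factor(m, k, r, alpha):
    """Evaluate (m/k) ** (-alpha * (r - 1/2)) for measurements m, sparsity k, order r."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if m <= 0 or k <= 0 or r < 1:
        raise ValueError("m and k must be positive and r must be at least 1")
    return (m / k) ** (-alpha * (r - 0.5))


k, alpha = 10, 0.9                         # illustrative sparsity and exponent
for r in (1, 2, 3):                        # Sigma-Delta loop order
    factors = [sobolev_dual_error_factor(m, k, r, alpha) for m in (100, 1000, 10000)]
    print(r, [f"{f:.2e}" for f in factors])
```

The factor shrinks both with more measurements per sparse coefficient and with higher loop order, whereas a memoryless quantizer's error bound stays proportional to the step size regardless of oversampling.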

6. Empirical Performance and Implementation Notes

Experimental results confirm the practical efficacy of SDQ-LLM:

On WikiText2, applying SDQ-LLM to OPT-1.3B with OSR=2 and ternary quantization yields PPL […], far outperforming 2-bit RTN (catastrophic) and GPTQ ([…]), and achieving […] lower PPL than BiLLM, despite lower average bitwidth (Xia et al., 27 Sep 2025).
On zero-shot downstream tasks using OPT and LLaMA families (1–13B), SDQ-LLM recovers […]–[…] of full-precision accuracy with OSR=2, while outperforming 2-bit GPTQ, PB-LLM, and BiLLM.
Quantization time for SDQ-LLM's PTQ on OPT-13B is […] s, faster than PB-LLM or BiLLM and only modestly slower than GPTQ.
Multiplications in quantized matmuls reduce to additions and bit-packing operations, affording direct computational savings on suitable hardware.
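The last point can be illustrated with a ternary matrix-vector product that uses only additions and subtractions, plus a simple bit-packed variant. This is a didactic sketch: the two-bitmask packing layout and the function names are assumptions for the example, not the storage format used by SDQ-LLM, and a real kernel would operate on packed machine words rather than Python integers.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for W with entries in {-1, 0, +1} without any multiplications."""
    out = []
    for row in weights:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v          # +1 weight: add the activation
            elif w == -1:
                acc -= v          # -1 weight: subtract the activation
            # 0 weight: contributes nothing
        out.append(acc)
    return out


def pack_ternary(row):
    """Pack a ternary row into two bitmasks: positions of +1 and positions of -1."""
    pos = neg = 0
    for idx, w in enumerate(row):
        if w == 1:
            pos |= 1 << idx
        elif w == -1:
            neg |= 1 << idx
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
    return pos, neg


def packed_dot(pos, neg, x):
    """Dot product of a packed ternary row with x, again using only add/subtract."""
    acc = 0.0
    for idx, v in enumerate(x):
        if (pos >> idx) & 1:
            acc += v
        elif (neg >> idx) & 1:
            acc -= v
    return acc


W = [[1, 0, -1], [-1, -1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))
print([packed_dot(*pack_ternary(row), x) for row in W])
```

Both paths give the same result as an ordinary matrix-vector product, which is why ternary or binary weights map well onto hardware with cheap adders and bitwise operations.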

7. Limitations and Future Directions

SDQ-LLM, as presently constructed, is most efficient for OSR […]; lower OSR values still yield high error as measured by PPL. Potential advancements include higher-order Σ–Δ architectures and learned, task-specific OSR schedules (potentially via QAT). There is significant scope for custom hardware implementations that fully leverage the binary/ternary nature and noise-shaped properties of SDQ-LLM quantized models (Xia et al., 27 Sep 2025).

Table: Summary of Key SDQ-LLM Features and Empirical Results

Feature	Description / Metric	Source
Lowest bit-width achieved	1 bit (binary), 1.58 bit (ternary)	(Xia et al., 27 Sep 2025)
OSR control	Continuous, real-valued	(Xia et al., 27 Sep 2025)
Weight smoothing method	Hadamard blockwise	(Xia et al., 27 Sep 2025)
Sensitivity allocation	MultiOSR (variance + size)	(Xia et al., 27 Sep 2025)
Polynomial error decay	$(m/k)^{-\alpha(r-1/2)}$	(Güntürk et al., 2010)
LLM PPL improvement	[…] over BiLLM at same bits	(Xia et al., 27 Sep 2025)

In conclusion, SDQ-LLM generalizes and adapts Sigma–Delta quantization to the requirements of neural network quantization, combining mathematically principled noise-shaping, efficient blockwise implementation, and variance-aware adaptability to achieve state-of-the-art ultra-low-bit compression for large-scale LLMs.

References (3)

Xia et al. (2025). SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size.

Güntürk et al. (2010). Sobolev Duals for Random Frames and Sigma-Delta Quantization of Compressed Sensing Measurements.

Saab et al. (2015). Quantization of compressive samples with stable and robust recovery.
