Sigma-Delta Quantization for LLMs
- The paper introduces SDQ-LLM, a framework that applies sigma-delta quantization with spectral upsampling to enable efficient low-bit representation of neural network weights.
- It employs dynamic OSR scheduling and Hadamard-based weight smoothing to optimize memory usage and reduce high-frequency quantization noise.
- Empirical benchmarks demonstrate competitive perplexity and zero-shot accuracy while achieving faster quantization times and lower effective bitrates.
Sigma-Delta Quantization for LLMs (SDQ-LLM) is a framework designed to enable extremely low-bit quantization of LLMs, including binarization (1-bit) and ternarization (1.58-bit), while maintaining linguistic reasoning capacity and allowing flexible adaptation to resource constraints. This is achieved by integrating sigma-delta (Σ–Δ) quantization with spectral upsampling, dynamic oversampling ratio (OSR) allocation, Hadamard-based weight smoothing, and fine-grained OSR scheduling (MultiOSR). SDQ-LLM explicitly replaces linear layer multiplications with additions, enhancing inference efficiency under highly quantized regimes (Xia et al., 27 Sep 2025).
1. Sigma-Delta Quantization Theory
SDQ-LLM applies sigma-delta quantization to neural network weights by first spectrally upsampling each weight vector $w \in \mathbb{R}^{n}$ by a factor OSR, typically via zero-padding in the fast Fourier transform (FFT) domain. The upsampled sequence $x_i$, for $i = 1, \dots, \lceil n \cdot \mathrm{OSR} \rceil$, is processed by a first-order sigma-delta loop:

$$u_i = u_{i-1} + x_i - q_{i-1}, \qquad q_i = Q(u_i),$$

where $Q(\cdot)$ denotes the quantizer. For 1-bit quantization, $Q(u) = \operatorname{sign}(u) \in \{-1, +1\}$, and for ternary, $Q(u) \in \{-1, 0, +1\}$. The sequence $q_i$ is then decimated, via inverse FFT-based downsampling, to produce the quantized weight vector $\hat{w}$.
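The upsample, sigma-delta, decimate pipeline can be sketched in NumPy. The function names, the nearest-level 1-bit quantizer, and the toy OSR of 4 below are illustrative choices, not the paper's implementation:

```python
import numpy as np

def fft_upsample(w, osr):
    """Spectral upsampling: zero-pad the rFFT of w out to round(len(w)*osr) samples."""
    n = len(w)
    m = int(round(n * osr))
    W = np.fft.rfft(w)
    Wp = np.zeros(m // 2 + 1, dtype=complex)
    Wp[: len(W)] = W
    return np.fft.irfft(Wp, m) * (m / n)  # rescale so amplitudes are preserved

def fft_downsample(x, n):
    """Inverse step: keep only the lowest n//2+1 frequency bins, invert at length n."""
    X = np.fft.rfft(x)
    return np.fft.irfft(X[: n // 2 + 1], n) * (n / len(x))

def sigma_delta(x, levels=(-1.0, 1.0)):
    """First-order loop u_i = u_{i-1} + x_i - q_{i-1}, q_i = Q(u_i) (nearest level)."""
    lv = np.asarray(levels)
    u, q_prev = 0.0, 0.0
    q = np.empty_like(x)
    for i, xi in enumerate(x):
        u = u + xi - q_prev                      # accumulate the running error
        q_prev = lv[np.argmin(np.abs(lv - u))]   # quantize to the nearest level
        q[i] = q_prev
    return q

rng = np.random.default_rng(0)
w = rng.uniform(-0.3, 0.3, 64)         # toy "weight vector"
q = sigma_delta(fft_upsample(w, 4.0))  # oversample 4x, then 1-bit sigma-delta
w_hat = fft_downsample(q, len(w))      # decimate back to the original length
mae = float(np.mean(np.abs(w_hat - w)))
```

Because decimation discards the high-frequency band where the loop has pushed the quantization error, the reconstruction error shrinks as OSR grows, even though every stored sample is a single bit.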
Sigma-delta quantization shapes quantization noise by pushing it into higher frequencies: formally, taking Z-transforms yields

$$Q(z) = X(z) + (1 - z^{-1})\,E(z),$$

implying high-pass filtering of the quantization noise with noise transfer function $\mathrm{NTF}(z) = 1 - z^{-1}$. When employed in downstream linear operations ($y = Wx$), the low-pass nature of input activations substantially attenuates this high-frequency noise, as established by Parseval's theorem.
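The noise-shaping identity follows in two lines once the quantizer is modeled as additive error, $q_i = u_i + e_i$, inside the first-order loop $u_i = u_{i-1} + x_i - q_{i-1}$. Taking Z-transforms of both relations,

$$U(z) = z^{-1}U(z) + X(z) - z^{-1}Q(z), \qquad Q(z) = U(z) + E(z),$$

and eliminating $U(z) = Q(z) - E(z)$:

$$Q(z) - E(z) = z^{-1}\bigl(Q(z) - E(z)\bigr) + X(z) - z^{-1}Q(z) \;\Longrightarrow\; Q(z) = X(z) + (1 - z^{-1})E(z).$$

The signal passes through unchanged while the error $E(z)$ is differenced, which is exactly the first-order high-pass shaping.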
2. Oversampling Ratio (OSR) Design and Bit-Rate Control
Unlike traditional analog-to-digital sigma-delta schemes that restrict OSR to integer values, SDQ-LLM generalizes OSR to any real value via fractional zero-padding/truncated resampling in the FFT domain. This enables dynamic, continuous control over the trade-off between quantization precision (memory footprint) and model accuracy, with fractional OSR values (e.g., 2.5) being directly supported.
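A minimal sketch of fractional spectral resampling (the function name is illustrative): zero-pad the spectrum for OSR > 1, truncate it for OSR < 1. Because the upsampled signal is bandlimited, upsampling by a fractional factor such as 2.5 and resampling back recovers the original vector exactly:

```python
import numpy as np

def spectral_resample(w, osr):
    """Resample a length-n vector to round(n*osr) samples in the rFFT domain.

    Zero-pads the spectrum when osr > 1 and truncates it when osr < 1,
    so any real-valued OSR (e.g. 2.5) is supported.
    """
    n = len(w)
    m = int(round(n * osr))
    W = np.fft.rfft(w)
    Wp = np.zeros(m // 2 + 1, dtype=complex)
    keep = min(len(W), len(Wp))
    Wp[:keep] = W[:keep]
    return np.fft.irfft(Wp, m) * (m / n)  # amplitude rescaling

w = np.random.default_rng(1).normal(size=40)
up = spectral_resample(w, 2.5)           # 100 samples
back = spectral_resample(up, 40 / 100)   # fractional OSR < 1: back to 40
```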
The effective bit-rate per original weight is determined by the quantizer alphabet size $M$ and the OSR:
- For binary quantization ($M = 2$): $b = \mathrm{OSR} \cdot \log_2 2 = \mathrm{OSR}$
- For general $M$-ary quantization: $b = \mathrm{OSR} \cdot \log_2 M$

where $M = 3$ for ternary. The OSR can be selected to target a prescribed effective bit-rate or model size constraint.
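Assuming the effective bit-rate is $b = \mathrm{OSR} \cdot \log_2 M$ (so ternary at OSR = 1 costs $\log_2 3 \approx 1.58$ bits/weight), selecting an OSR for a given bit budget is a one-line inversion; the helper names are illustrative:

```python
import math

def effective_bits(osr: float, m: int) -> float:
    """Effective bits per original weight for an m-level quantizer at a given OSR."""
    return osr * math.log2(m)

def osr_for_budget(bits: float, m: int) -> float:
    """OSR hitting a target bits/weight budget; may be fractional."""
    return bits / math.log2(m)

ternary_bits = effective_bits(1.0, 3)  # ~1.585 bits/weight
osr_low = osr_for_budget(0.79, 3)      # fractional OSR ~0.5 for 0.79 bits/weight
```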
3. Hadamard-Based Weight Smoothing
To reduce quantization-induced variance, especially from outliers, SDQ-LLM incorporates a blockwise Hadamard transform prior to quantization. For each weight block $W_b$, the transformation is

$$\tilde{W}_b = \frac{1}{\sqrt{k}}\, H_k W_b,$$

where $H_k$ is the $k \times k$ Hadamard matrix with entries in $\{-1, +1\}$, and $H_k H_k^{\top} = kI$, making this an energy-preserving orthogonal rotation up to scale. Empirically, this operation concentrates weight energy in low to mid spectral bands. As sigma-delta pushes quantization noise to high frequencies, this spectral alignment further reduces in-band quantization error.
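A sketch of the blockwise smoothing using the Sylvester construction of $H_k$ (block size, names, and the left-multiplication convention are illustrative assumptions): the scaled transform is orthogonal, so total weight energy is preserved while outlier mass is spread across each block.

```python
import numpy as np

def hadamard(k: int) -> np.ndarray:
    """Sylvester construction of the k x k Hadamard matrix (k must be a power of two)."""
    assert k > 0 and k & (k - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < k:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_smooth(W: np.ndarray, k: int = 64) -> np.ndarray:
    """Rotate each k-row block: W_b -> (1/sqrt(k)) H_k W_b."""
    Hs = hadamard(k) / np.sqrt(k)  # orthogonal, since H H^T = k I
    out = W.copy()
    for r in range(0, W.shape[0] - k + 1, k):
        out[r:r + k] = Hs @ W[r:r + k]
    return out

H8 = hadamard(8)
W = np.random.default_rng(2).normal(size=(128, 16))
Ws = hadamard_smooth(W, 64)
```

Since the rotation is invertible, it can be folded into the adjacent layer (or undone at load time) at no accuracy cost.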
4. MultiOSR: Fine-Grained OSR Allocation
Recognizing heterogeneity in quantization sensitivity across model layers and within linear submodules, SDQ-LLM implements MultiOSR, a fine-grained OSR allocation strategy. For each linear layer $\ell$, sensitivity is measured by the weight variance $\sigma_\ell^2$ and parameter scale $s_\ell$. Higher OSR is allocated to layers with lower variance (higher sensitivity) and larger parameter scale.

The unnormalized allocation score is

$$a_\ell = \frac{s_\ell}{\sigma_\ell^2}.$$

Normalized OSR for each layer is then set by

$$\mathrm{OSR}_\ell = \mathrm{OSR}_{\mathrm{avg}} \cdot \frac{L\, a_\ell}{\sum_{j=1}^{L} a_j},$$

where $\mathrm{OSR}_{\mathrm{avg}}$ is the global OSR budget and $L$ is the number of layers. This allocation proceeds in two stages, initially across layers and subsequently within each layer, to stabilize the assignment.
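A single-stage sketch of the allocation rule (layers with lower variance and larger scale receive more OSR, rescaled so the mean matches the global budget). The score form and the mean-preserving normalization are assumptions of this sketch, and the paper's second, within-layer stage is omitted:

```python
import numpy as np

def multi_osr(variances, scales, osr_avg):
    """Per-layer OSRs: score a_l = s_l / var_l, rescaled so mean(OSR) == osr_avg.

    Assumption: budget is enforced as a simple mean over layers; the paper
    additionally refines the assignment within each layer.
    """
    a = np.asarray(scales, dtype=float) / np.asarray(variances, dtype=float)
    return osr_avg * len(a) * a / a.sum()

# Toy example: layer 0 has lower variance (more sensitive), so it gets a higher OSR.
osrs = multi_osr(variances=[0.1, 0.4], scales=[1.0, 1.0], osr_avg=2.0)
```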
5. Empirical Benchmarks and Comparative Evaluation
SDQ-LLM achieves efficient, low-bit quantization across a range of LLMs, demonstrating competitive perplexity (PPL), zero-shot accuracy, and conversion times against established methods. Key empirical results include:
Perplexity on WikiText2 (Ternary, OSR=2):
| Model | Method | Weight-bit | PPL ↓ |
|---|---|---|---|
| OPT | Full 16-bit | 16 | 14.62 |
| | RTN | 2 | 12782.84 |
| | GPTQ | 2 | 107.65 |
| | PB-LLM | 1.7 | 280.42 |
| | BiLLM | 1.11 | 70.06 |
| | SDQ (OSR=2) | 1.58 | 38.24 |
| LLaMA2-7B | Full 16-bit | 16 | 5.47 |
| | GPTQ | 2 | 63.22 |
| | PB-LLM | 1.7 | 73.99 |
| | BiLLM | 1.08 | 29.06 |
| | SDQ (OSR=2) | 1.58 | 14.06 |
Zero-Shot Accuracy (OPT-6.7B):
- Full 16-bit: 59.07%
- GPTQ 2-bit: 47.81%
- PB-LLM 1.7-bit: 41.56%
- BiLLM 1.11-bit: 41.22%
- SDQ (OSR=2) 1.58-bit: 53.91%
PPL vs OSR (OPT-6.7B):
Perplexity is a monotonically decreasing function of OSR with diminishing returns: PPL ≳ 2000 at the smallest OSR, ≈ 200 at an intermediate value, ≈ 14.9 near OSR = 2, and only marginal improvement beyond that. The optimal memory/accuracy trade-off lies around OSR = $2$–$2.5$.
Ablation on LLaMA3-8B (OSR=2), WikiText2 Perplexity:
- SDQ w/o Hadamard, no MultiOSR: 2434.9
- SDQ +Hadamard only: 20.13
- SDQ +MultiOSR only: 2751.1
- SDQ +Hadamard+MultiOSR: 17.02
Quantization time (OPT, 1.3B–13B):
- SDQ-LLM: 70–540 s
- GPTQ: 90–616 s
- PB-LLM: 140–778 s
- BiLLM: 360–1980 s
6. Summary and Relevance
SDQ-LLM introduces a suite of innovations (continuous OSR, sigma-delta noise shaping, Hadamard smoothing, and MultiOSR allocation) that enable robust, extremely low-bit quantization of LLMs with minimal impact on perplexity or accuracy. The flexibility of OSR provides precise adaptation to memory/VRAM budgets, and the method demonstrates competitive or superior empirical performance compared to prior 1–2-bit quantization schemes on OPT and LLaMA architectures, even at effective bitrates as low as 0.79 bits/weight (Xia et al., 27 Sep 2025).