Sigma-Delta Quantization for LLMs

Updated 26 March 2026
  • The paper introduces SDQ-LLM, a framework that applies sigma-delta quantization with spectral upsampling to enable efficient low-bit representation of neural network weights.
  • It employs dynamic OSR scheduling and Hadamard-based weight smoothing to optimize memory usage and reduce high-frequency quantization noise.
  • Empirical benchmarks demonstrate competitive perplexity and zero-shot accuracy while achieving faster quantization times and lower effective bitrates.

Sigma-Delta Quantization for LLMs (SDQ-LLM) is a framework designed to enable extremely low-bit quantization of LLMs, including binarization (1-bit) and ternarization (1.58-bit), while maintaining linguistic reasoning capacity and allowing flexible adaptation to resource constraints. This is achieved by integrating sigma-delta (Σ–Δ) quantization with spectral upsampling, dynamic oversampling ratio (OSR) allocation, Hadamard-based weight smoothing, and fine-grained OSR scheduling (MultiOSR). SDQ-LLM explicitly replaces linear layer multiplications with additions, enhancing inference efficiency under highly quantized regimes (Xia et al., 27 Sep 2025).

1. Sigma-Delta Quantization Theory

SDQ-LLM applies sigma-delta quantization to neural network weights by first spectrally upsampling each weight vector $W \in \mathbb{R}^d$ by a factor $\mathrm{OSR} = M$, typically via zero-padding in the fast Fourier transform (FFT) domain. The upsampled sequence $x[n]$, for $n = 0, \dots, Md - 1$, is processed by a first-order sigma-delta loop:

$$\begin{aligned} e[n] &= e[n-1] + x[n] - y[n-1], \qquad e[-1] = 0, \\ y[n] &= Q(e[n]), \end{aligned}$$

where $Q(\cdot)$ denotes the quantizer. For 1-bit quantization, $Q(u) = \mathrm{sign}(u) \in \{-1, +1\}$; for ternary, $Q(u) \in \{-1, 0, +1\}$. The sequence $y[n]$ is then decimated, via inverse-FFT-based downsampling, to produce the quantized weight vector $\hat{W}$.
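The loop above can be sketched end-to-end in NumPy. This is a minimal illustration rather than the paper's reference implementation: the FFT upsampling/decimation scaling and the 1-bit sign quantizer are assumptions consistent with the equations above, and the function name is illustrative.

```python
import numpy as np

def sigma_delta_quantize(w, osr=2.0):
    """Minimal first-order sigma-delta quantization of a weight vector.

    Upsamples w by `osr` via zero-padding in the FFT domain, runs the
    first-order loop e[n] = e[n-1] + x[n] - y[n-1], y[n] = sign(e[n]),
    then decimates back by spectral truncation. Illustrative only.
    """
    d = len(w)
    m = int(round(osr * d))  # fractional OSR supported via rounding

    # Spectral upsampling: zero-pad the spectrum; rescale so amplitudes match.
    x = np.fft.irfft(np.fft.rfft(w), n=m) * (m / d)

    # First-order sigma-delta loop with a 1-bit sign quantizer.
    y = np.empty(m)
    e, y_prev = 0.0, 0.0
    for n in range(m):
        e = e + x[n] - y_prev            # accumulate the running error
        y[n] = 1.0 if e >= 0 else -1.0   # Q(e[n]) = sign(e[n])
        y_prev = y[n]

    # Decimation: keep only the original low-frequency band, rescale back.
    w_hat = np.fft.irfft(np.fft.rfft(y)[: d // 2 + 1], n=d) * (d / m)
    return y, w_hat
```

At a sufficiently high OSR the decimated $\hat{W}$ tracks $W$ closely, because the shaped noise falls mostly outside the retained low-frequency band.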

Sigma-delta quantization shapes quantization noise by pushing it into higher frequencies. Formally, taking Z-transforms of the loop equations yields

$$Y(z) = X(z) + (1 - z^{-1})\,E(z),$$

implying high-pass filtering of the quantization noise with noise transfer function $H_e(z) = 1 - z^{-1}$. When the quantized weights enter downstream linear operations ($a\,\hat{W}^\top$), the low-pass nature of input activations substantially attenuates this high-frequency noise, as established by Parseval's theorem.
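The high-pass character of $H_e(z) = 1 - z^{-1}$ can be checked numerically: feed a slowly varying input through the first-order loop and compare the noise power in the lower and upper halves of the spectrum. This is a small self-contained check; the test signal and the half-spectrum split point are arbitrary choices, not from the paper.

```python
import numpy as np

# Run a low-frequency sinusoid through the first-order sigma-delta loop and
# inspect where the quantization noise y - x lands in the spectrum.
n_samp = 4096
n = np.arange(n_samp)
x = 0.5 * np.sin(2 * np.pi * 3 * n / n_samp)  # slow input, |x| < 1

y = np.empty(n_samp)
e, y_prev = 0.0, 0.0
for i in range(n_samp):
    e = e + x[i] - y_prev
    y[i] = 1.0 if e >= 0 else -1.0
    y_prev = y[i]

noise = y - x                            # shaped quantization noise
spec = np.abs(np.fft.rfft(noise)) ** 2   # noise power spectrum
half = len(spec) // 2
low, high = spec[1:half].sum(), spec[half:].sum()
print(f"low-band / high-band noise power ratio: {low / high:.3f}")
```

For an ideal white error shaped by $|1 - e^{-j\omega}|^2$, the low/high power ratio works out to $(\pi - 2)/(\pi + 2) \approx 0.22$, i.e. most of the noise energy sits in the upper half of the band.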

2. Oversampling Ratio (OSR) Design and Bit-Rate Control

Unlike traditional analog-to-digital sigma-delta schemes that restrict the OSR to integer values, SDQ-LLM generalizes the OSR $M$ to any real value $M > 1$ via fractional zero-padding/truncated resampling in the FFT domain. This enables dynamic, continuous control over the trade-off between quantization precision (memory footprint) and model accuracy, with fractional OSR values (e.g., 2.5) directly supported.

The effective bit-rate per original weight is determined by the quantizer and the OSR:

  • For binary quantization ($Q \in \{-1, +1\}$):

$$b_{\mathrm{eff}} = \frac{1}{M}$$

  • For an $N$-level quantizer in general:

$$b_{\mathrm{eff}} = \frac{\log_2 N}{M}$$

so that ternary quantization ($N = 3$) costs $\log_2 3 \approx 1.58$ bits per sample. The OSR can be selected to target a prescribed effective bit-rate or model size constraint.
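With these formulas, choosing the OSR for a desired memory footprint is simple arithmetic. A minimal helper sketch, assuming $b_{\rm eff} = \log_2(\text{levels}) / M$ as above (the function names are illustrative, not from the paper):

```python
import math

def effective_bitrate(osr, levels=2):
    """Bits per original weight: log2(levels) / OSR (reduces to 1/M for binary)."""
    return math.log2(levels) / osr

def osr_for_target_bitrate(target_bits, levels=2):
    """Invert the relation: the (possibly fractional) OSR hitting target_bits."""
    return math.log2(levels) / target_bits
```

For ternary weights at OSR = 2 this gives $\log_2 3 / 2 \approx 0.79$ bits per weight, matching the effective rate quoted in the summary below.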

3. Hadamard-Based Weight Smoothing

To reduce quantization-induced variance, and in particular the impact of outliers, SDQ-LLM incorporates a blockwise Hadamard transform prior to quantization. For each block $W \in \mathbb{R}^{B \times B}$, the transformation is

$$W' = H\,W\,H^\top$$

where $H$ is a Hadamard matrix with entries in $\{\pm 1\}$ satisfying $H H^\top = B I$, making this an energy-preserving orthogonal rotation up to scale. Empirically, this operation concentrates weight energy in low-to-mid spectral bands. Since sigma-delta pushes quantization noise to high frequencies, this spectral alignment further reduces in-band quantization error.
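A minimal sketch of the blockwise rotation and its inverse, using the Sylvester construction for $H$ (this restricts the block size to a power of two; helper names are illustrative):

```python
import numpy as np

def hadamard_matrix(b):
    """Sylvester-construction Hadamard matrix; b must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < b:
        h = np.block([[h, h], [h, -h]])
    return h

def hadamard_smooth(w):
    """Blockwise rotation W' = H W H^T for a B x B block."""
    b = w.shape[0]
    h = hadamard_matrix(b)
    return h @ w @ h.T

def hadamard_unsmooth(wp):
    """Inverse rotation: since H H^T = B I, W = H^T W' H / B^2."""
    b = wp.shape[0]
    h = hadamard_matrix(b)
    return h.T @ wp @ h / b**2
```

Because $H H^\top = B I$, the forward transform scales the Frobenius norm by exactly $B$ and is losslessly invertible, so smoothing changes only how the weight energy is distributed, not how much there is.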

4. MultiOSR: Fine-Grained OSR Allocation

Recognizing heterogeneity in quantization sensitivity across model layers and within linear submodules, SDQ-LLM implements MultiOSR, a fine-grained OSR allocation strategy. For each linear layer $W_{l,i}$ in block $l$, sensitivity is measured by the variance $\sigma^2_{l,i} = \mathrm{Var}(W_{l,i})$ and the scale $s_{l,i} = \|W_{l,i}\|_F$. Higher OSR is allocated to layers with lower variance (higher sensitivity) and larger parameter scale.

The unnormalized allocation score is

$$r_{l,i} = \frac{\sigma^2_{l,i}}{\sum_{l',j}\sigma^2_{l',j}} \times \frac{s_{l,i}}{\sum_{l',j} s_{l',j}}$$

Normalized OSR for each layer is then set by

$$\mathrm{OSR}_{l,i} = \overline{M} \cdot \frac{r_{l,i}}{\sum_{l',j} r_{l',j}}$$

where $\overline{M}$ is the global OSR budget. This allocation proceeds in two stages, first across layers and then within each layer, to stabilize the assignment.
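The score and normalization can be written down directly. The sketch below is a single-stage simplification (the two-stage layer-then-sublayer refinement is omitted) and follows the stated formula literally, under which the returned OSRs sum to the budget $\overline{M}$; the function name is illustrative.

```python
import numpy as np

def multi_osr_allocate(weight_mats, global_osr):
    """Allocate per-layer OSRs from the MultiOSR score
    r = (var / sum var) * (frob / sum frob), normalized to the budget.
    Single-stage sketch of the allocation described above."""
    var = np.array([w.var() for w in weight_mats])       # sigma^2_{l,i}
    frob = np.array([np.linalg.norm(w) for w in weight_mats])  # s_{l,i}
    r = (var / var.sum()) * (frob / frob.sum())
    return global_osr * r / r.sum()
```

As written, the score grows with both variance and Frobenius scale, so the layer with the largest score receives the largest share of the OSR budget.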

5. Empirical Benchmarks and Comparative Evaluation

SDQ-LLM achieves efficient, low-bit quantization across a range of LLMs, demonstrating competitive perplexity (PPL), zero-shot accuracy, and conversion times against established methods. Key empirical results include:

Perplexity on WikiText2 (Ternary, OSR=2):

| Model     | Method      | Weight-bit | PPL ↓    |
|-----------|-------------|------------|----------|
| OPT       | Full 16-bit | 16         | 14.62    |
| OPT       | RTN         | 2          | 12782.84 |
| OPT       | GPTQ        | 2          | 107.65   |
| OPT       | PB-LLM      | 1.7        | 280.42   |
| OPT       | BiLLM       | 1.11       | 70.06    |
| OPT       | SDQ (OSR=2) | 1.58       | 38.24    |
| LLaMA2-7B | Full 16-bit | 16         | 5.47     |
| LLaMA2-7B | GPTQ        | 2          | 63.22    |
| LLaMA2-7B | PB-LLM      | 1.7        | 73.99    |
| LLaMA2-7B | BiLLM       | 1.08       | 29.06    |
| LLaMA2-7B | SDQ (OSR=2) | 1.58       | 14.06    |

Zero-Shot Accuracy (OPT-6.7B):

  • Full 16-bit: 59.07%
  • GPTQ 2-bit: 47.81%
  • PB-LLM 1.7-bit: 41.56%
  • BiLLM 1.11-bit: 41.22%
  • SDQ (OSR=2) 1.58-bit: 53.91%

PPL vs OSR (OPT-6.7B):

WikiText2 perplexity $\mathrm{PPL}_{\mathrm{Wiki2}}(M)$ is a steeply decreasing function of $M \in [1, 4]$: PPL ≳ 2000 at $M = 1$, PPL ≈ 200 at $M = 1.5$, and PPL ≈ 14.9 at $M = 2$, with diminishing improvements beyond $M = 2.5$. The best memory/accuracy trade-off lies around $M \approx 2$–$2.5$.

Ablation on LLaMA3-8B (OSR=2), WikiText2 Perplexity:

  • SDQ w/o Hadamard, no MultiOSR: 2434.9
  • SDQ +Hadamard only: 20.13
  • SDQ +MultiOSR only: 2751.1
  • SDQ +Hadamard+MultiOSR: 17.02

Quantization time (OPT, 1.3B–13B):

  • SDQ-LLM: 70–540 s
  • GPTQ: 90–616 s
  • PB-LLM: 140–778 s
  • BiLLM: 360–1980 s

6. Summary and Relevance

SDQ-LLM introduces a suite of innovations (continuous OSR, sigma-delta noise shaping, Hadamard smoothing, and MultiOSR allocation) that enable robust, extremely low-bit quantization of LLMs with minimal impact on perplexity or accuracy. The OSR's flexibility provides precise adaptation to memory/VRAM budgets, and the method demonstrates competitive or superior empirical performance compared to prior 1–2-bit quantization schemes on OPT and LLaMA architectures, even at effective bitrates as low as 0.79 bits/weight (Xia et al., 27 Sep 2025).
