Sigma-Delta Quantization for LLMs
- The paper introduces SDQ-LLM, a framework that applies sigma-delta quantization with spectral upsampling to enable efficient low-bit representation of neural network weights.
- It employs dynamic OSR scheduling and Hadamard-based weight smoothing to optimize memory usage and reduce high-frequency quantization noise.
- Empirical benchmarks demonstrate competitive perplexity and zero-shot accuracy while achieving faster quantization times and lower effective bitrates.
Sigma-Delta Quantization for LLMs (SDQ-LLM) is a framework designed to enable extremely low-bit quantization of LLMs, including binarization (1-bit) and ternarization (1.58-bit), while maintaining linguistic reasoning capacity and allowing flexible adaptation to resource constraints. This is achieved by integrating sigma-delta (Σ–Δ) quantization with spectral upsampling, dynamic oversampling ratio (OSR) allocation, Hadamard-based weight smoothing, and fine-grained OSR scheduling (MultiOSR). SDQ-LLM explicitly replaces linear layer multiplications with additions, enhancing inference efficiency under highly quantized regimes (Xia et al., 27 Sep 2025).
1. Sigma-Delta Quantization Theory
SDQ-LLM applies sigma-delta quantization to neural network weights by first spectrally upsampling each weight vector $w \in \mathbb{R}^{n}$ by a factor OSR, typically via zero-padding in the fast Fourier transform (FFT) domain. The upsampled sequence $x_i$, for $i = 1, \dots, \lceil n \cdot \mathrm{OSR} \rceil$, is processed by a first-order sigma-delta loop:

$$u_i = u_{i-1} + x_i - q_{i-1}, \qquad q_i = Q(u_i),$$

where $Q(\cdot)$ denotes the quantizer. For 1-bit quantization, $Q(u) = \operatorname{sign}(u) \in \{-1, +1\}$, and for ternary, $Q(u) \in \{-1, 0, +1\}$. The sequence $q_i$ is then decimated, via inverse FFT-based downsampling, to produce the quantized weight vector $\hat{w}$.
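The upsample, sigma-delta, decimate pipeline can be sketched in NumPy. The function names, the nearest-level 1-bit quantizer, and the toy OSR of 4 below are illustrative choices, not the paper's implementation:

```python
import numpy as np

def fft_upsample(w, osr):
    """Spectral upsampling: zero-pad the rFFT of w out to round(len(w)*osr) samples."""
    n = len(w)
    m = int(round(n * osr))
    W = np.fft.rfft(w)
    Wp = np.zeros(m // 2 + 1, dtype=complex)
    Wp[: len(W)] = W
    return np.fft.irfft(Wp, m) * (m / n)  # rescale so amplitudes are preserved

def fft_downsample(x, n):
    """Inverse step: keep only the lowest n//2+1 frequency bins, invert at length n."""
    X = np.fft.rfft(x)
    return np.fft.irfft(X[: n // 2 + 1], n) * (n / len(x))

def sigma_delta(x, levels=(-1.0, 1.0)):
    """First-order loop u_i = u_{i-1} + x_i - q_{i-1}, q_i = Q(u_i) (nearest level)."""
    lv = np.asarray(levels)
    u, q_prev = 0.0, 0.0
    q = np.empty_like(x)
    for i, xi in enumerate(x):
        u = u + xi - q_prev                      # accumulate the running error
        q_prev = lv[np.argmin(np.abs(lv - u))]   # quantize to the nearest level
        q[i] = q_prev
    return q

rng = np.random.default_rng(0)
w = rng.uniform(-0.3, 0.3, 64)         # toy "weight vector"
q = sigma_delta(fft_upsample(w, 4.0))  # oversample 4x, then 1-bit sigma-delta
w_hat = fft_downsample(q, len(w))      # decimate back to the original length
mae = float(np.mean(np.abs(w_hat - w)))
```

Because decimation discards the high-frequency band where the loop has pushed the quantization error, the reconstruction error shrinks as OSR grows, even though every stored sample is a single bit.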
Sigma-delta quantization shapes quantization noise by pushing it into higher frequencies: formally, taking Z-transforms yields

$$Q(z) = X(z) + (1 - z^{-1})\,E(z),$$

implying high-pass filtering of the quantization noise with noise transfer function $\mathrm{NTF}(z) = 1 - z^{-1}$. When employed in downstream linear operations ($y = Wx$), the low-pass nature of input activations substantially attenuates this high-frequency noise, as established by Parseval's theorem.
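The noise-shaping identity follows in two lines once the quantizer is modeled as additive error, $q_i = u_i + e_i$, inside the first-order loop $u_i = u_{i-1} + x_i - q_{i-1}$. Taking Z-transforms of both relations,

$$U(z) = z^{-1}U(z) + X(z) - z^{-1}Q(z), \qquad Q(z) = U(z) + E(z),$$

and eliminating $U(z) = Q(z) - E(z)$:

$$Q(z) - E(z) = z^{-1}\bigl(Q(z) - E(z)\bigr) + X(z) - z^{-1}Q(z) \;\Longrightarrow\; Q(z) = X(z) + (1 - z^{-1})E(z).$$

The signal passes through unchanged while the error $E(z)$ is differenced, which is exactly the first-order high-pass shaping.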
2. Oversampling Ratio (OSR) Design and Bit-Rate Control
Unlike traditional analog-to-digital sigma-delta schemes that restrict OSR to integer values, SDQ-LLM generalizes OSR to any real value via fractional zero-padding/truncated resampling in the FFT domain. This enables dynamic, continuous control over the trade-off between quantization precision (memory footprint) and model accuracy, with fractional OSR values (e.g., 2.5) being directly supported.
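A minimal sketch of fractional spectral resampling (the function name is illustrative): zero-pad the spectrum for OSR > 1, truncate it for OSR < 1. Because the upsampled signal is bandlimited, upsampling by a fractional factor such as 2.5 and resampling back recovers the original vector exactly:

```python
import numpy as np

def spectral_resample(w, osr):
    """Resample a length-n vector to round(n*osr) samples in the rFFT domain.

    Zero-pads the spectrum when osr > 1 and truncates it when osr < 1,
    so any real-valued OSR (e.g. 2.5) is supported.
    """
    n = len(w)
    m = int(round(n * osr))
    W = np.fft.rfft(w)
    Wp = np.zeros(m // 2 + 1, dtype=complex)
    keep = min(len(W), len(Wp))
    Wp[:keep] = W[:keep]
    return np.fft.irfft(Wp, m) * (m / n)  # amplitude rescaling

w = np.random.default_rng(1).normal(size=40)
up = spectral_resample(w, 2.5)           # 100 samples
back = spectral_resample(up, 40 / 100)   # fractional OSR < 1: back to 40
```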
The effective bit-rate per original weight is determined by the quantizer alphabet size $M$ and the OSR:
- For binary quantization ($M = 2$): $b = \mathrm{OSR} \cdot \log_2 2 = \mathrm{OSR}$
- For general $M$-ary quantization: $b = \mathrm{OSR} \cdot \log_2 M$

where $M = 3$ for ternary. The OSR can be selected to target a prescribed effective bit-rate or model size constraint.
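Assuming the effective bit-rate is $b = \mathrm{OSR} \cdot \log_2 M$ (so ternary at OSR = 1 costs $\log_2 3 \approx 1.58$ bits/weight), selecting an OSR for a given bit budget is a one-line inversion; the helper names are illustrative:

```python
import math

def effective_bits(osr: float, m: int) -> float:
    """Effective bits per original weight for an m-level quantizer at a given OSR."""
    return osr * math.log2(m)

def osr_for_budget(bits: float, m: int) -> float:
    """OSR hitting a target bits/weight budget; may be fractional."""
    return bits / math.log2(m)

ternary_bits = effective_bits(1.0, 3)  # ~1.585 bits/weight
osr_low = osr_for_budget(0.79, 3)      # fractional OSR ~0.5 for 0.79 bits/weight
```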
3. Hadamard-Based Weight Smoothing
To reduce quantization-induced variance, especially from outliers, SDQ-LLM incorporates a blockwise Hadamard transform prior to quantization. For each weight block $W_b$, the transformation is

$$\tilde{W}_b = \frac{1}{\sqrt{k}}\, H_k W_b,$$

where $H_k$ is the $k \times k$ Hadamard matrix with entries in $\{-1, +1\}$, and $H_k H_k^{\top} = kI$, making this an energy-preserving orthogonal rotation up to scale. Empirically, this operation concentrates weight energy in low to mid spectral bands. As sigma-delta pushes quantization noise to high frequencies, this spectral alignment further reduces in-band quantization error.
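A sketch of the blockwise smoothing using the Sylvester construction of $H_k$ (block size, names, and the left-multiplication convention are illustrative assumptions): the scaled transform is orthogonal, so total weight energy is preserved while outlier mass is spread across each block.

```python
import numpy as np

def hadamard(k: int) -> np.ndarray:
    """Sylvester construction of the k x k Hadamard matrix (k must be a power of two)."""
    assert k > 0 and k & (k - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < k:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_smooth(W: np.ndarray, k: int = 64) -> np.ndarray:
    """Rotate each k-row block: W_b -> (1/sqrt(k)) H_k W_b."""
    Hs = hadamard(k) / np.sqrt(k)  # orthogonal, since H H^T = k I
    out = W.copy()
    for r in range(0, W.shape[0] - k + 1, k):
        out[r:r + k] = Hs @ W[r:r + k]
    return out

H8 = hadamard(8)
W = np.random.default_rng(2).normal(size=(128, 16))
Ws = hadamard_smooth(W, 64)
```

Since the rotation is invertible, it can be folded into the adjacent layer (or undone at load time) at no accuracy cost.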
4. MultiOSR: Fine-Grained OSR Allocation
Recognizing heterogeneity in quantization sensitivity across model layers and within linear submodules, SDQ-LLM implements MultiOSR, a fine-grained OSR allocation strategy. For each linear layer $\ell$, sensitivity is measured by the weight variance $\sigma_\ell^2$ and parameter scale $s_\ell$. Higher OSR is allocated to layers with lower variance (higher sensitivity) and larger parameter scale.

The unnormalized allocation score is

$$a_\ell = \frac{s_\ell}{\sigma_\ell^2}.$$

Normalized OSR for each layer is then set by

$$\mathrm{OSR}_\ell = \mathrm{OSR}_{\mathrm{avg}} \cdot \frac{L\, a_\ell}{\sum_{j=1}^{L} a_j},$$

where $\mathrm{OSR}_{\mathrm{avg}}$ is the global OSR budget and $L$ is the number of layers. This allocation proceeds in two stages, initially across layers and subsequently within each layer, to stabilize the assignment.
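A single-stage sketch of the allocation rule (layers with lower variance and larger scale receive more OSR, rescaled so the mean matches the global budget). The score form and the mean-preserving normalization are assumptions of this sketch, and the paper's second, within-layer stage is omitted:

```python
import numpy as np

def multi_osr(variances, scales, osr_avg):
    """Per-layer OSRs: score a_l = s_l / var_l, rescaled so mean(OSR) == osr_avg.

    Assumption: budget is enforced as a simple mean over layers; the paper
    additionally refines the assignment within each layer.
    """
    a = np.asarray(scales, dtype=float) / np.asarray(variances, dtype=float)
    return osr_avg * len(a) * a / a.sum()

# Toy example: layer 0 has lower variance (more sensitive), so it gets a higher OSR.
osrs = multi_osr(variances=[0.1, 0.4], scales=[1.0, 1.0], osr_avg=2.0)
```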
5. Empirical Benchmarks and Comparative Evaluation
SDQ-LLM achieves efficient, low-bit quantization across a range of LLMs, demonstrating competitive perplexity (PPL), zero-shot accuracy, and conversion times against established methods. Key empirical results include:
Perplexity on WikiText2 (Ternary, OSR=2):
| Model | Method | Weight-bit | PPL ↓ |
|---|---|---|---|
| OPT | Full 16-bit | 16 | 14.62 |
| | RTN | 2 | 12782.84 |
| | GPTQ | 2 | 107.65 |
| | PB-LLM | 1.7 | 280.42 |
| | BiLLM | 1.11 | 70.06 |
| | SDQ (OSR=2) | 1.58 | 38.24 |
| LLaMA2-7B | Full 16-bit | 16 | 5.47 |
| | GPTQ | 2 | 63.22 |
| | PB-LLM | 1.7 | 73.99 |
| | BiLLM | 1.08 | 29.06 |
| | SDQ (OSR=2) | 1.58 | 14.06 |
Zero-Shot Accuracy (OPT-6.7B):
- Full 16-bit: 59.07%
- GPTQ 2-bit: 47.81%
- PB-LLM 1.7-bit: 41.56%
- BiLLM 1.11-bit: 41.22%
- SDQ (OSR=2) 1.58-bit: 53.91%
PPL vs OSR (OPT-6.7B):
Perplexity is a monotonically decreasing function of OSR with diminishing returns: PPL ≳ 2000 at the smallest OSR, ≈ 200 at an intermediate value, ≈ 14.9 near OSR = 2, and only marginal improvement beyond that. The optimal memory/accuracy trade-off lies around OSR = $2$–$2.5$.
Ablation on LLaMA3-8B (OSR=2), WikiText2 Perplexity:
- SDQ w/o Hadamard, no MultiOSR: 2434.9
- SDQ +Hadamard only: 20.13
- SDQ +MultiOSR only: 2751.1
- SDQ +Hadamard+MultiOSR: 17.02
Quantization time (OPT, 1.3B–13B):
- SDQ-LLM: 70–540 s
- GPTQ: 90–616 s
- PB-LLM: 140–778 s
- BiLLM: 360–1980 s
6. Summary and Relevance
SDQ-LLM introduces a suite of innovations (continuous OSR, sigma-delta noise shaping, Hadamard smoothing, and MultiOSR allocation) that enable robust, extremely low-bit quantization of LLMs with minimal impact on perplexity or accuracy. The flexibility of OSR provides precise adaptation to memory/VRAM budgets, and the method demonstrates competitive or superior empirical performance compared to prior 1–2-bit quantization schemes on OPT and LLaMA architectures, even at effective bitrates as low as 0.79 bits/weight (Xia et al., 27 Sep 2025).