Sigma–Delta Quantization (SDQ-LLM)
- SDQ-LLM is a quantization framework that applies classical sigma–delta noise shaping, oversampling, and error-feedback to convert high-precision LLM weights into 1-bit or ternary representations.
- It incorporates Hadamard-based weight smoothing and MultiOSR allocation to minimize quantization error while balancing model compression and accuracy.
- Empirical results show that SDQ-LLM achieves lower perplexity and faster computation compared to uniform quantization and other state-of-the-art methods.
Sigma–Delta Quantization (SDQ-LLM) is a quantization framework that extends classical Sigma–Delta (Σ–Δ) noise-shaping principles from analog-to-digital conversion and compressed sensing to the low-bit quantization of LLMs. SDQ-LLM leverages oversampling, error-feedback noise shaping, and structured resampling to encode high-precision neural network weights into ultra-low-bit formats—typically 1-bit (binarized) or ternary. The method enables a controlled trade-off between model compression and inference accuracy through a tunable “Over-Sampling Ratio” (OSR), and incorporates additional techniques such as Hadamard weight smoothing and variance-aware multi-level OSR allocation. Theoretical underpinnings build on robust recovery guarantees for noise-shaped quantization in random frames and compressed sensing, while recent extensions demonstrate practical scaling and empirical performance in state-of-the-art transformer-based LLMs.
1. Sigma–Delta Quantization: Fundamentals and Contrast with Uniform Approaches
Uniform (memoryless) quantization methods such as Round-To-Nearest (RTN) independently map each weight to the nearest codebook entry, producing uncorrelated quantization errors across model parameters. In contrast, Sigma–Delta quantization introduces memory and feedback, shaping quantization noise to higher frequencies where it can later be suppressed by subsequent averaging or low-pass operations.
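For a sequence $w_n$, the core first-order Σ–Δ quantizer maintains an internal state $u_n$, yielding outputs
$$q_n = \mathcal{Q}(u_{n-1} + w_n), \qquad u_n = u_{n-1} + w_n - q_n,$$
with $\mathcal{Q}$ a low-bit quantizer. In the $z$-domain, with $Q(z)$, $W(z)$, and $U(z)$ the transforms of $q_n$, $w_n$, and $u_n$, this can be analyzed as
$$Q(z) = W(z) - (1 - z^{-1})\,U(z),$$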
which enforces that quantization noise is high-passed, concentrating error power at (discrete) high frequencies. When weights are later applied in neural network computations, and these computations are robust to high-frequency weight perturbations, this results in reduced in-band quantization error relative to uniform quantization (Xia et al., 27 Sep 2025).
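The contrast with memoryless rounding can be illustrated with a minimal sketch. It assumes a 1-bit codebook {-1, +1} and inputs in [-1, 1]; the function names `sigma_delta_1bit` and `rtn_1bit` are hypothetical and are not taken from the SDQ-LLM code. Because the state carries the residual forward, the running average of the Σ–Δ output tracks the input average, whereas round-to-nearest repeats the same error at every sample.

```python
import numpy as np

def sigma_delta_1bit(w):
    """First-order sigma-delta quantization of a sequence onto {-1, +1}."""
    u = 0.0                          # internal state: accumulated residual
    q = np.empty_like(w)
    for n, w_n in enumerate(w):
        v = u + w_n                  # state plus new input
        q[n] = 1.0 if v >= 0 else -1.0
        u = v - q[n]                 # carry the residual to the next sample
    return q

def rtn_1bit(w):
    """Memoryless round-to-nearest onto the same {-1, +1} codebook."""
    return np.where(w >= 0, 1.0, -1.0)

w = np.full(64, 0.3)                 # slowly varying (here constant) input
sd_err = abs(sigma_delta_1bit(w).mean() - w.mean())
rtn_err = abs(rtn_1bit(w).mean() - w.mean())
# The sigma-delta state stays in [-1, 1], so sd_err is at most 1/64,
# while RTN maps every sample to +1 and keeps the full per-sample error.
print(sd_err, rtn_err)
```

The averaging step plays the role of the low-pass operation mentioned above: it removes the high-frequency error that the feedback loop produces.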
2. Mathematical Formulation of SDQ-LLM Workflow
SDQ-LLM applies Sigma–Delta quantization in the following structured pipeline:
- Weight Block Preprocessing: Each row of a weight matrix is treated as a digital signal.
- Resampling (Upsampling): Each row is “resampled” (e.g., by FFT zero-padding) from length $n$ to $n \times \mathrm{OSR}$, spreading the information and providing extra degrees of freedom for noise shaping.
- Σ–Δ Loop: The oversampled sequence is quantized via a Σ–Δ loop, producing a binarized or ternarized quantized block.
- Downsampling: The quantized block is returned to the original dimension by frequency-domain truncation or averaging.
- Error-Feedback Correction: Block-wise error-feedback, similar to that used in GPTQ, is applied: the quantization error for each block is computed and propagated to future blocks using inverse Hessian-based correction (Xia et al., 27 Sep 2025).
During inference, activations are upsampled via FFT or interpolation to match the oversampling factor, ensuring correct linear-compute alignment.
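The resampling and Σ–Δ steps above can be sketched for a single row as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the helper names (`fft_resample`, `sigma_delta_ternary`, `sdq_row`) are hypothetical, the loop is first-order, the ternary step size is taken as the row's maximum magnitude after upsampling, and the GPTQ-style Hessian-based error feedback is omitted entirely.

```python
import numpy as np

def fft_resample(x, m):
    """Resample a real 1-D signal to length m by zero-padding or truncating its spectrum."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    bins = m // 2 + 1
    out = np.zeros(bins, dtype=complex)
    k = min(bins, len(spectrum))
    out[:k] = spectrum[:k]
    # irfft normalizes by the output length, so rescale to keep amplitudes.
    return np.fft.irfft(out, n=m) * (m / n)

def sigma_delta_ternary(x, step):
    """First-order sigma-delta loop with codebook {-step, 0, +step}."""
    u = 0.0
    q = np.empty_like(x)
    for i, x_i in enumerate(x):
        v = u + x_i
        q[i] = step * np.clip(np.round(v / step), -1, 1)
        u = v - q[i]                       # residual fed back to the next sample
    return q

def sdq_row(row, osr):
    """Upsample a row, quantize it with sigma-delta, and return codes plus a downsampled view."""
    n = len(row)
    m = int(round(n * osr))                # fractional OSR is allowed
    up = fft_resample(row, m)
    step = np.max(np.abs(up)) or 1.0       # assumed per-row scale; guards an all-zero row
    codes = sigma_delta_ternary(up, step)
    return codes, fft_resample(codes, n)   # spectrum truncation back to length n

rng = np.random.default_rng(0)
row = rng.standard_normal(16)
codes, approx = sdq_row(row, osr=1.5)
print(codes.shape, approx.shape)
```

Without quantization, upsampling followed by truncation returns the original row, so any difference between `approx` and `row` comes from the quantizer alone.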
3. Over-Sampling Ratio (OSR): Continuous Trade-Off and Compression Metrics
Unlike analog Σ–Δ ADCs, which restrict OSR to integral values, SDQ-LLM permits fractional OSR to finely adjust compression–accuracy trade-offs: 0 where 1 is the bit-width of the quantizer (2 for ternary). For example, OSR=1.5 with ternary quantization achieves a compression ratio of approximately 3 relative to 16-bit float weights. Empirical results demonstrate a smooth, concave relation between perplexity and OSR, enabling dynamic adaptation to hardware or memory constraints without re-training (Xia et al., 27 Sep 2025).
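The storage arithmetic behind this trade-off can be sketched as follows. The sketch assumes that each original weight is represented by OSR quantized samples of a fixed number of bits each, so the stored cost is OSR times the bit-width; this accounting and the function names are assumptions for illustration and may differ from the paper's exact metric.

```python
def effective_bits(osr: float, bits_per_sample: float) -> float:
    """Stored bits per original weight, assuming OSR samples of fixed width."""
    return osr * bits_per_sample

def compression_ratio(osr: float, bits_per_sample: float,
                      baseline_bits: float = 16.0) -> float:
    """Size reduction relative to a baseline format (16-bit floats by default)."""
    return baseline_bits / effective_bits(osr, bits_per_sample)

# Sweep a few fractional OSR values for a 2-bit container per ternary sample.
for osr in (1.0, 1.5, 2.0, 2.5):
    print(osr, effective_bits(osr, 2.0), compression_ratio(osr, 2.0))
```

Under this accounting, lowering OSR in small steps reduces the footprint in proportion, which is what makes a real-valued OSR useful as a tuning knob.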
4. Noise Shaping, Hadamard Weight Smoothing, and MultiOSR
Extremely low-bit quantization can suffer from large errors due to heavy-tailed weight distributions in LLMs. SDQ-LLM incorporates:
- Hadamard-Based Weight Smoothing: Applying a randomized Hadamard transform to each weight block flattens outlier weights, compresses energy to low/mid frequencies, and makes the overall weight distribution more amenable to high-frequency quantization noise shaping. After quantization, the inverse transform restores original basis alignment. Empirical ablations show that Hadamard smoothing is critical for usable accuracy at low OSR, e.g., reducing PPL from 4 to 5 in LLaMA3-8B models at OSR=2 (Xia et al., 27 Sep 2025).
- MultiOSR (Layer- and Linear-Wise OSR Allocation): By analyzing weight variance at both layer and submodule granularity, OSR is distributed preferentially to blocks with lower variance and/or larger parameter size, which are more sensitive to quantization noise. Exact allocation is conducted by:
6
and analogous formulas within each submodule, with normalization to maintain a global OSR budget. This fine-grained allocation further reduces perplexity when combined with smoothing.
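The smoothing effect of a randomized Hadamard transform can be illustrated with a small sketch. It assumes a block length that is a power of two, random ±1 sign flips, and an orthonormal Sylvester-type Hadamard matrix; the helper names are hypothetical and the paper's exact blocking may differ.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def randomized_hadamard(block, signs):
    """Flip signs, then rotate with the orthonormal Hadamard matrix."""
    n = block.shape[-1]
    return (block * signs) @ hadamard(n) / np.sqrt(n)

def inverse_randomized_hadamard(smoothed, signs):
    """Undo the rotation and the sign flips to return to the original basis."""
    n = smoothed.shape[-1]
    return (smoothed @ hadamard(n).T / np.sqrt(n)) * signs

rng = np.random.default_rng(0)
n = 64
signs = rng.choice([-1.0, 1.0], size=n)
block = np.zeros(n)
block[3] = 8.0                                  # a single outlier weight
smoothed = randomized_hadamard(block, signs)    # outlier energy spread over all entries
restored = inverse_randomized_hadamard(smoothed, signs)
print(np.max(np.abs(block)), np.max(np.abs(smoothed)))
```

Because the transform is orthogonal, the block's norm is unchanged; only the peak magnitude drops, which is what makes a coarse quantizer's fixed step size fit the data better.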
5. Theoretical Guarantees for Sigma–Delta Quantization: Compressed Sensing Foundations
Classical Σ–Δ quantization in random frames and compressed sensing exploits oversampling to achieve error decay rates unattainable by memoryless quantization:
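- For $r$th-order Σ–Δ quantization of measurements taken with an $m \times N$ random Gaussian compressed sensing matrix, Sobolev-dual frame reconstruction achieves:
$$\|x - \hat{x}\|_2 \lesssim \delta \left(\frac{m}{k}\right)^{-\alpha (r - 1/2)}$$
for any $\alpha \in (0, 1)$, provided $m \gtrsim k (\log N)^{1/(1-\alpha)}$ (Güntürk et al., 2010). Here, $\delta$ is the quantizer step size, $k$ is the signal sparsity, and $m$ is the number of measurements.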
- In the PTQ regime, polynomial and even root-exponential decay of error in the oversampling ratio are achievable, substantially outperforming the 5 “flat” error plateau of memoryless schemes (Saab et al., 2015).
A plausible implication is that such noise-shaping guarantees provide the theoretical justification for the observed empirical improvements in SDQ-LLM when compared to uniform low-bit quantization in high-dimensional neural settings.
6. Empirical Performance and Implementation Notes
Experimental results confirm the practical efficacy of SDQ-LLM:
- On WikiText2, applying SDQ-LLM to OPT-1.3B with OSR=2 and ternary quantization yields PPL 6, far outperforming 2-bit RTN (catastrophic) and GPTQ (7), and achieving 8 lower PPL than BiLLM, despite lower average bitwidth (Xia et al., 27 Sep 2025).
- On zero-shot downstream tasks using OPT and LLaMA families (1–13B), SDQ-LLM recovers 9–0 of full-precision accuracy with OSR=2, while outperforming 2-bit GPTQ, PB-LLM, and BiLLM.
- Quantization time for SDQ-LLM's PTQ on OPT-13B is 1s, faster than PB-LLM or BiLLM and only modestly slower than GPTQ.
- Multiplications in quantized matmuls reduce to additions and bit-packing operations, affording direct computational savings on suitable hardware.
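The last point can be made concrete with a small sketch of a ternary matrix-vector product. It assumes codes stored as integers in {-1, 0, +1} with a single scalar scale per matrix; the function name `ternary_matvec` is hypothetical, and real kernels would operate on bit-packed codes rather than NumPy masks.

```python
import numpy as np

def ternary_matvec(codes, x, scale):
    """Compute scale * (codes @ x) for codes in {-1, 0, +1} using only adds and subtracts."""
    y = np.zeros(codes.shape[0])
    for i, row in enumerate(codes):
        # Add activations where the code is +1, subtract where it is -1, skip zeros.
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return scale * y                     # one multiply per output element

rng = np.random.default_rng(0)
codes = rng.integers(-1, 2, size=(4, 8))  # entries drawn from {-1, 0, 1}
x = rng.standard_normal(8)
y = ternary_matvec(codes, x, scale=0.05)
print(y)
```

The inner loop contains no weight-activation multiplications; the only multiply left is the per-output rescaling.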
7. Limitations and Future Directions
SDQ-LLM, as presently constructed, is most efficient for OSR 2; lower OSR values still yield high error as measured by PPL. Potential advancements include higher-order Σ–Δ architectures and learned, task-specific OSR schedules (potentially via QAT). There is significant scope for custom hardware implementations that fully leverage the binary/ternary nature and noise-shaped properties of SDQ-LLM quantized models (Xia et al., 27 Sep 2025).
Table: Summary of Key SDQ-LLM Features and Empirical Results
| Feature | Description / Metric | Source |
|---|---|---|
| Lowest bit-width achieved | 1 bit (binary), 1.58 bit (ternary) | (Xia et al., 27 Sep 2025) |
| OSR control | Continuous, real-valued | (Xia et al., 27 Sep 2025) |
| Weight smoothing method | Hadamard blockwise | (Xia et al., 27 Sep 2025) |
| Sensitivity allocation | MultiOSR (variance + size) | (Xia et al., 27 Sep 2025) |
| Polynomial error decay | $(m/k)^{-\alpha(r-1/2)}$ | (Güntürk et al., 2010) |
| LLM PPL improvement | 4 over BiLLM at same bits | (Xia et al., 27 Sep 2025) |
In conclusion, SDQ-LLM generalizes and adapts Sigma–Delta quantization to the requirements of neural network quantization, combining mathematically principled noise-shaping, efficient blockwise implementation, and variance-aware adaptability to achieve state-of-the-art ultra-low-bit compression for large-scale LLMs.