Extreme Quantization in Neural Networks

Updated 24 June 2026

Extreme quantization is a neural network technique reducing weights to 1–2 bits, enabling significant compression while preserving near-full-precision performance.
It leverages structured methods like vector, product, and additive quantization to mitigate information loss and prevent feature collapse.
Advanced training strategies, including QAT and maximum entropy coding, ensure efficient deployment on edge devices with tight memory and energy budgets.

Extreme quantization refers to the regime of neural network quantization in which the numeric representations of parameters and/or activations are pushed to their lowest practical bit-widths: typically 1–2 bits per weight (binary or ternary quantization), or vector/matrix codes with an equivalently low bit budget. The term also covers the algorithmic, mathematical, and systems innovations that keep performance close to that of higher-precision models under these severe representational limits. Extreme quantization is motivated by the stringent memory, energy, and inference-speed constraints of deploying large-scale models on edge devices and in other resource-constrained environments. Recent advances demonstrate that with carefully constructed training, quantization, and calibration methods, model accuracy and generalization can remain near-full-precision despite 32×–50× compression.

1. Motivation and Challenges of Extreme Quantization

Extreme quantization is fundamentally motivated by the need to fit powerful neural networks and LLMs onto devices with limited memory, storage, or compute—including mobile CPUs, microcontrollers, and dedicated accelerators. While int8 quantization (8 bits per parameter) is broadly deployed in mainstream deep learning systems, pushing to 4 bits or below introduces a new set of technical bottlenecks, such as information loss per channel, feature and rank collapse, and quantization-induced outliers. At 1–2 bits per parameter, naive scalar quantization leads to catastrophic accuracy degradation, necessitating advanced representational, training, and search methods to preserve performance.

Key empirical observations driving this regime include:

Standard training and post-training quantization fail dramatically below 4 bits due to rank collapse, low entropy in activation representations, or loss of fine-grained weight information (Pang et al., 19 Sep 2025, Egiazarian et al., 2024, Xu et al., 2024).
Structured quantization (product, vector, or additive) or codebook-based schemes are required to maintain model capacity and expressive power at such low bit budgets (Xu et al., 2024, Egiazarian et al., 2024, Shao et al., 2024).
The optimization geometry of quantized-model spaces becomes highly fragmented, causing many local optima and making codebook initialization critical (Kennedy et al., 9 Apr 2026).

2. Scalar, Binary, and Ternary Quantization: Formulations and Implementations

In the scalar domain, extreme quantization operates using the textbook 1-bit (binary) and ternary quantizers, often with per-layer or per-channel scaling factors. For a weight vector $w \in \mathbb{R}^n$, the most common schemes are:

Binary (1-bit) quantization:

$q_{\mathrm{bin}}(w_i) = \alpha\,\mathrm{sign}(w_i),\quad \alpha = \frac{1}{n}\sum_j |w_j|$

Ternary (approx. 1.58 bits):

$q_{\mathrm{tern}}(w_i) = \begin{cases} +\alpha & w_i > +\Delta \\ 0 & |w_i| \leq \Delta \\ -\alpha & w_i < -\Delta \end{cases}$

Here $\Delta = t\,\alpha$ is set per layer, where $t$ is a threshold parameter.
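The two quantizers above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any cited paper's implementation: the function names, the choice of $\alpha$ as the mean absolute weight in the ternary case, and the default $t = 0.7$ are assumptions made for the example.

```python
import numpy as np

def binarize(w):
    """1-bit quantizer: alpha * sign(w), with alpha = mean(|w|)."""
    alpha = np.mean(np.abs(w))
    # np.sign maps exact zeros to 0, so send them to +1 to keep two levels only.
    signs = np.where(w >= 0, 1.0, -1.0)
    return alpha * signs, alpha

def ternarize(w, t=0.7):
    """Ternary quantizer with threshold delta = t * alpha.

    Assumptions: alpha is the mean absolute weight (as in the binary case)
    and t = 0.7 is an illustrative default, not a value from the text.
    """
    alpha = np.mean(np.abs(w))
    delta = t * alpha
    codes = np.zeros_like(w)
    codes[w > delta] = 1.0
    codes[w < -delta] = -1.0
    return alpha * codes, alpha, delta

rng = np.random.default_rng(0)
w = rng.normal(size=1024)

w_bin, alpha_b = binarize(w)
w_ter, alpha_t, delta = ternarize(w)

print("binary levels :", np.unique(w_bin))
print("ternary levels:", np.unique(w_ter))
# Three equiprobable levels carry log2(3) bits, hence "approx. 1.58 bits".
print("bits per ternary weight (upper bound):", np.log2(3))
```

The ternary figure of about 1.58 bits is the information content of a three-level symbol, $\log_2 3$; a packed storage format is needed to approach it in practice.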

Activations are often quantized to 8 bits, using symmetric uniform quantizers (Wu et al., 2022).
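A symmetric uniform quantizer of this kind can be sketched as follows. The per-tensor absolute-maximum calibration and the helper names are assumptions for illustration; real deployments often calibrate the scale differently (for example per token or from running statistics).

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Symmetric uniform quantizer with a single per-tensor scale.

    Assumption: the scale comes from the absolute maximum of x.
    The int8 storage type limits this sketch to bits <= 8.
    """
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)).astype(np.float32)

q, scale = quantize_symmetric(x, bits=8)
x_hat = dequantize(q, scale)
print("max abs error:", float(np.max(np.abs(x - x_hat))), "half step:", scale / 2)
```

Because the grid is symmetric around zero, no zero-point offset needs to be stored, and the rounding error of any in-range value is at most half a quantization step.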

These approaches, deployed via quantization-aware training (QAT) or knowledge distillation, enable BERT-scale transformers to be compressed 50×–60× with only modest degradation (1–2 GLUE points) (Wu et al., 2022). In the spiking LLM domain, these quantization rules are combined with binary spiking activations and equilibrium-based training, leading to >8× energy reduction and ∼32× model size reduction, with ternary weights achieving nearly full-precision accuracy (Bal et al., 2024).

3. Structured Vector and Product Quantization Under Extreme Bit-Budgets

As scalar quantizers become insufficient at ≤2 bits, advanced vector and product quantization methods dominate the extreme regime:

Vector Quantization (VQ) and Product Quantization (PQ) leverage codebook-based representations, where weights are partitioned into subvectors and quantized as indices into learned codebooks (Liu et al., 2024, Shao et al., 2024, Huang et al., 2023).
Additive Quantization and Multi-Codebook Quantization (MCQ) schemes (e.g., AQLM) express weight blocks as sums of codewords from multiple learned codebooks, with assignments optimized to minimize the output or Hessian-weighted loss (Egiazarian et al., 2024, Xu et al., 2024, Kennedy et al., 9 Apr 2026).

A typical post-training VQ objective minimizes per-channel or global reconstruction error (including second-order, Hessian-based approximations in LLMs (Liu et al., 2024)), often with extra refinement for outliers or important channels (Xu et al., 2024, Liu et al., 2024). In diffusion model compression, product quantization is crucial to avoid exponential codebook growth and error accumulation (Shao et al., 2024).
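The codebook idea can be made concrete with a small sketch that splits a weight matrix into subvectors, learns one shared codebook with plain k-means, and stores only indices. It minimizes unweighted reconstruction error; the single shared codebook, the fp16 codebook storage, and all function names are assumptions for illustration and do not reproduce any of the cited methods.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd k-means on the rows of x (unweighted squared error)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:            # an empty cluster keeps its old centroid
                centroids[j] = members.mean(0)
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d2.argmin(1)

def product_quantize(W, sub_dim=4, k=16):
    """Split each row of W into sub_dim-sized chunks and index a shared codebook."""
    rows, cols = W.shape
    assert cols % sub_dim == 0
    codebook, idx = kmeans(W.reshape(-1, sub_dim), k)
    return codebook, idx.reshape(rows, cols // sub_dim)

def reconstruct(codebook, idx):
    """Dequantize by table lookup: each index is replaced by its codeword."""
    return codebook[idx].reshape(idx.shape[0], -1)

def bits_per_weight(shape, sub_dim, k, codebook_bits=16):
    """Index bits plus codebook storage (fp16 assumed), averaged over all weights."""
    n = shape[0] * shape[1]
    index_bits = (n / sub_dim) * np.log2(k)
    table_bits = k * sub_dim * codebook_bits
    return (index_bits + table_bits) / n

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
codebook, idx = product_quantize(W, sub_dim=4, k=16)
W_hat = reconstruct(codebook, idx)

print("relative MSE:", np.mean((W - W_hat) ** 2) / np.mean(W ** 2))
print("bits per weight:", bits_per_weight(W.shape, sub_dim=4, k=16))
```

With 16 codewords over 4-dimensional subvectors, the indices cost $\log_2 16 / 4 = 1$ bit per weight. The codebook overhead is large for this toy 64×64 matrix but shrinks toward zero as the matrix grows, which is why codebook schemes can reach near-1-bit budgets on large layers.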

Proper codebook initialization, such as Hessian-weighted K-means or output-aware EM, is necessary for 2-bit quantization, since naive (greedy) initialization frequently leaves the model in high-perplexity basins that even extensive search or fine-tuning cannot escape (Kennedy et al., 9 Apr 2026). Channel-relaxed schemes (CRVQ) resolve the tightest bottleneck by allocating richer codebooks only to a sparse subset of critical weight channels, delivering >30% reductions in perplexity for near-1-bit PTQ (Xu et al., 2024).
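One simple way to picture importance-aware initialization is k-means in which every coordinate carries a positive weight. The sketch below assumes the Hessian is approximated by its diagonal, so each weight gets one sensitivity value; the random sensitivities, the function name, and this diagonal simplification are illustrative assumptions, not the procedure of the cited work.

```python
import numpy as np

def weighted_kmeans(x, h, k, iters=15, seed=0):
    """Lloyd iterations for sum_i sum_j h[i, j] * (x[i, j] - c[a(i), j])**2.

    x: (n, d) subvectors; h: (n, d) positive importance weights
    (assumed here to be a diagonal Hessian approximation).
    Returns the centroids and the objective recorded at each assignment step.
    """
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    history = []
    for _ in range(iters):
        # Assignment: nearest centroid under the weighted squared distance.
        d2 = (h[:, None, :] * (x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        history.append(float(d2[np.arange(len(x)), assign].sum()))
        # Update: per-coordinate weighted mean of each cluster's members.
        for j in range(k):
            m = assign == j
            if m.any():                     # an empty cluster keeps its old centroid
                centroids[j] = (h[m] * x[m]).sum(0) / h[m].sum(0)
    return centroids, history

rng = np.random.default_rng(0)
x = rng.normal(size=(2048, 4))              # subvectors of a weight matrix
h = rng.uniform(0.1, 10.0, size=x.shape)    # hypothetical per-weight sensitivities

centroids, history = weighted_kmeans(x, h, k=16)
print("weighted objective, first vs last:", history[0], history[-1])
```

Both steps can only lower the weighted objective, so the recorded values never increase. The codewords are pulled toward the coordinates with high sensitivity, which is the intended effect of weighting the initialization.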

4. Quantization-Aware Training, Feature Collapse, and Maximum Entropy Coding

Extreme bit-widths inject strong representational bias, especially during QAT. QAT at 1–4 bits often leads to "feature collapse," where the network learns highly redundant, low-rank activations with low differential entropy (Pang et al., 19 Sep 2025). Explicit regularization is required to prevent collapse and preserve feature diversity:

Maximum Entropy Coding Quantization (MEC-Quant) addresses this by regularizing the activations to maximize a coding-length surrogate of entropy, based on a Taylor expansion of the rate-distortion theoretic minimum code length:

$L(\mathbf{Z}) = \mu \log\det\left( \mathbf{I}_m + \frac{d}{m\epsilon^2} \mathbf{Z}^\top\mathbf{Z} \right)$

A tractable Mixture-of-Experts approximation is used to handle long-tailed activation spectra (Pang et al., 19 Sep 2025).

The resulting loss is smoothly combined with the downstream task loss, and a curriculum is applied to ramp the strength of the entropy surrogate over training epochs.
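To see why the coding-length term penalizes collapse, the following sketch evaluates $L(\mathbf{Z})$ exactly with a log-determinant (no Taylor expansion or mixture approximation) on diverse and on fully collapsed features. The column layout of $\mathbf{Z}$ ($d \times m$, one sample per column), the unit-norm features, and the values $\mu = 1$ and $\epsilon = 0.5$ are assumptions chosen for the example.

```python
import numpy as np

def coding_length(Z, eps=0.5, mu=1.0):
    """mu * log det(I_m + d / (m * eps^2) * Z^T Z) for Z of shape (d, m).

    Assumptions: columns of Z are samples; mu and eps are illustrative values.
    """
    d, m = Z.shape
    gram = Z.T @ Z                                        # (m, m) sample Gram matrix
    _, logdet = np.linalg.slogdet(np.eye(m) + (d / (m * eps ** 2)) * gram)
    return mu * logdet

def unit_columns(Z):
    """Normalize every column (sample) to unit l2 norm."""
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

rng = np.random.default_rng(0)
d, m, eps = 16, 32, 0.5

Z_diverse = unit_columns(rng.normal(size=(d, m)))         # samples span many directions
z = unit_columns(rng.normal(size=(d, 1)))
Z_collapsed = np.tile(z, (1, m))                          # every sample is the same vector

L_diverse = coding_length(Z_diverse, eps=eps)
L_collapsed = coding_length(Z_collapsed, eps=eps)
print("diverse  :", L_diverse)
print("collapsed:", L_collapsed)
```

For fully collapsed unit-norm features the Gram matrix is all ones, so the value reduces to $\mu \log(1 + d/\epsilon^2)$, the smallest value attainable for that total feature energy. Maximizing the term therefore pushes the activations toward a higher-rank, more diverse spectrum.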

MEC-Quant closes, or even reverses, the performance gap to full-precision networks at bit widths down to 2/2 (weights/activations), shows lower Hessian curvature (suggesting better generalization), and sets a new state of the art for extreme QAT (Pang et al., 19 Sep 2025).

5. Innovations Addressing Quantization Pathologies

Several structural challenges are endemic to the extreme quantization regime:

Activation Outliers and Quantization Collapse: Transformer architectures produce catastrophic activation outliers that break FP8 and W8A8 training or inference. The TWEO (Transformers Without Extreme Outliers) regularizer addresses this via an $L^4$ penalty on out-of-range activations, reducing outliers from 10,000+ to <20, thereby enabling true FP8 training (with throughput +36% over BF16) and making W8A8 per-tensor quantization viable (Liang et al., 28 Nov 2025).
Token Norm Imbalance (TNI): Key/Value cache quantization suffers from structural token norm disparities, causing per-channel quantization to amplify errors when forced to use a single step-size over disparate tokens. OScaR resolves TNI via a fixed Hadamard rotation ("canalized rotation") to redistribute energy across channels, and then per-token ℓ₂-rescaling, enabling near-lossless INT2 quantization with no special training (Su et al., 19 May 2026).
Empty Clusters and Search Failures: Product quantization with Quant-Noise (iPQ+QN) fails in the extreme regime due to rampant empty clusters, leading to severe accuracy degradation. Partitioning-guided K-means pre-assignment and dynamic empty-cluster resolution guarantee that all clusters are filled, reducing the number of empty clusters by ≈100× and recovering up to 12 points on GLUE with no memory overhead (Huang et al., 2023).
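The rotate-then-rescale idea described for KV-cache quantization can be illustrated with a fixed Hadamard matrix. This is a sketch of the general mechanism only, not OScaR's actual implementation: the Sylvester construction, the synthetic outlier channel, and the crest-factor diagnostic are assumptions made for the example.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def crest_factor(x):
    """Peak-to-RMS ratio per token; large values mean a few channels dominate."""
    return np.max(np.abs(x), axis=-1) / np.sqrt(np.mean(x ** 2, axis=-1))

rng = np.random.default_rng(0)
d = 64
keys = rng.normal(size=(8, d))      # 8 tokens, d channels
keys[:, 0] += 100.0                 # synthetic outlier channel

H = hadamard(d)
rotated = keys @ H                  # spread the outlier energy over all channels
norms = np.linalg.norm(rotated, axis=-1, keepdims=True)
rescaled = rotated / norms          # per-token l2 rescaling; norms are stored separately

# Both steps are invertible, so the original keys can be recovered exactly.
recovered = (rescaled * norms) @ H.T

print("crest factor before:", crest_factor(keys).mean())
print("crest factor after :", crest_factor(rotated).mean())
```

Because the rotation is orthogonal it preserves each token's norm and inner products. After rotation and rescaling, every token has unit norm and a flat channel profile, so a shared quantization step size is no longer dominated by a single channel or a single high-norm token.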

6. Training Strategies and Hardware Integration

Extreme quantization strategies span several mechanisms to maximize quality and efficiency:

Quantization Noise Training (QNT): Instead of quantizing all weights in every forward pass (as in vanilla QAT), QNT randomly quantizes only a fraction $p$ of them. The remaining weights keep unbiased gradients, which sharply reduces the bias introduced by the straight-through estimator (STE); this matters most under product quantization or aggressive scalar rounding (Fan et al., 2020).
Orthogonal and Butterfly Transform Preconditioning: Incoherent mixing and adaptive rotation schemes, such as randomized or learned Hadamard transforms (RHT/HARP), redistribute input and weight variance, maximizing per-block entropy for blockwise quantization. Adaptive variants like HARP optimize the rotation using only calibration data, strictly preserving full-precision equivalence, and yield significant perplexity and accuracy gains in LLMs at 2–4 bits (Zagitov et al., 28 May 2026).
Efficient Inference Kernels: Inference for low-bit formats relies on vectorized lookup-table (LUT) dequantization (Kennedy et al., 9 Apr 2026), structured low-rank factorizations (e.g., binary GEMMs in MDBF (Ichikawa et al., 31 Dec 2025)), and careful memory layout that matches accelerator primitives (Xu et al., 2024, Su et al., 19 May 2026).
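The forward pass of quantization-noise training can be sketched as a random mask over the weights. The example below applies the mask per weight and uses a sign quantizer as the stand-in quantizer; both choices, and the function names, are assumptions for illustration (the original method may quantize blocks rather than individual weights).

```python
import numpy as np

def sign_quantize(w):
    """Stand-in quantizer for the example: alpha * sign(w) with alpha = mean(|w|)."""
    return np.mean(np.abs(w)) * np.where(w >= 0, 1.0, -1.0)

def quant_noise_forward(w, quantize, p=0.5, rng=None):
    """Quantize a random fraction p of the weights and leave the rest untouched.

    In a training framework, the untouched weights would receive exact gradients,
    while only the masked ones would rely on a straight-through estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(w.shape) < p
    return np.where(mask, quantize(w), w), mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_noisy, mask = quant_noise_forward(w, sign_quantize, p=0.3, rng=rng)

print("fraction quantized:", mask.mean())
```

A fresh mask is drawn at every forward pass, so over training each weight is exposed to quantization while most gradient signal in any single step remains exact. At inference time all weights are quantized.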

7. Empirical Results and Practical Recommendations

Extreme quantization’s empirical effectiveness is demonstrated across task domains and architectures:

Model/Domain	Bit-width (W/A)	Compression	Reference	Reported result (accuracy, perplexity, or FID)
BERT/GLUE	1b/8b	50–60×	(Wu et al., 2022)	~81.3–81.7 (avg)
SpikingLM/GLUE	1–1.58b	32× weights	(Bal et al., 2024)	2–3pp from FP
ResNet-18/CIFAR10	2/2, 4/4, FP	–	(Pang et al., 19 Sep 2025)	88.51 vs 88.72
LLaMA-2 7B/QA	~2bpp	~8–16×	(Liu et al., 2024, Egiazarian et al., 2024)	58.2%, 6.13 (PPL)
LLaMA2-7B/QA/Wiki	~1.07bpp	>30×	(Xu et al., 2024)	13.01 (PPL)
DiT/ImageNet	1–2b	14.9–28.7×	(Shao et al., 2024)	FID 6.8–14.0
LLM KV Cache	INT2	5.3× KV	(Su et al., 19 May 2026)	<1pp degradation

Best practices across leading works include: careful long-schedule QAT with single-stage KD (Wu et al., 2022); entropy/coding-length regularization (Pang et al., 19 Sep 2025); calibration-based, Hessian- or output-aware codebook initialization (Kennedy et al., 9 Apr 2026, Liu et al., 2024, Xu et al., 2024); and orthogonal preprocessing (Zagitov et al., 28 May 2026, Pavlov, 24 May 2026).

Notably, when these methods are appropriately tuned and applied, accuracy losses relative to full precision are often limited to 1–2 perplexity points or 1–3 percentage points, even at 2 bits (or, in some cases, 1 bit).

References:

(Wu et al., 2022, Egiazarian et al., 2024, Ichikawa et al., 31 Dec 2025, Pang et al., 19 Sep 2025, Xu et al., 2024, Shao et al., 2024, Liu et al., 2024, Kennedy et al., 9 Apr 2026, Fan et al., 2020, Huang et al., 2023, Su et al., 19 May 2026, Liang et al., 28 Nov 2025, Bal et al., 2024, Pavlov, 24 May 2026, Zagitov et al., 28 May 2026)