Outlier-Preserving Quantization (OPQ)

Updated 13 November 2025
  • OPQ is a quantization technique that isolates and preserves rare, extreme weight values (outliers) to prevent dynamic range inflation in neural networks.
  • It leverages blockwise mixed-precision, statistical and quantile-based outlier detection, and function-preserving transforms to maintain high fidelity.
  • OPQ integrates hardware-aware strategies like index coding and adaptive micro-block encoding to deliver significant storage, compute, and energy savings.

Outlier-Preserving Quantization (OPQ) encompasses a family of algorithmic and hardware techniques for quantizing neural network weights and activations while explicitly addressing the extreme values (outliers) that would otherwise severely degrade low-bit quantization accuracy. By identifying outliers and either reducing their influence on blockwise scaling or allocating higher-precision storage to them, OPQ achieves both aggressive compression and strong empirical accuracy across large models, often approaching full-precision baselines with 2× or greater savings in storage and compute. OPQ methods trace their conceptual lineage to explicit mixed-precision quantization, blockwise quantization with dynamic outlier handling, and memory- or throughput-constrained accelerator design.

1. Outlier Identification and Notational Framework

The central technical challenge that OPQ addresses is the "outlier problem": the presence of rare, large-magnitude weights or activations inflates quantization ranges, especially in blockwise and uniform quantization. Outliers are defined operationally in terms of their deviation from typical scale within a quantization block (e.g., row, channel, or token group). Canonical identification criteria include:

  • Channel Norms or Blockwise Maxima: For a block $\mathbf{x} = [x_1, \dots, x_n]^T$, the outlier criterion is $|x_i| \gg \text{median}_j|x_j|$ within the block or channel. In key matrices of LLMs, a channel $j$ may be classified as "outlier-rich" if $r_j = \|\mathbf{W}_K[j,:]\|_2 \gg \text{median}_k\|\mathbf{W}_K[k,:]\|_2$, often using a percentile threshold (e.g., above the 95th percentile) (Trukhanov et al., 29 Mar 2024).
  • Quantile-based Thresholds: For quantization schemes such as ICQuant, outliers are identified as entries exceeding a quantile-based threshold, e.g., the top 5% of absolute values in each row or block (using the $(1-\gamma)$ quantile of $|w|$ with $\gamma = 0.05$) (Li et al., 1 May 2025).
  • Statistical Definitions (Mean and Standard Deviation): An outlier $w$ satisfies $|w - \mu| > k\,\sigma$ for macro-block mean $\mu$ and standard deviation $\sigma$, typically with $k = 3$ (Ramachandran et al., 8 Nov 2024, Guo et al., 2023).
  • Hessian-based Sensitivity: For mixed-precision schemes, outliers are columns whose quantization would disproportionately contribute to activation-space error, as determined by a Hessian-aware sensitivity metric (Lee et al., 2023).
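
The quantile, standard-deviation, and channel-norm criteria above are straightforward to express in code. The following is a minimal NumPy sketch; the 5% quantile, $k = 3$, and 95th-percentile cutoff are illustrative defaults taken from the descriptions above, not a reference implementation of any single cited paper.

```python
import numpy as np

def outlier_mask_quantile(block: np.ndarray, gamma: float = 0.05) -> np.ndarray:
    """Flag the top-gamma fraction of entries by magnitude (quantile criterion)."""
    threshold = np.quantile(np.abs(block), 1.0 - gamma)
    return np.abs(block) > threshold

def outlier_mask_sigma(block: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag entries lying more than k standard deviations from the block mean."""
    mu, sigma = block.mean(), block.std()
    return np.abs(block - mu) > k * sigma

def outlier_rich_channels(W_K: np.ndarray, percentile: float = 95.0) -> np.ndarray:
    """Flag rows (channels) of a key matrix whose L2 norm exceeds a percentile cutoff."""
    row_norms = np.linalg.norm(W_K, axis=1)
    return row_norms > np.percentile(row_norms, percentile)
```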

2. Outlier-Preserving Quantization Algorithms

OPQ methodologies can be grouped by their block granularity and handling strategy:

a. Blockwise Mixed-Precision (Selective High-Precision Storage)

  • BOF4-S + OPQ: In blockwise quantization (e.g., BOF4-S), weights exceeding a quantile-derived outlier threshold are stored in 16 bits (bfloat16), while the remainder are quantized with a 4-bit optimal float codebook. Outliers are removed prior to block normalization to prevent spurious scale inflation, and their indices are compressed using compact pointer arrays. This hybrid approach incurs <1% memory overhead per block (e.g., block size $I = 64$, quantile $q = 0.95$) and achieves state-of-the-art perplexity among 4-bit quantizers (Blumenberg et al., 10 May 2025).
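
A compressed sketch of this strategy, assuming a flat weight block, a per-block quantile threshold, and a plain uniform 4-bit grid standing in for the BOF4-S optimal float codebook; the pointer-array compression of outlier indices is reduced here to a raw index array.

```python
import numpy as np

def quantize_block_with_outliers(block: np.ndarray, q: float = 0.95, bits: int = 4):
    """Keep above-threshold entries in higher precision and quantize the rest
    with a scale computed from inliers only, so outliers cannot inflate it."""
    thresh = np.quantile(np.abs(block), q)
    outlier_idx = np.flatnonzero(np.abs(block) > thresh)
    outliers = block[outlier_idx].astype(np.float16)        # stand-in for bfloat16 storage

    inliers = np.delete(block, outlier_idx)
    scale = np.abs(inliers).max() / (2 ** (bits - 1) - 1)   # scale set by inliers only
    codes = np.round(inliers / scale).astype(np.int8)       # uniform stand-in for the BOF4 codebook
    return codes, scale, outlier_idx, outliers
```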

b. Permutation and Block Sorting

  • Channel-wise Grouping (“K-sort”): Prior to quantization, rows of key and value matrices are permuted so that outlier-heavy channels are grouped together within blocks, significantly narrowing block dynamic range and improving quantization fidelity (notably in BFP12 quantization for LLM KV-caches) (Trukhanov et al., 29 Mar 2024).
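
A minimal sketch of norm-based channel grouping before blockwise quantization. Sorting rows by norm is one simple way to realize the grouping described above; the actual permutation used in the cited work may differ in detail.

```python
import numpy as np

def group_channels_by_norm(K: np.ndarray):
    """Permute rows (channels) so that channels of similar magnitude fall into
    the same quantization blocks, shrinking each block's dynamic range."""
    order = np.argsort(np.linalg.norm(K, axis=1))   # ascending channel norm
    return K[order], order                          # keep the permutation to undo it later
```

After quantizing `K[order]` blockwise, the stored permutation (or its inverse) restores the original channel order at lookup time.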

c. Explicit Outlier Codebooks and Index Coding

  • Index Coding (ICQuant): Outliers are encoded and quantized separately from inliers, with their positions recorded efficiently via gap-based index coding. This enables quantizing both subpopulations over a halved dynamic range at the same bitwidth $n$, effectively matching the quantization error of $n+1$ bits while incurring only $\approx 0.3$ bits of overhead per weight for index coding (with a typical $\gamma = 5\%$ outlier ratio and $b = 5$ index bits). This closes the 1-bit dynamic range gap in compression-critical regimes (e.g., 2.3 bits/weight for Llama-3 70B at no-tune PTQ, near baselines) (Li et al., 1 May 2025).
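
A toy illustration of gap-based position coding, assuming sorted outlier positions and a fixed gap bitwidth; overflow handling for gaps exceeding $2^b - 1$, which a real coder would need, is omitted here.

```python
import numpy as np

def encode_gaps(positions: np.ndarray) -> np.ndarray:
    """Store outlier positions as gaps between consecutive (sorted) indices;
    gaps are small and cheap to code when outliers are roughly uniform."""
    positions = np.sort(positions)
    return np.diff(positions, prepend=0)

def decode_gaps(gaps: np.ndarray) -> np.ndarray:
    """Recover absolute positions from the gap sequence."""
    return np.cumsum(gaps)

# Example: positions [3, 17, 40] -> gaps [3, 14, 23] -> back to [3, 17, 40].
```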

d. Adaptive Preprocessing: Fine-Tuning and Function-Preserving Transforms

  • Outlier-Regularized Fine-Tuning: Additional loss terms penalize large deviations between maximal and median activations within each channel or tensor, effectively discouraging dynamic range amplification by outliers while minimally altering standard gradient flow. Penalties based on standardized outlier deviation can be tightly integrated into backpropagation with minimal overhead (Chen et al., 11 Mar 2024).
  • Function-Preserving Block Transforms: Orthogonal (blockwise or full), permutation, and scaling transforms (e.g., as in FPTQuant and DuQuant) redistribute and attenuate both normal and massive outliers before quantization. Transform parameters are either optimized locally (on a calibration set to directly contract outlier values) or globally (end-to-end, seeking quantized outputs close to full-precision via student-teacher minimization). These transforms can be merged (“folded”) into network weights, incurring no inference cost (Breugel et al., 5 Jun 2025, Lin et al., 3 Jun 2024).
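
A minimal sketch of a function-preserving rotation of the kind these transforms use, with a random orthogonal matrix standing in for the optimized or structured transforms of FPTQuant/DuQuant: right-multiplying the weights by $\mathbf{Q}$ and pre-rotating the inputs by $\mathbf{Q}^T$ leaves the layer output unchanged while redistributing outlier mass across coordinates.

```python
import numpy as np

def random_orthogonal(n: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def fold_rotation(W: np.ndarray, x: np.ndarray):
    """Function-preserving rotation: (W @ Q) @ (Q.T @ x) == W @ x, but the
    rotated weight W @ Q typically has a flatter magnitude distribution."""
    Q = random_orthogonal(W.shape[1])
    W_rot, x_rot = W @ Q, Q.T @ x
    assert np.allclose(W_rot @ x_rot, W @ x)   # output is unchanged
    return W_rot, x_rot
```

In practice the transforms are optimized or structured rather than random, and the activation-side rotation is folded into the preceding layer, so there is no inference-time cost, as noted above.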

e. Hardware/Accelerator-Aware OPQ

  • Micro-block and OVP Encoding: Several schemes, notably MicroScopiQ, OPAL, and OliVe, operate on micro-blocks (e.g., 8 or 128 elements): the largest-magnitude entries (outliers) are microscaled or encoded as higher-precision values. Pruning or paired compression of less salient (victim) entries enables storage of outlier mantissas and efficient alignment with accelerator hardware (Ramachandran et al., 8 Nov 2024, Koo et al., 6 Sep 2024, Guo et al., 2023).
  • On-the-fly Outlier Handling (OverQ): Hardware-based dynamic outlier detection leverages zero activations to temporarily “borrow” bits for encoding outliers within fixed-point quantization (range-overwrite/precision-overwrite). This is accomplished with additional datapath logic, exploiting structural sparsity (as in post-ReLU layers) for >90% outlier coverage at negligible (<1%) area cost (Zhao et al., 2019).
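
A simplified, software-level sketch of the micro-block idea described in the first item above: within each small block, keep the single largest-magnitude entry at higher precision and quantize the rest to low bits. It assumes the tensor length is a multiple of the micro-block size and abstracts away the pruning, victim pairing, and alignment machinery that MicroScopiQ, OPAL, and OliVe implement in hardware.

```python
import numpy as np

def microblock_encode(x: np.ndarray, block: int = 8, bits: int = 4):
    """Per micro-block: store the top-magnitude entry in fp16, quantize the
    remaining entries with a scale computed from the non-outlier values."""
    x = x.reshape(-1, block)
    out_pos = np.abs(x).argmax(axis=1)                 # one outlier per micro-block
    rows = np.arange(x.shape[0])
    outliers = x[rows, out_pos].astype(np.float16)

    inliers = x.copy()
    inliers[rows, out_pos] = 0.0                       # exclude outlier from the scale
    scales = np.abs(inliers).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0
    codes = np.round(inliers / scales * (2 ** (bits - 1) - 1)).astype(np.int8)
    return codes, scales, out_pos, outliers
```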

3. Mathematical Underpinnings and Error Analysis

OPQ improves quantization fidelity by:

  • Dynamic Range Reduction: For any block, removing or isolating outliers narrows the quantization range for the inliers. For instance, when the block dynamic range decreases by a factor of 2, an $n$-bit quantizer over the reduced range has step size $\Delta_1 = (R/2)/(2^n - 1) \approx R/(2^{n+1} - 1)$, i.e., approximately the step size $\Delta_0$ of a naive $(n+1)$-bit quantizer over the full range $R$ (Li et al., 1 May 2025). Relative error bounds for BFP or blockwise quantization schemes become strictly tighter after outlier removal or grouping.
  • Distributional Matching: By matching codebooks to inlier distributions and detaching outliers from normalization/scale calculation, blockwise OPQ ensures that quantization error accumulation is not dominated by single large values, and that codebook density is well-aligned with the "typical" core data.
  • Sensitivity-weighted Error: For OWQ, the error attributed to the remaining columns is $E_i \approx \sum_{j \notin O} \lambda_j \|\Delta W_{:,j}\|_2^2$, with $\lambda_j$ the activation covariance; mixed-precision allocation then focuses the bit budget where quantization is most damaging (Lee et al., 2023).
  • Information-Theoretic Overhead: For uniformly random outlier positions, the index-coding overhead is precisely analyzable (ICQuant: $E[B] \leq \gamma b \left(1 + 1/(e^{\gamma(2^b - 1)} - 1)\right)$) and is typically sublinear in the outlier rate.
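
The dynamic range argument above can also be checked numerically. The following sketch compares mean-squared quantization error on a synthetic heavy-tailed block with and without isolating the top 5% of magnitudes; the distribution, seed, and symmetric uniform quantizer are illustrative choices, and the exact numbers will vary.

```python
import numpy as np

def uniform_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantizer scaled to the max magnitude of x."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
block = rng.standard_t(df=3, size=4096)                   # heavy-tailed synthetic weights

naive_err = np.mean((block - uniform_quantize(block, 4)) ** 2)

mask = np.abs(block) > np.quantile(np.abs(block), 0.95)   # isolate top-5% outliers
inliers = block[~mask]
opq_err = np.mean((inliers - uniform_quantize(inliers, 4)) ** 2)  # outliers stored exactly

print(f"4-bit MSE, naive: {naive_err:.4e}  vs inliers after outlier isolation: {opq_err:.4e}")
```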

4. Hardware Implications and Accelerator Integration

OPQ variants are designed with hardware-awareness for deployment in low-power, high-throughput inference:

  • Block Floating-Point Tensor Units: BFP quantization with compile-time outlier grouping (as in K-sort) preserves hardware efficiency, as required reference scaling and computation paths are unchanged from standard BFP units (Trukhanov et al., 29 Mar 2024).
  • Mixed-Precision Paths: Implementations that bifurcate outlier and inlier processing (e.g., FP lanes for outliers, INT lanes for the rest, as in OPAL) maintain vector alignment and minimize memory fragmentation. Bit-pruning and slot-permutation (MicroScopiQ) distribute outlier bits without disrupting block alignment, yielding $>3\times$ acceleration and $2\times$ energy savings (Ramachandran et al., 8 Nov 2024, Koo et al., 6 Sep 2024).
  • Minimal Runtime Cost: Most OPQ identification and grouping occurs at compile or model load time, with marginal one-off overhead. Index decoding and permutation operations at inference are minor compared to overall computation.
  • Local Encoding (OliVe): Outlier-Victim-Pair encoding utilizes strictly local (pairwise) codes, avoiding costly global sparsity pointers and facilitating low-latency SIMD and accelerator integration (Guo et al., 2023).

5. Empirical Performance and Comparative Evaluations

State-of-the-art OPQ approaches consistently achieve near-baseline perplexity and classification accuracy under aggressive compression:

| Method | Bitwidth / Block | Overhead | PTQ/Fine-Tune | LLM Perplexity/Accuracy | Reference |
|---|---|---|---|---|---|
| BOF4-S + OPQ | 4-bit, blockwise | <1% memory | PTQ | 8.43 (Llama-3.1 8B), lowest among 4b PTQ | (Blumenberg et al., 10 May 2025) |
| K-sort OPQ (BFP12) | 4.25 b/elem, B=32 | None | PTQ | 9.52 (WikiText-2), ≈FP16 accuracy | (Trukhanov et al., 29 Mar 2024) |
| ICQuant | 2.3 bits/weight | 0.3 bits/weight | PTQ | 5.65 (Wiki2, Llama3-70B 2b), SOTA PTQ | (Li et al., 1 May 2025) |
| OPAL (activations) | 4b/7b, 128 block | 2.7% over MXINT8 | PTQ | +0.33 ppl vs W4A16, 1.6–2.2× eff. | (Koo et al., 6 Sep 2024) |
| SplitQuant | INT2/3 branches | 2–3× params | PTQ | +3.3 pp INT2 BERT-Tiny vs INT2 base | (Song et al., 21 Jan 2025) |
| DuQuant | 4b blockwise | ≈1.5–9% comp. | PTQ | 6.40 ppl, +5 pp over baseline W4A4 | (Lin et al., 3 Jun 2024) |
| OverQ (ASIC) | 4b + RO/PR | <1% area | PTQ | +5.85 pp ImageNet @ 4b, >90% outlier coverage | (Zhao et al., 2019) |

Fine-tuning methods (e.g., QuantTune) close over 75% of the accuracy gap to full precision with negligible training/inference overhead. On transformers, OPQ closes the gap to baseline 4/7/8-bit quantization by absolute margins as large as +33.8% on vision tasks and +18.84% on ViT models relative to standard PTQ baselines (Chen et al., 11 Mar 2024).

6. Limitations and Open Challenges

Despite broad empirical success, OPQ methodologies present tradeoffs:

  • Model Expansion vs. Storage Overhead: Techniques such as SplitQuant triple parameter matrices for each processed layer but partially recoup this via sparse encoding. MicroScopiQ and index-coding OPQ maintain fixed storage budgets per block via bit reallocation and intelligent permutation (Song et al., 21 Jan 2025, Ramachandran et al., 8 Nov 2024, Li et al., 1 May 2025).
  • Latency and Implementation Complexity: Run-time decoding (for pointer indices, outlier blocks) requires minimal but non-zero time, and hardware co-designs must balance complexity/area against accuracy gains (e.g., OverQ: 8–10% mux/shifter overhead at the PE level, but 0.5% total area impact) (Zhao et al., 2019).
  • Distributional Assumptions: Some index-coding schemes (ICQuant) rely on outlier position uniformity; future model architectures or distributions may require adaptive permutation or alternate coding to maintain efficacy (Li et al., 1 May 2025).
  • Applicability Beyond Weights: Most designs target weights, KV-cache, or activations. Dynamic data (e.g., streaming activation quantization) poses additional challenges, especially for index-coding or victim-pair logic.
  • Scaling to Ultra-Low Bits: At 1–1.5 bits per weight, index-coding overhead rises and global clustering/pairing may require further innovation.

7. Significance in Quantization Landscape and Future Directions

OPQ frameworks offer a unifying perspective on mitigating dynamic range inflation in quantized deep neural networks. The field demonstrates a rich spectrum of approaches, from mixed-precision and direct index coding, through function-preserving transforms, to hardware-software co-design. Research continues to refine index efficiency, inlier/outlier tradeoffs, and dynamic block formation (e.g., learnable block sizes, data-driven outlier ratios). Integration of OPQ into end-to-end pipelines offers immediate gains in model deployment, especially for LLMs and large vision backbones under severe memory, power, or latency constraints.

Open questions remain around per-layer selection of the outlier ratio, automated calibration across diverse architectures, and generalization of OPQ concepts to on-the-fly streaming contexts or dynamically structured models. Theoretical characterization of optimal block rearrangement or codebook allocation for nonstationary or temporally varying distributions is another active frontier.

OPQ approaches have demonstrated, across multiple research groups and hardware platforms, the ability to sustain near-baseline accuracy under aggressive low-bit quantization while ensuring practical deployment feasibility for massive neural models in both datacenter and edge settings.
