Outlier-aware QMC: Compression & Hardware Co-Design
- These methods handle high-magnitude outliers explicitly during LLM quantization, achieving 4×–8× memory compression with minimal accuracy degradation.
- They co-design fine-grained quantization, memory layout, and hardware microarchitecture to maximize energy efficiency and throughput.
- Benchmark results demonstrate improved perplexity, reasoning accuracy, and dynamic cache compression across diverse transformer models.
Outlier-aware Quantization with Memory Co-design (QMC) refers to a class of algorithm-architecture co-design methods that enable highly compressed and hardware-efficient inference for LLMs and foundation models (FMs) by combining fine-grained quantization schemes with explicit handling and preservation of rare, high-magnitude outlier values. QMC aims to maximize memory savings, bandwidth reduction, and energy efficiency, while maintaining or minimally sacrificing model accuracy, through synergistic innovations in quantization algorithms, memory layout, and hardware microarchitecture.
1. Motivation and Challenges of Outlier-aware Quantization
State-of-the-art transformers and LLMs exhibit heavy-tailed weight and activation distributions, with a small fraction of elements (“outliers”) exhibiting magnitudes several standard deviations above the mean. Naive low-bit-width quantization (e.g., uniform 2-bit or 4-bit integer) incurs disproportionate quantization error for these outliers, leading to significant degradation in perplexity (PPL), accuracy on reasoning/QA, and even instability during long-sequence inference (Su et al., 25 Jan 2025, Ramachandran et al., 2024, Guo et al., 2023).
Existing alternatives fall into two camps:
- Mixed-precision (retaining outliers at high precision, quantizing only inliers): Good accuracy but poor effective bit-width (EBW), leading to wasted memory, non-uniform bandwidth, and hardware inefficiency (Ramachandran et al., 2024).
- Uniform low-precision (all elements quantized equally): Maximum compression and hardware alignment but large accuracy drops at low bit-width due to outlier loss (Su et al., 25 Jan 2025, Trukhanov et al., 2024).
The central challenge is thus to compress model weights, activations, or dynamic caches as aggressively as possible—often to 2–4 bits—without suffering intolerable accuracy loss from outlier quantization error, while ensuring hardware memory and dataflow remain aligned and efficient.
2. Algorithmic Methods for Outlier-aware Quantization
All QMC approaches share the principle of explicit and fine-grained outlier identification, with tailor-made quantization strategies for outliers and inliers. The core workflow consists of:
- Outlier Detection: Classify high-magnitude elements as outliers, typically via a thresholding rule such as the 3σ empirical standard deviation, or by selecting the top k elements by absolute value within a block or channel (Su et al., 25 Jan 2025, Ramachandran et al., 2024).
- Piecewise Quantization:
- Inliers: Quantize with a uniform, low-bit-width quantizer (2–4 bits) with scale/zero-point learned per-channel or per-block (Su et al., 25 Jan 2025, Ramachandran et al., 2024, Pandey et al., 21 Jan 2026).
- Outliers: Store in higher precision (e.g., 8 bits floating, FP16, or mixed-precision micro-scaled float), or use a locally wider quantizer (Ramachandran et al., 2024, Su et al., 25 Jan 2025).
- Memory Alignment and Layout: Compact data such that both inliers and outliers are packed into contiguous, memory-aligned structures, with minimal control or indirection overhead (Guo et al., 2023, Ramachandran et al., 2024).
- Pruning or Victim Reallocation (optional): Prune least-salient weights to free memory for extra outlier bits (Ramachandran et al., 2024), or use a “victim” mechanism, where an inlier adjacent to an outlier gives up its slot (Guo et al., 2023).
- Block/Channel Grouping or Rotation: Employ block floating-point (BFP), permutation (“K-sorting”), or adaptive rotations (Walsh–Hadamard) to rearrange outlier locations, concentrate dynamic range, and further improve quantization fit (Su et al., 25 Jan 2025, Trukhanov et al., 2024).
Quantization and outlier handling are jointly optimized, sometimes via alternating minimization over quantizer step size and block/group selection, subject to a target reconstruction tolerance.
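The detection-plus-piecewise-quantization workflow above can be sketched as follows. This is a minimal NumPy illustration (3σ magnitude threshold, symmetric 4-bit inlier quantizer, outliers kept verbatim at full precision), not the exact scheme of any single cited paper:

```python
import numpy as np

def quantize_outlier_aware(w, bits=4, sigma_k=3.0):
    """Piecewise quantization sketch: inliers get a uniform low-bit
    quantizer; outliers (above a k-sigma magnitude threshold) are
    preserved at full precision in a separate stream."""
    w = np.asarray(w, dtype=np.float32)
    # 1. Outlier detection via a k-sigma magnitude threshold.
    thresh = sigma_k * w.std()
    outlier_mask = np.abs(w) > thresh
    # 2. Symmetric uniform quantizer fitted on inliers only, so the
    #    step size is not inflated by the heavy tail.
    inliers = w[~outlier_mask]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(inliers).max() / qmax if inliers.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    w_hat = q * scale
    # 3. Outliers bypass quantization entirely.
    w_hat[outlier_mask] = w[outlier_mask]
    return w_hat, outlier_mask, scale
```

Because the scale is fitted on inliers alone, inlier error stays bounded by half a quantization step, while the rare outliers incur zero error; real QMC systems additionally compact the outlier stream into an aligned memory layout.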
3. Memory and Hardware Co-Design
QMC targets not just algorithmic compression, but end-to-end systems efficiency via tight coupling of data representation, memory organization, and compute microarchitecture. Notable strategies include:
- Block Floating Point (BFP) and Microscaling: Blocks of activations/weights share an exponent (“scale”), all inlier mantissas are quantized to low bit-width (e.g., 4 bits), and a tiny metadata trail indexes sparse outlier positions and values (Koo et al., 2024, Trukhanov et al., 2024). This format maximizes memory alignment and enables streamlined memory fetches and vectorization.
- Heterogeneous Memory Hierarchies: For on-device and edge inference, QMC partitions inliers (majority of weights) to dense, low-power non-volatile memory (e.g., MLC ReRAM, 2–3 bits per cell), while routing outliers to higher-precision, low-latency MRAM (Pandey et al., 21 Jan 2026). Dedicated memory controllers issue synchronous reads to both banks, then aggregate streams for computation.
- Hardware Datapath Specialization: Integer processing elements (PEs) execute the dense inlier compute, while small floating-point or mixed-precision units (or “ReCoN” NoCs) selectively handle merged outlier flows (Ramachandran et al., 2024, Koo et al., 2024). For KV-cache quantization, quantized keys/values are stored in compressed memory planes, and decompressed only at recall.
- Memory-aligned Outlier Pairing: Outliers replace non-critical inlier (“victim”) values, and both are encoded in the same byte (4-bit OVP) or word, avoiding per-block outlier tables and indirections (Guo et al., 2023).
These co-designs are tuned to balance latency, power, area, and memory footprint, with quantization schedule (block size, outlier ratio), precision assignment, and hardware memories jointly optimized.
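As an illustration of the shared-exponent idea, here is a minimal BFP encode/decode sketch (one power-of-two scale per block, signed low-bit mantissas). The cited BFP/microscaling formats add sparse outlier metadata and hardware-specific packing that are omitted here:

```python
import numpy as np

def bfp_encode(x, block=16, mant_bits=4):
    """Block floating-point sketch: each block of `block` values shares
    one exponent; mantissas are quantized to `mant_bits` signed ints."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, block)
    # Shared exponent: smallest power of two covering the block's max magnitude.
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    exp = np.ceil(np.log2(np.maximum(max_abs, 1e-30))).astype(np.int32)
    qmax = 2 ** (mant_bits - 1) - 1
    scale = (2.0 ** exp) / qmax            # one scale per block
    mant = np.round(x / scale).astype(np.int8)
    return mant, exp

def bfp_decode(mant, exp, mant_bits=4):
    """Reconstruct floats from shared exponents and low-bit mantissas."""
    qmax = 2 ** (mant_bits - 1) - 1
    return mant.astype(np.float32) * (2.0 ** exp) / qmax
```

The per-element error is bounded by half the block's shared step, 2^exp / (2·qmax), which is exactly why a single in-block outlier inflates the error of every neighbor and why QMC formats pull outliers out of the block.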
| Co-Design Strategy | Memory/Energy Effects | Hardware Implication |
|---|---|---|
| BFP/outlier-aware | 2–4× memory reduction, low error | Shared exponents, small outlier metadata |
| MRAM+ReRAM hierarchy | 6–7× memory, 11× energy, 12× latency reduction | On-chip MRAM/MLC ReRAM, parallel fetch |
| OVP pairing | 4–5× speedup/energy savings | Local decode, no indirection |
| Prune-for-outlier-bits | 10–45% extra memory gain, aligned | Permutation/merge logic, PE homogeneity |
4. Rotational, Blockwise, and Token-aware QMC Variants
QMC encompasses various specialized methods tailored to weights, activations, and dynamic attention caches:
4.1 Rotational KV-Cache Quantization
RotateKV (Su et al., 25 Jan 2025) applies:
- Pre-RoPE grouped-head Hadamard rotations: Smooth per-head outlier energy; reduce maximal per-head activation range.
- Channel-reordering + Fast Walsh–Hadamard rotation: High-variance channels are permuted to outlier “sink” indices, enabling explicit detection and absorption. Post-rotation, outliers are restricted to ~2–4 channels per head.
- Sink-aware quantization: Non-sink channels are quantized uniformly to 2 bits (per-channel scale); sink channels are kept as FP16 or 4 bits for dynamic range.
KV-cache memory is thus reduced by ≈4×, with PPL degradation ≤0.1, negligible loss on arithmetic reasoning (GSM8K), and batch-size increases of up to 3×.
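The range-spreading effect of a Walsh–Hadamard rotation can be seen in a minimal sketch. This shows only the orthonormal fast Walsh–Hadamard transform (FWHT); RotateKV's grouped-head rotations, channel reordering, and sink-aware bit allocation are not modeled:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis
    (length must be a power of two). Being its own inverse, the same
    call de-rotates at recall time."""
    x = np.array(x, dtype=np.float32)
    n = x.shape[-1]
    h = 1
    while h < n:
        # Butterfly stage: combine pairs of length-h sub-blocks.
        x = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = np.stack([a + b, a - b], axis=-2)
        x = x.reshape(*x.shape[:-3], n)
        h *= 2
    return x / np.sqrt(n)
```

A single outlier of magnitude m in an n-channel vector becomes n coefficients of magnitude about m/√n after rotation, so a uniform 2-bit quantizer faces a far smaller dynamic range.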
4.2 Token-trace for KV Quantization
OTT (Su et al., 16 May 2025) identifies token-wise outlier keys/values that disproportionately expand channel dynamic range. These tokens are retained at full precision in a sliding group-based pool, while the remainder are quantized. The approach achieves a 6.4× memory reduction and 2.3× higher decode throughput, while recovering up to 1.8 points in normalized accuracy on long-context and reasoning tasks versus tuning-free baselines.
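A simplified, static version of token-wise outlier selection might look like the following. The per-token score here is a hypothetical max-magnitude proxy; OTT's sliding group-based pool and its actual selection criterion are not modeled:

```python
import numpy as np

def split_outlier_tokens(keys, keep=4):
    """Token-wise outlier split sketch: score each token by how far it
    stretches the channel dynamic range, keep the top `keep` tokens at
    full precision, and return the rest for low-bit quantization.
    keys: array of shape [num_tokens, channels]."""
    score = np.abs(keys).max(axis=1)      # crude per-token range proxy
    order = np.argsort(score)[::-1]       # tokens by descending score
    outlier_idx = np.sort(order[:keep])   # stored in FP16
    inlier_idx = np.sort(order[keep:])    # quantized to low bit-width
    return outlier_idx, inlier_idx
```

Removing the few range-expanding tokens from the quantized pool tightens every channel's scale, which is the mechanism behind OTT's accuracy recovery at high compression.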
4.3 Block Quantization and Compile-time Rearrangement
Accurate Block Quantization (Trukhanov et al., 2024) exploits channelwise “K-sort” permutations, compiling high-magnitude rows together so BFP blocks contain similar-magnitude elements—minimizing intra-block dynamic range and quantization error. This method yields ≈4× KV-cache compression at <1% PPL loss, with zero runtime penalty.
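The benefit of grouping similar magnitudes before blockwise quantization can be demonstrated with a small sketch, using an argsort permutation as a stand-in for the paper's compile-time K-sort pass:

```python
import numpy as np

def block_quant_error(v, block=16, bits=4):
    """Total L1 error of BFP-style quantization with one shared scale
    per contiguous block of `block` elements."""
    v = np.asarray(v, dtype=np.float32).reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(v).max(axis=1, keepdims=True) / qmax, 1e-30)
    q = np.round(v / scale)
    return float(np.abs(v - q * scale).sum())

def ksort_gain(x, block=16, bits=4):
    """Compare block quantization error before and after sorting
    elements by magnitude, so similar-magnitude values share a scale."""
    x = np.asarray(x, dtype=np.float32)
    perm = np.argsort(np.abs(x))          # K-sort stand-in
    return block_quant_error(x, block, bits), block_quant_error(x[perm], block, bits)
```

When outliers are scattered, every block inherits a large shared scale; after sorting, only the few blocks actually holding outliers pay that cost, which is why the permutation shrinks total error at zero runtime cost.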
5. Quantitative Evaluation and Empirical Results
Across several benchmarks (LLaMA, Hymba, Qwen, etc.) and a range of QMC instantiations:
- Compression ratios: Typically 4×–8× for weights or cache (2-bit quantization with selective outlier handling), with practical overheads reducing the theoretical maximum by 10–20% to account for sparse outlier metadata.
- Accuracy retention: At (2,8)-bit or (4,8)-bit QMC, PPL increases are Δ≤0.1–2 versus FP16; on causal QA and reasoning, accuracy loss is <1.7% (Su et al., 25 Jan 2025, Ramachandran et al., 2024).
- Throughput and energy: Integer-pipeline and hardware-aligned QMC enables 2–5× higher throughput density (TOPS/mm²) and up to 68% total energy reduction (Koo et al., 2024, Ramachandran et al., 2024, Pandey et al., 21 Jan 2026, Guo et al., 2023).
Selected empirical results (from (Su et al., 25 Jan 2025, Ramachandran et al., 2024, Pandey et al., 21 Jan 2026); the baseline row reports absolute values, QMC rows report changes relative to FP16):
| Method | Compression | PPL (WT2) | Reasoning Accuracy | Energy Reduction | Batch Size Gain |
|---|---|---|---|---|---|
| FP16 Baseline | 1× | 5.32–6.13 | 23.7% (GSM8K) | 1× | 1× |
| QMC (2b, rot.) | 4× | +0.03 | –0.2% | 10–12× | 3–5.75× |
| OVP/OliVe | 4–5× | <2× PPL increase | – | 4–5× | – |
| BFP K-sorted | ~4× | <1% incr. | – | – | – |
| QMC (edge/EMEM) | 6–7× | ~1 | <1.5% | 11–12× | – |
6. Limitations, Trade-offs, and Design Considerations
Despite substantial gains, QMC has several caveats:
- Overhead: Outlier detection/pruning incur preprocessing costs and minor metadata overhead. Some methods require a calibration pass, or sliding window buffer space for groupwise quantization (Su et al., 25 Jan 2025, Su et al., 16 May 2025).
- Hardware area: Mixed-precision or outlier-parallel pipelines add 2–3% logic area compared with integer-only designs (Ramachandran et al., 2024, Guo et al., 2023).
- Robustness: On workloads with extremely rare but clustered outliers, victim pruning or outlier block packing may briefly degrade accuracy or require fallback to higher bit-width logic (Guo et al., 2023).
- Bandwidth contention: For SLM edge deployments, the optimal partitioning between outlier and inlier memories (the MRAM/ReRAM bandwidth split) depends on each memory core's latency and energy characteristics and may require joint hardware/software tuning (Pandey et al., 21 Jan 2026).
Designers are advised to balance outlier ratio, bit allocations, block/group size, and memory hierarchy in accordance with hardware constraints and target task accuracy.
7. Extensions and Future Directions
Recent work suggests several avenues for advancing QMC:
- Joint Quantization across Weights and Dynamic Caches: Extending outlier-aware schemes to cover both static weights and sequence-length-proportional caches under a global, unified memory co-design (Su et al., 25 Jan 2025).
- Learned or Data-driven Rotations: Exploring the use of learned (not fixed Hadamard) rotations for higher quantization fidelity, potentially via quantization-aware fine-tuning (Su et al., 25 Jan 2025).
- Generalization to Other Emerging Memory Technologies: Adapting QMC to cross-layer hybrid memories (e.g., 3D-stacked NVM, future MRAM/ReRAM/PCM), accounting for device-specific error models (Pandey et al., 21 Jan 2026).
- Ultra-low Bit-width and Pruning Synergy: Further combining pruning (via Hessian or saliency statistics) to dynamically distribute outlier-bit budget and approach hardware-theoretical minimum EBW (Ramachandran et al., 2024).
QMC’s algorithm-architecture coupling provides a methodology for scalable, hardware-efficient LLM and FM deployment, with applicability ranging from datacenter to edge and mobile platforms. The robust preservation of rare outliers, efficient quantization, and aligned memory organization together underpin a new generation of high-throughput, low-power AI accelerators (Su et al., 25 Jan 2025, Ramachandran et al., 2024, Koo et al., 2024, Pandey et al., 21 Jan 2026, Guo et al., 2023, Trukhanov et al., 2024, Su et al., 16 May 2025).