
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Published 19 May 2026 in cs.LG and cs.CL | (2605.19660v1)

Abstract: The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves a notable up to 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.

Summary

  • The paper presents a dual-stage framework, OScaR, that mitigates token norm imbalance using canalized rotation and omni-token scaling for extreme low-bit KV cache quantization.
  • It employs a training-free, CUDA-optimized implementation with fused Hadamard-norm kernels to achieve high efficiency, robust accuracy, and reduced memory footprint in both text-only and multi-modal LLMs.
  • Empirical results demonstrate up to a 3.0x decoding speedup and a 5.3x memory reduction, positioning OScaR on the Pareto front for balancing accuracy and efficiency in LLM inference.

OScaR: A Framework for Extreme KV Cache Quantization in X-LLMs

Motivation and Problem Statement

The increasing adoption of long-context and multi-modal LLMs mandates efficient memory management strategies, as the memory footprint of the attention Key-Value (KV) cache scales linearly with sequence length and quickly becomes a primary bottleneck in both inference throughput and deployment scalability. Extreme low-bit KV cache quantization is a promising avenue for reclaiming memory efficiency, but it is fundamentally challenged by the presence of both channel-wise and token-wise outliers, notably Token Norm Imbalance (TNI), which the paper rigorously identifies as the structural impediment to high-fidelity per-channel quantization (2605.19660).
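The linear scaling mentioned above can be made concrete with a little arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration rather than one taken from the paper, and the function counts raw Key and Value storage without any quantization metadata.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Raw bytes for Keys and Values, ignoring scales, zero points, and norms."""
    # Factor 2 accounts for storing both the Key and the Value tensor.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16_128k = kv_cache_bytes(seq_len=128 * 1024, bits_per_value=16, **shape)
int2_128k = kv_cache_bytes(seq_len=128 * 1024, bits_per_value=2, **shape)

print(f"BF16 at 128K tokens: {bf16_128k / 2**30:.1f} GiB")
print(f"INT2 at 128K tokens: {int2_128k / 2**30:.1f} GiB (payload only)")
print(f"ideal ratio: {bf16_128k / int2_128k:.1f}x")
```

Under these assumptions the payload shrinks by the ideal factor of 8 when going from 16 to 2 bits. The 5.3x memory reduction reported for OScaR is lower than that ideal, which would be consistent with extra storage for quantization parameters and other metadata, although this summary does not break the figure down.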

Analysis of Token Norm Imbalance (TNI)

TNI denotes substantial intra-channel token norm variance. It manifests as consistent low-norm outlier tokens ("attention sinks") or, in multi-modal contexts, as broader norm disparities, both within modality sequences and across modalities. Empirical and theoretical analyses reveal that per-channel quantization schemes fail under TNI because shared quantization parameters must span tokens with widely divergent norms. This amplifies quantization error, especially at extreme bit-widths, undermining accuracy and model robustness in both text-only and multi-modal LLMs. The paper substantiates this with systematic norm profiling, rigorous error derivations, and quantitative evaluation across various LLM architectures.
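A small synthetic experiment can illustrate the mechanism. This is a sketch under stated assumptions, not the paper's analysis: the keys are random Gaussian vectors, token 0 is artificially shrunk to play the low-norm outlier, and `quantize_per_channel` is a hypothetical helper implementing plain asymmetric INT2 quantization with one scale and zero point per channel shared by all tokens in the group.

```python
import math
import random


def quantize_per_channel(keys, bits=2):
    """Quantize and dequantize with one (scale, offset) per channel,
    shared by every token in the group. `keys` is a list of token vectors."""
    levels = (1 << bits) - 1
    num_channels = len(keys[0])
    restored = [[0.0] * num_channels for _ in keys]
    for c in range(num_channels):
        column = [tok[c] for tok in keys]
        lo, hi = min(column), max(column)
        scale = (hi - lo) / levels or 1.0   # guard against a constant channel
        for t, x in enumerate(column):
            code = round((x - lo) / scale)  # integer code in [0, levels]
            restored[t][c] = code * scale + lo
    return restored


def relative_error(original, approx):
    """L2 norm of the error divided by the L2 norm of the original vector."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, approx)))
    den = math.sqrt(sum(a * a for a in original))
    return num / den


random.seed(0)
num_tokens, num_channels = 32, 64
keys = [[random.gauss(0.0, 1.0) for _ in range(num_channels)]
        for _ in range(num_tokens)]
keys[0] = [0.05 * v for v in keys[0]]       # token 0: synthetic low-norm token

restored = quantize_per_channel(keys, bits=2)
errors = [relative_error(k, r) for k, r in zip(keys, restored)]
low_err = errors[0]
typical_err = sum(errors[1:]) / (num_tokens - 1)
print(f"low-norm token: {low_err:.2f}  other tokens (mean): {typical_err:.2f}")
```

Because the step size of each channel is set by the large-norm tokens, the absolute rounding error is roughly the same for every token, so the token with the small norm ends up with a much larger relative error.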

Methodological Contributions: OScaR Framework

OScaR (Omni-Scaled Canalized Rotation) is introduced to directly address TNI in a lightweight, training-free manner, guided by Occam's Razor for minimal auxiliary overhead. The approach proceeds in two stages:

  • Canalized Rotation: The Hadamard Transform is applied online to Keys (and Queries), redistributing outlier channel energy and preventing scaling-induced artifacts. This orthogonal transformation, optimized for Tensor Core execution, ensures that subsequent normalization is not dominated by channel-wise outliers.
  • Omni-Token Scaling: After rotation, token-wise scaling balances the norm of each token across the sequence, using rsqrt-based normalization (hardware-accelerated). Unlike direct token scaling, which introduces artificial channel-wise outlier artifacts, this process operates safely post-rotation, mitigating both TNI and outlier effects without degrading per-channel quantization fidelity.

Both steps are essential: Canalized Rotation ensures the scaling step does not artificially inflate channel ranges, while Omni-Token Scaling resolves TNI. The framework is universally applicable across text-only, multi-modal, and omni-modal LLMs, and demonstrates robustness under INT2 quantization.
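The two stages described above can be sketched in a few lines of plain Python. This is a toy reconstruction under assumptions, not the authors' kernels: the summary does not specify the norm used for scaling (unit RMS is assumed here), the quantization group layout, or how the per-token norm is stored, and the helper names (`hadamard`, `scale_token`, `rotate_scale_quantize`) are hypothetical.

```python
import math
import random


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in y]


def scale_token(x, eps=1e-12):
    """Rescale one token to unit RMS; returns (scaled token, original RMS)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x], 1.0 / inv_rms


def quantize_per_channel(keys, bits=2):
    """Quantize and dequantize with one (scale, offset) per channel."""
    levels = (1 << bits) - 1
    restored = [[0.0] * len(keys[0]) for _ in keys]
    for c in range(len(keys[0])):
        column = [tok[c] for tok in keys]
        lo, hi = min(column), max(column)
        scale = (hi - lo) / levels or 1.0
        for t, x in enumerate(column):
            restored[t][c] = round((x - lo) / scale) * scale + lo
    return restored


def relative_error(original, approx):
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, approx)))
    return num / math.sqrt(sum(a * a for a in original))


def rotate_scale_quantize(keys, bits=2):
    """Rotate, scale each token, quantize per channel, then undo the scaling.
    Returns (rotated keys, their reconstruction), both in the rotated basis."""
    rotated = [hadamard(k) for k in keys]
    scaled_and_rms = [scale_token(k) for k in rotated]
    restored = quantize_per_channel([s for s, _ in scaled_and_rms], bits)
    restored = [[v * rms for v in tok]
                for tok, (_, rms) in zip(restored, scaled_and_rms)]
    return rotated, restored


random.seed(0)
dim, num_tokens = 64, 32
keys = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
for tok in keys:
    tok[3] *= 20.0                        # synthetic outlier channel
keys[0] = [0.05 * v for v in keys[0]]     # synthetic low-norm token
query = [random.gauss(0.0, 1.0) for _ in range(dim)]

# Rotating both the query and the key leaves the attention logit unchanged.
raw_logit = sum(q * k for q, k in zip(query, keys[1]))
rot_logit = sum(q * k for q, k in zip(hadamard(query), hadamard(keys[1])))

baseline = quantize_per_channel(keys, bits=2)
rotated, restored = rotate_scale_quantize(keys, bits=2)
baseline_low = relative_error(keys[0], baseline[0])
sketch_low = relative_error(rotated[0], restored[0])
print(f"logit raw={raw_logit:.6f} rotated={rot_logit:.6f}")
print(f"low-norm token error: baseline={baseline_low:.2f} sketch={sketch_low:.2f}")
```

Since the transform is orthonormal, errors measured in the rotated basis have the same norm as they would after rotating back, so the two relative errors are directly comparable. Applying the same rotation to Queries is what allows attention scores to be computed against rotated Keys without an inverse transform.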

System-Level Design and CUDA Implementation

OScaR features a CUDA-optimized implementation, leveraging HadaCore and BitDecoding primitives with fused Hadamard-norm kernels, GPU-efficient quantization, and dequantization-attention fusion. All transformations and quantization occur online, with residual handling and periodic cache packing for long-context sequences. Token-wise norm metadata is efficiently managed, and memory layout adheres to 2-bit representation with minimal overhead.
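To make the idea of a 2-bit memory layout concrete, the following sketch packs four 2-bit codes into each byte and unpacks them again. The bit order (lowest bits first) and the function names `pack_int2` and `unpack_int2` are assumptions made for illustration; the actual layout used by the OScaR kernels and the BitDecoding primitives is not described in this summary.

```python
def pack_int2(codes):
    """Pack 2-bit codes (ints in 0..3) four per byte, lowest bits first."""
    if len(codes) % 4:
        raise ValueError("number of codes must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            if not 0 <= code <= 3:
                raise ValueError(f"code {code} does not fit in 2 bits")
            byte |= code << (2 * j)   # code j occupies bits 2j and 2j+1
        out.append(byte)
    return bytes(out)


def unpack_int2(packed):
    """Inverse of pack_int2: recover the list of 2-bit codes."""
    return [(byte >> (2 * j)) & 0b11 for byte in packed for j in range(4)]


codes = [3, 0, 1, 2, 0, 0, 0, 1]
packed = pack_int2(codes)
print(packed.hex())            # 9340
print(unpack_int2(packed))     # [3, 0, 1, 2, 0, 0, 0, 1]
```

Eight codes occupy two bytes here, compared with sixteen bytes for the same values in BF16. Scales, zero points, and per-token norms would be stored separately and add to this payload.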

Complexity analysis, supported by operation counts and cost conversion, positions OScaR favorably against strong baselines in terms of both arithmetic and lookup cost, avoiding LUT-induced hardware inefficiencies and achieving competitive empirical throughput.

Empirical Evaluation and Numerical Results

OScaR is extensively evaluated across leading benchmarks and LLM variants:

  • Text-Only LLMs: On LongBench-E, OScaR achieves the highest average accuracy among quantized methods (41.75%), outperforming the next-best by 1.01 percentage points; NIAH retrieval accuracy peaks at 96.5%, marginally exceeding even the 16-bit baseline.
  • Multi-Modal/Omni-Modal LLMs: On OCRBench and DocVQA, OScaR outperforms competing quantized methods by up to 2.5 percentage points and matches or slightly exceeds full-precision performance under INT2 quantization. On MMAU-Pro, it achieves open-ended QA and AIF scores that exceed both the 16-bit baseline and all strong quantized alternatives.
  • Efficiency: At 128K context length, OScaR delivers a 3.0x decoding speedup relative to FlashDecoding-v2, reduces memory footprint by 5.3x, and boosts throughput by 4.1x. Latency remains stable across context lengths, outperforming TurboQuant+ at long sequences. OScaR sits prominently on the accuracy-efficiency Pareto front, offering the highest accuracy at a moderate computational cost.

Theoretical and Practical Implications

OScaR advances KV cache quantization methodology by demonstrating that complexity in quantization pipelines is not strictly necessary for extreme compression when principled transformation and normalization are properly configured. The dual-stage (rotation + scaling) approach redefines the Pareto front for accuracy versus efficiency; avoids mixed-precision fragmentation; and maintains generalizability across modalities and architectures. The theoretical analysis further substantiates the necessity of TNI mitigation for extreme quantization, and the empirical ablation confirms both innovations as mutually essential.

Practically, OScaR provides a deployable recipe for memory-bound inference in LLMs (and potentially streaming vision/diffusion models), supporting batch scaling, and opening avenues for hardware-tailored optimization. The codebase is accessible for integration.

Limitations and Future Directions

While OScaR reduces overhead relative to heavy pipelines (e.g., TurboQuant), the need for online Canalized Rotation incurs computational cost, notably in architectures using RoPE. Future extensions could investigate more hardware-aware or offline fusion strategies in the transform step, applicability to non-LLM autoregressive models (e.g., vision streaming, diffusion), and layer-wise adaptive precision. The current evaluation is LLM-centric; broadening to other contexts would further validate generality.

Conclusion

OScaR establishes a concise, robust, and efficient paradigm for extreme KV cache quantization in contemporary LLMs, overcoming structural limitations of per-channel schemes through principled energy redistribution and token-wise scaling. It exhibits universality across model families and modalities, maintains near-lossless accuracy at 2-bit precision, and achieves substantial speed and memory gains. OScaR provides a critical framework for next-generation LLM deployment and serves as a practical reference point for memory-efficient sequence modeling (2605.19660).
