- The paper presents a dual-stage framework, OScaR, that mitigates token norm imbalance using canalized rotation and omni-token scaling for extreme low-bit KV cache quantization.
- It employs a training-free, CUDA-optimized implementation with fused Hadamard-norm kernels to achieve high efficiency, robust accuracy, and reduced memory footprint in both text-only and multi-modal LLMs.
- Empirical results demonstrate up to a 3.0x decoding speedup and a 5.3x memory reduction, positioning OScaR on the Pareto front for balancing accuracy and efficiency in LLM inference.
OScaR: A Framework for Extreme KV Cache Quantization in X-LLMs
Motivation and Problem Statement
The increasing adoption of long-context and multi-modal LLMs mandates efficient memory management strategies, as the memory footprint of the attention Key-Value (KV) cache scales linearly with sequence length and quickly becomes a primary bottleneck in both inference throughput and deployment scalability. Extreme low-bit KV cache quantization is a promising avenue for reclaiming memory efficiency, but it is fundamentally challenged by the presence of both channel-wise and token-wise outliers, notably Token Norm Imbalance (TNI), which the paper rigorously identifies as the structural impediment to high-fidelity per-channel quantization (2605.19660).
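The linear growth of the KV cache with sequence length can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name `kv_cache_bytes` and the model configuration (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for the example, not values from the paper, and the estimate deliberately ignores quantization metadata and residual buffers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits, batch_size=1):
    """Bytes for Keys and Values only; ignores scales, zero-points, and residual buffers."""
    # The leading factor 2 accounts for storing both K and V.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

fp16_128k = kv_cache_bytes(seq_len=128 * 1024, bits=16, **cfg)
int2_128k = kv_cache_bytes(seq_len=128 * 1024, bits=2, **cfg)

print(f"16-bit cache at 128K tokens: {fp16_128k / 2**30:.1f} GiB")
print(f" 2-bit cache at 128K tokens: {int2_128k / 2**30:.1f} GiB")
```

Under these assumptions the ideal 16-bit to 2-bit ratio is 8x. Real reductions are lower once per-channel quantization parameters, per-token metadata, and any unquantized residual are counted, which is consistent with the 5.3x figure reported in the summary, although the exact breakdown is not given here.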
Analysis of Token Norm Imbalance (TNI)
TNI denotes substantial intra-channel token norm variance, manifesting as consistent low-norm outlier tokens ("attention sinks") or, in multi-modal contexts, as broader norm disparities, both within modality sequences and across modalities. Empirical and theoretical analyses reveal that per-channel quantization schemes fail under TNI, because each channel's quantization parameters must span tokens with widely divergent norms. This amplifies quantization error, especially at extreme bit-widths, undermining accuracy and model robustness in both text-only and multi-modal LLMs. The paper substantiates this with systematic norm profiling, rigorous error derivations, and quantitative evaluation across a range of LLM architectures.
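The effect of norm imbalance on per-channel quantization can be reproduced on synthetic data. The following is a minimal sketch, not the paper's experiment: the helper names are hypothetical, the Keys are random Gaussian values, and the imbalance is simulated by giving four tokens a 20x larger norm as a stand-in for any large norm disparity. It uses plain asymmetric min/max INT2 quantization with one scale and zero-point per channel.

```python
import numpy as np


def quantize_per_channel(x, bits=2):
    """Asymmetric min/max fake-quantization, one (scale, zero-point) per channel (column)."""
    levels = 2**bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo  # dequantized values


def mean_token_rel_error(x, x_hat):
    """Average over tokens of ||x_t - x_hat_t||^2 / ||x_t||^2."""
    num = np.sum((x - x_hat) ** 2, axis=1)
    den = np.sum(x**2, axis=1)
    return float(np.mean(num / den))


rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))  # (tokens, channels), synthetic

# Illustrative imbalance (an assumption): four tokens get a 20x larger norm.
keys_tni = keys.copy()
keys_tni[:4] *= 20.0

err_balanced = mean_token_rel_error(keys, quantize_per_channel(keys))
err_tni = mean_token_rel_error(keys_tni, quantize_per_channel(keys_tni))
print(f"balanced norms:   {err_balanced:.3f}")
print(f"imbalanced norms: {err_tni:.3f}")
```

Because the few large-norm tokens set every channel's min/max range, the four available INT2 levels are spread over a range that ordinary tokens barely occupy, so their relative error rises sharply.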
Methodological Contributions: OScaR Framework
OScaR (Omni-Scaled Canalized Rotation) is introduced to directly address TNI in a lightweight, training-free manner, guided by Occam's Razor for minimal auxiliary overhead. The approach proceeds in two stages:
- Canalized Rotation: The Hadamard Transform is applied online to Keys (and Queries), redistributing outlier channel energy and preventing scaling-induced artifacts. This orthogonal transformation, optimized for Tensor Core execution, ensures that subsequent normalization is not dominated by channel-wise outliers.
- Omni-Token Scaling: After rotation, token-wise scaling balances the norm of each token across the sequence, using rsqrt-based normalization (hardware-accelerated). Unlike direct token scaling, which introduces artificial channel-wise outlier artifacts, this process operates safely post-rotation, mitigating both TNI and outlier effects without degrading per-channel quantization fidelity.
Both steps are essential: Canalized Rotation ensures the scaling step does not artificially inflate channel ranges, while Omni-Token Scaling resolves TNI. The framework is universally applicable across text-only, multi-modal, and omni-modal LLMs, and demonstrates robustness under INT2 quantization.
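The two stages can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's CUDA implementation: the function names are hypothetical, the Hadamard matrix is built densely with the Sylvester construction instead of a fast transform, the quantizer is plain per-channel min/max INT2, and the synthetic Keys reuse a 20x norm imbalance on four tokens. The sketch also checks that rotating both Queries and Keys by the same orthogonal matrix leaves attention scores unchanged.

```python
import numpy as np


def hadamard(d):
    """Orthonormal Hadamard matrix via the Sylvester construction (d a power of two)."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    h = np.ones((1, 1))
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(d)


def quantize_per_channel(x, bits=2):
    """Asymmetric min/max fake-quantization, one (scale, zero-point) per channel."""
    levels = 2**bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    return np.clip(np.round((x - lo) / scale), 0, levels) * scale + lo


def rotate_scale_quantize(keys, bits=2, eps=1e-6):
    """Stage 1: rotate channels. Stage 2: normalize each token. Then quantize per channel."""
    rotated = keys @ hadamard(keys.shape[1])
    inv_norm = 1.0 / np.sqrt(np.sum(rotated**2, axis=1, keepdims=True) + eps)  # rsqrt
    dequant = quantize_per_channel(rotated * inv_norm, bits)
    return dequant / inv_norm  # undo token scaling; result lives in rotated space


def mean_token_rel_error(x, x_hat):
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1) / np.sum(x**2, axis=1)))


rng = np.random.default_rng(0)
keys_tni = rng.standard_normal((256, 64))
keys_tni[:4] *= 20.0  # illustrative norm imbalance (an assumption)
q = rng.standard_normal((8, 64))
h = hadamard(64)

# Orthogonality: (Q H)(K H)^T equals Q K^T, so scores can be computed in rotated space.
scores_match = np.allclose(q @ keys_tni.T, (q @ h) @ (keys_tni @ h).T)

err_plain = mean_token_rel_error(keys_tni, quantize_per_channel(keys_tni))
err_rot = mean_token_rel_error(keys_tni @ h, rotate_scale_quantize(keys_tni))
print(f"scores preserved by rotation: {scores_match}")
print(f"plain per-channel INT2:       {err_plain:.3f}")
print(f"rotate + scale + INT2:        {err_rot:.3f}")
```

In this toy setting the per-token norm must be stored alongside the 2-bit codes so the scaling can be undone at dequantization time, which is why the sketch returns `dequant / inv_norm`.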
System-Level Design and CUDA Implementation
OScaR features a CUDA-optimized implementation, leveraging HadaCore and BitDecoding primitives with fused Hadamard-norm kernels, GPU-efficient quantization, and dequantization-attention fusion. All transformations and quantization occur online, with residual handling and periodic cache packing for long-context sequences. Token-wise norm metadata is efficiently managed, and memory layout adheres to 2-bit representation with minimal overhead.
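As a rough illustration of what a 2-bit memory layout involves, the sketch below packs four 2-bit codes into each byte and unpacks them again. The bit order (first code in the lowest bits) and the function names `pack_int2` and `unpack_int2` are assumptions made for this example; the actual layout used by the paper's kernels and by BitDecoding is not described here and may differ.

```python
import numpy as np


def pack_int2(codes):
    """Pack 2-bit codes (values 0..3) four per byte, first code in the lowest bits."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 4 == 0 and codes.max() <= 3
    c = codes.reshape(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)


def unpack_int2(packed):
    """Inverse of pack_int2: recover the flat sequence of 2-bit codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 3).reshape(-1)


codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)
packed = pack_int2(codes)
restored = unpack_int2(packed)
print(packed.nbytes, "bytes for", codes.size, "codes")
```

If one assumes a single 16-bit norm stored per token per head and a head dimension of 128, the token-wise metadata adds 16 bits to 256 bits of packed codes, about 6% extra; the real overhead depends on the grouping and metadata precision actually used.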
Complexity analysis, supported by operation counts and cost conversion, positions OScaR favorably against strong baselines in terms of both arithmetic and lookup cost, avoiding LUT-induced hardware inefficiencies and achieving competitive empirical throughput.
Empirical Evaluation and Numerical Results
OScaR is extensively evaluated across leading benchmarks and LLM variants:
- Text-Only LLMs: On LongBench-E, OScaR achieves the highest average accuracy among quantized methods (41.75%), outperforming the next-best by 1.01 percentage points; NIAH retrieval accuracy peaks at 96.5%, marginally exceeding even the 16-bit baseline.
- Multi-Modal/Omni-Modal LLMs: On OCRBench and DocVQA, OScaR outperforms competing quantized methods by up to 2.5 percentage points and matches or slightly exceeds full-precision performance under INT2 quantization. On MMAU-Pro, it achieves open-ended QA and AIF scores that exceed both the 16-bit baseline and all strong quantized alternatives.
- Efficiency: At 128K context length, OScaR delivers a 3.0x decoding speedup relative to FlashDecoding-v2, reduces memory footprint by 5.3x, and boosts throughput by 4.1x. Latency remains stable across context lengths, outperforming TurboQuant+ at long sequences. OScaR sits prominently on the accuracy-efficiency Pareto front, offering the highest accuracy at a moderate computational cost.
Theoretical and Practical Implications
OScaR advances KV cache quantization methodology by demonstrating that complex quantization pipelines are not strictly necessary for extreme compression when principled transformation and normalization are properly configured. The dual-stage (rotation + scaling) approach redefines the Pareto front for accuracy versus efficiency, avoids mixed-precision fragmentation, and maintains generalizability across modalities and architectures. The theoretical analysis further substantiates the necessity of TNI mitigation for extreme quantization, and the empirical ablation confirms that both components are needed together.
Practically, OScaR provides a deployable recipe for memory-bound inference in LLMs (and potentially streaming vision/diffusion models), supporting batch scaling, and opening avenues for hardware-tailored optimization. The codebase is accessible for integration.
Limitations and Future Directions
While OScaR reduces overhead relative to heavy pipelines (e.g., TurboQuant), the need for online Canalized Rotation incurs computational cost, notably in architectures using RoPE. Future extensions could investigate more hardware-aware or offline fusion strategies in the transform step, applicability to non-LLM autoregressive models (e.g., vision streaming, diffusion), and layer-wise adaptive precision. The current evaluation is LLM-centric; broadening to other contexts would further validate generality.
Conclusion
OScaR establishes a concise, robust, and efficient paradigm for extreme KV cache quantization in contemporary LLMs, overcoming structural limitations of per-channel schemes by principled energy redistribution and token-wise scaling. It exhibits universality across model families and modes, maintains near-lossless accuracy at 2-bit precision, and achieves substantial speed/memory gains. OScaR provides a critical framework for next-generation LLM deployment and serves as a practical reference point for memory-efficient sequence modeling (2605.19660).