Token-Wise Asymmetric INT4 Quantization
- Token-wise asymmetric INT4 quantization is a method that independently quantizes each token using a lightweight block-diagonal Hadamard rotation to preserve model accuracy.
- It enables efficient KV-cache compression in long-context LLMs by aligning with fixed-size, page-aligned memory layouts and supporting fused kernel dequantization.
- Empirical results demonstrate nearly full BF16 accuracy recovery with significant memory savings and zero measurable throughput overhead in production-grade systems.
Token-wise asymmetric INT4 quantization, especially when combined with a block-diagonal Hadamard rotation, is a quantization strategy for efficient and accurate KV-cache compression in long-context LLM serving under strict real-world system constraints. By quantizing each token independently after a lightweight orthonormal rotation, the method recovers nearly all BF16 accuracy while delivering significant memory savings with no measurable throughput overhead. The approach is foundational to recent advances in production-grade LLM serving stacks where memory bandwidth, layout regularity, and kernel fusion are primary constraints (Jia et al., 21 Apr 2026).
1. Motivation and System Constraints
KV-cache memory is identified as the largest memory consumer in long-context LLM deployments, with scenarios such as Llama 4 Scout requiring in excess of 1.8 TiB for a context window of 10 million tokens—several times larger than the weights of the network. Modern deployment stacks (vLLM, SGLang, TensorRT-LLM) implement fixed-size, paged memory layouts and utilize fused attention kernels (e.g., FlashAttention) that directly operate on these layouts. Quantization or compression schemes that violate memory paging (e.g., channel-wise scaling across pages), rely on mixed precision within a page (such as mixing 2-bit and residual buffers), or require data-dependent indexing (e.g., codebook lookups) are incompatible with efficient serving since they disrupt memory coalescing and introduce non-trivial decode-time overhead.
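The scale of the problem can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are assumed values for a Llama-4-Scout-like model and are not taken from the source, and `kv_cache_bytes` is a hypothetical helper. With these assumptions the BF16 figure lands close to the number quoted above.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bits_per_value):
    """Payload bytes needed to store keys and values for num_tokens tokens."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return num_tokens * values_per_token * bits_per_value // 8


# ASSUMPTION: model shape chosen for illustration, not stated in the source.
ASSUMED_SHAPE = dict(num_layers=48, num_kv_heads=8, head_dim=128)
TOKENS = 10_000_000
TIB = 2 ** 40

bf16_bytes = kv_cache_bytes(TOKENS, bits_per_value=16, **ASSUMED_SHAPE)
int4_bytes = kv_cache_bytes(TOKENS, bits_per_value=4, **ASSUMED_SHAPE)

print(f"BF16 KV cache: {bf16_bytes / TIB:.2f} TiB")
print(f"INT4 payload:  {int4_bytes / TIB:.2f} TiB (scale/zero-point metadata excluded)")
```

The INT4 payload is exactly one quarter of the BF16 size; per-token scale and zero-point metadata add a small amount on top.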
Token-wise quantization circumvents these pitfalls by independently quantizing each token and each attention head, preserving the page structure and facilitating a single fused in-kernel dequantization. In practical auto-regressive decoding, latency is bandwidth-bound: any scheme introducing extra memory passes, such as an unfused rotation step, incurs a 1–3% latency penalty and directly reduces throughput. A practical compression scheme must therefore restore nearly full BF16 accuracy while imposing no measurable overhead relative to naïve INT4 (Jia et al., 21 Apr 2026).
2. Mathematical Formulation
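Let $d$ denote the per-head dimension, with a block size $B$ dividing $d$. Each token’s head vector $x \in \mathbb{R}^{d}$ is split into $d/B$ contiguous blocks.
- Block-Diagonal Hadamard Rotation: Each block receives an orthonormal transformation using a fixed normalized Hadamard matrix $H_B \in \mathbb{R}^{B \times B}$. The complete transform is $R = \operatorname{diag}(H_B, \ldots, H_B) \in \mathbb{R}^{d \times d}$, so $R^{\top} R = I_d$.
- Token-wise Asymmetric INT4 Quantization: For each rotated vector $\tilde{x} = R x$, the scale $s$ and zero-point $z$ are
$$s = \frac{\max_i \tilde{x}_i - \min_i \tilde{x}_i}{15}, \qquad z = \min_i \tilde{x}_i.$$
Each value is quantized as
$$q_i = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{\tilde{x}_i - z}{s}\right),\, 0,\, 15\right),$$
with two 4-bit codes packed per byte in memory.
- Dequantization: In the attention computation kernel,
$$\hat{x} = R^{\top}\left(s\, q + z\, \mathbf{1}\right)$$
recovers the floating-point token vector up to quantization error; because $R$ is orthonormal, the rotation itself leaves floating-point attention scores unchanged.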
This formulation ensures that quantization and dequantization are strictly local to each token, preserve the memory layout, and are amenable to full kernel fusion (Jia et al., 21 Apr 2026).
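The rotate, quantize, pack, and dequantize steps can be sketched in plain Python as a reference for one token's head vector. This is a minimal sketch under stated assumptions, not the authors' kernel: the function names are hypothetical, the Hadamard matrix uses the Sylvester construction, and the zero-point is stored as the floating-point minimum of the rotated vector (other conventions, such as an integer zero-point, are equally plausible).

```python
import math


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = 1.0 / math.sqrt(n)
    return [[v * norm for v in row] for row in h]


def rotate(x, h):
    """Block-diagonal rotation: multiply each contiguous length-B block of x by h."""
    b = len(h)
    assert len(x) % b == 0
    out = []
    for start in range(0, len(x), b):
        block = x[start:start + b]
        out.extend(sum(h[i][j] * block[j] for j in range(b)) for i in range(b))
    return out


def quantize_token(x, h):
    """Rotate one head vector, then apply asymmetric INT4 quantization and packing."""
    xr = rotate(x, h)
    zero, hi = min(xr), max(xr)
    scale = (hi - zero) / 15 or 1.0  # fall back to 1.0 for a constant vector
    codes = [min(15, max(0, round((v - zero) / scale))) for v in xr]
    # Two 4-bit codes per byte: even index in the low nibble, odd index in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, zero


def dequantize_token(packed, scale, zero, h):
    """Unpack, rescale, and rotate back to the original basis."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    xr_hat = [c * scale + zero for c in codes]
    # Sylvester Hadamard matrices are symmetric, so the inverse rotation R^T equals R.
    return rotate(xr_hat, h)


h = hadamard(4)  # block size B = 4, small for readability
x = [0.1, -0.2, 8.0, 0.3, 0.05, -0.4, 0.2, 0.0]  # d = 8, one outlier channel
packed, s, z = quantize_token(x, h)
x_hat = dequantize_token(packed, s, z, h)
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
print(len(packed), "bytes, max abs error", max_err)
```

Each rotated coordinate is off by at most $s/2$ after rounding, so the reconstruction error per original coordinate is bounded by $\sqrt{B}\, s/2$. Everything the routine needs (the vector, its scale, and its zero-point) belongs to a single token, which is what keeps the scheme page-local.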
3. Role and Impact of Block-Diagonal Hadamard Rotation
Without rotation, channel outliers in $x$ induce large values in the scale $s$, deteriorating quantization fidelity by constraining most coordinates to a small subset of quantization bins. The block-diagonal Hadamard rotation uniformly redistributes per-block energy, reducing per-coordinate outlier impact by a factor of approximately $\sqrt{B}$ and minimizing dynamic range disparity.
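The outlier effect can be reproduced on synthetic data. The sketch below is an illustration under assumptions, not the paper's experiment: the input is made-up Gaussian noise with one large channel, the helper names are hypothetical, and the quantizer stores the zero-point as the floating-point minimum. It first shows that a lone spike of magnitude 16 in a block of 64 is spread to magnitude 2 per coordinate, then compares INT4 round-trip error with and without rotation.

```python
import math
import random


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = 1.0 / math.sqrt(n)
    return [[v * norm for v in row] for row in h]


def matvec(h, v):
    return [sum(a * b for a, b in zip(row, v)) for row in h]


def int4_roundtrip(v):
    """Asymmetric INT4 quantize followed by dequantize."""
    zero, hi = min(v), max(v)
    scale = (hi - zero) / 15 or 1.0
    return [min(15, max(0, round((t - zero) / scale))) * scale + zero for t in v]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


B = 64
H = hadamard(B)

# A lone outlier of magnitude 16 becomes 16 / sqrt(64) = 2 in every coordinate.
spike = [0.0] * B
spike[5] = 16.0
peak_after = max(abs(t) for t in matvec(H, spike))

# Synthetic token: small Gaussian values plus one outlier channel.
random.seed(0)
x = [random.gauss(0.0, 0.1) for _ in range(B)]
x[5] = 16.0

mse_plain = mse(x, int4_roundtrip(x))
# H is symmetric and orthonormal, so applying it twice undoes the rotation.
mse_rotated = mse(x, matvec(H, int4_roundtrip(matvec(H, x))))

print("peak after rotation:", peak_after)
print("MSE without rotation:", mse_plain)
print("MSE with rotation:   ", mse_rotated)
```

Without rotation the outlier stretches the scale so far that all the small values collapse into the lowest bin; after rotation the dynamic range shrinks and the same 16 levels resolve the vector more finely.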
Empirical results (Qwen3-4B, five reasoning/coding tasks, mean score):
| Method | Mean Accuracy | Drop from BF16 |
|---|---|---|
| BF16 (no quant.) | 75.64 | – |
| Naïve INT4 | 0.00 | –75.64 |
| BDR-16 | 54.83 | –20.81 |
| BDR-64 | 72.29 | –3.35 |
| BDR-128 | 73.11 | –2.53 |
| Hessian+BDR-128 | 65.52 | –10.12 |
| KMeans (C=256) | 71.64 | –4.00 |
Applying a block-diagonal rotation with $B=64$ or $B=128$ recovers nearly all BF16 accuracy. More complex approaches provide minimal further gains once block-diagonal rotation is in place (Jia et al., 21 Apr 2026).
4. System-Aware Implementation
Fused Rotate-Quantize Kernel
Prefill operations (writing to KV-cache) and decode operations (reading during attention) are executed within a single CUDA/Triton kernel that fuses rotation, scale/zero-point calculation, quantization, and (on decode) dequantization. This in-register processing eliminates extra memory traversal; rotations on streamed tiles are performed directly within the computational kernel.
Memory Layout and End-to-End Throughput
Token-wise INT4 maintains the same token-major, page-aligned buffer structure as BF16 but with roughly 4× smaller page sizes, eliminating fragmentation or page-table modification. Kernel profiling for Qwen3-32B decode step (2×H100, batch size=32):
| Kernel | Runtime (µs) |
|---|---|
| Plain INT4 total decode | 533 |
| INT4 + fused rotate | 530 |
| INT4 + unfused rotate (extra) | 541 |
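Returning to the page layout described above, the size reduction can be made concrete with a small calculation. The numbers here are assumptions for illustration and are not given in the source: 16 tokens per page, a head dimension of 128, and one FP16 scale plus one FP16 zero-point (4 bytes) stored per token per head inside the page. `page_bytes` is a hypothetical helper.

```python
def page_bytes(tokens_per_page, head_dim, bits_per_value, meta_bytes_per_token=0):
    """Bytes occupied by one page of one KV head, including per-token metadata."""
    payload = tokens_per_page * head_dim * bits_per_value // 8
    return payload + tokens_per_page * meta_bytes_per_token


# ASSUMPTIONS: page size, head dimension, and metadata format are illustrative.
TOKENS_PER_PAGE = 16
HEAD_DIM = 128
META_BYTES = 4  # FP16 scale + FP16 zero-point per token per head

bf16_page = page_bytes(TOKENS_PER_PAGE, HEAD_DIM, 16)
int4_page = page_bytes(TOKENS_PER_PAGE, HEAD_DIM, 4, META_BYTES)

print("BF16 page bytes:", bf16_page)   # 4096
print("INT4 page bytes:", int4_page)   # 1088
print(f"reduction: {bf16_page / int4_page:.2f}x")
```

Under these assumptions the reduction is slightly below 4× because of the metadata, but every page still has the same fixed size, so the page table and allocator logic carry over unchanged.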
End-to-end throughput (2×H100, 32 request concurrency, ctx=8192, gen=1024):
| Method | TPS/GPU | Acc (4B) | Acc (8B) |
|---|---|---|---|
| BF16 | 1 030 | 75.64 | 70.84 |
| INT4 | 1 217 | 0 | 0 |
| BDR-INT4 | 1 242 | 73.78 | 69.86 |
Fused rotation achieves throughput parity with naïve INT4, with accuracy approaching that of non-quantized BF16, and substantially exceeds BF16 throughput (Jia et al., 21 Apr 2026).
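The relative gains implied by the throughput table can be computed directly. The snippet below only re-derives percentages from the figures reported above; `relative_to_bf16` is a hypothetical helper name.

```python
# Values copied from the throughput table: (TPS/GPU, accuracy 4B, accuracy 8B).
table = {
    "BF16": (1030, 75.64, 70.84),
    "INT4": (1217, 0.0, 0.0),
    "BDR-INT4": (1242, 73.78, 69.86),
}


def relative_to_bf16(name):
    """Throughput gain and retained accuracy of a method, relative to BF16."""
    tps, acc_4b, acc_8b = table[name]
    base_tps, base_4b, base_8b = table["BF16"]
    return {
        "tps_gain_pct": 100.0 * (tps / base_tps - 1.0),
        "acc_4b_retained_pct": 100.0 * acc_4b / base_4b,
        "acc_8b_retained_pct": 100.0 * acc_8b / base_8b,
    }


for method in ("INT4", "BDR-INT4"):
    summary = {k: round(v, 1) for k, v in relative_to_bf16(method).items()}
    print(method, summary)
```

In this configuration BDR-INT4 delivers roughly a 20% throughput gain over BF16 while retaining about 97.5% (4B) and 98.6% (8B) of its accuracy.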
5. Empirical Results Across Architectures and Workloads
Evaluations on Qwen3-4B, Qwen3-8B, Qwen3-32B, and GLM-4.7 (358B) across GPQA, HumanEval, LiveCodeBench, AIME25, and MATH500 demonstrate robust generalization of BDR-INT4 methodology:
- With BDR-128, accuracy drop vs. BF16 is –2.5 points (4B), –0.9 points (8B), and negligible on GLM-4.7.
- Hessian+BDR-128 results in a –10 point drop (4B).
- KMeans (C=256) achieves a –4 point drop (4B), remaining inferior to BDR.
Throughput scaling experiments indicate BDR-INT4 tracks or marginally exceeds plain INT4 throughput over varying batch sizes and concurrency, outperforming BF16 by 10–40% at high concurrency. Under memory pressure (long context, high concurrency), "system TPS" reflects a tangible 20–40% advantage for (BDR-)INT4 over BF16, even as per-request TPS may give misleading impressions due to buffer limitations (Jia et al., 21 Apr 2026).
6. Practical Guidelines and Limitations
Token-wise quantization is currently the only viable low-bit scheme compatible with serving constraints such as paged memory layouts and fused kernel execution. A block-diagonal Hadamard rotation (block size ≈ 64–128) before asymmetric INT4 quantization effectively removes outlier-induced quantization errors, recovering over 97% of BF16 accuracy for fragile models. Implementation requires a single fused CUDA/Triton kernel encompassing rotation, quantization, and memory write, preserving throughput without measurable overhead.
Complex schemes such as vector quantization and Hessian-aware methods introduce additional calibration or irregular memory access, or complicate kernel fusion, while offering minimal incremental benefit once block-diagonal rotation is applied. Consequently, the system-aware methodology (token-wise asymmetric INT4 quantization with block-diagonal Hadamard rotation, sometimes referred to as "SAW-INT4") is presented as the best available trade-off among accuracy, efficiency, and deployability for production LLM serving (Jia et al., 21 Apr 2026).