Token-Wise Asymmetric INT4 Quantization

Updated 23 April 2026
  • Token-wise asymmetric INT4 quantization is a method that independently quantizes each token using a lightweight block-diagonal Hadamard rotation to preserve model accuracy.
  • It enables efficient KV-cache compression in long-context LLMs by aligning with fixed-size, page-aligned memory layouts and supporting fused kernel dequantization.
  • Empirical results demonstrate nearly full BF16 accuracy recovery with significant memory savings and zero measurable throughput overhead in production-grade systems.

Token-wise asymmetric INT4 quantization, especially when combined with block-diagonal Hadamard rotation, is a quantization strategy that enables efficient and accurate KV-cache compression in long-context LLM serving under strict real-world system constraints. By quantizing each token independently after a lightweight orthonormal rotation, the method recovers nearly all BF16 accuracy while delivering significant memory savings and no measurable throughput overhead relative to plain INT4. The approach is foundational to recent advances in production-grade LLM serving stacks where memory bandwidth, layout regularity, and kernel fusion are primary constraints (Jia et al., 21 Apr 2026).

1. Motivation and System Constraints

KV-cache memory is the largest memory consumer in long-context LLM deployments: Llama 4 Scout, for example, requires in excess of 1.8 TiB for a context window of 10 million tokens, several times the size of the network weights. Modern deployment stacks (vLLM, SGLang, TensorRT-LLM) implement fixed-size, paged memory layouts and use fused attention kernels (e.g., FlashAttention) that operate directly on these layouts. Quantization or compression schemes that violate memory paging (e.g., channel-wise scaling across pages), rely on mixed precision within a page (such as mixing 2-bit and residual buffers), or require data-dependent indexing (e.g., codebook lookups) are incompatible with efficient serving because they disrupt memory coalescing and introduce non-trivial decode-time overhead.

Token-wise quantization circumvents these pitfalls by independently quantizing each token and each attention head, preserving the page structure and allowing a single fused in-kernel dequantization. In practical auto-regressive decoding, latency is bandwidth-bound: any scheme that introduces extra memory passes, such as an unfused rotation step, incurs a 1–3% latency penalty and directly reduces throughput. A practical compression scheme must therefore restore nearly full BF16 accuracy while imposing no measurable overhead relative to naïve INT4 (Jia et al., 21 Apr 2026).
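To make the memory argument concrete, the following is a minimal sketch of the usual KV-cache footprint arithmetic. The model shape (48 layers, 8 KV heads, head dimension 128) and the metadata size (one 2-byte scale plus one 2-byte zero-point per token per head) are illustrative assumptions, not values taken from the source.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_value,
                       metadata_bytes_per_head=0):
    """Bytes of KV-cache that one token occupies across all layers (keys + values)."""
    payload = head_dim * bits_per_value // 8          # one head vector
    per_head = payload + metadata_bytes_per_head      # plus scale / zero-point, if any
    return 2 * num_layers * num_kv_heads * per_head   # factor 2: keys and values


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits_per_value=16)
# Assumption: a 2-byte scale and a 2-byte zero-point per token per head.
int4 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits_per_value=4,
                          metadata_bytes_per_head=4)

context = 1_000_000
print(f"BF16: {bf16 * context / 2**30:.1f} GiB per {context:,} tokens")
print(f"INT4: {int4 * context / 2**30:.1f} GiB per {context:,} tokens")
print(f"compression ratio: {bf16 / int4:.2f}x")
```

Under these assumptions the per-token metadata keeps the ratio slightly below the ideal 4x, and the gap shrinks as the head dimension grows.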

2. Mathematical Formulation

Let $d_h$ denote the per-head dimension, with a block size $b$ dividing $d_h$. Each token's head vector $x_t \in \mathbb{R}^{d_h}$ is split into $m = d_h / b$ contiguous blocks.

  • Block-Diagonal Hadamard Rotation: Each block receives an orthonormal transformation using a fixed Hadamard matrix $H_b \in \{\pm 1\}^{b \times b}$, taken with the $1/\sqrt{b}$ normalization that makes it orthonormal. The complete transform is $H = \operatorname{diag}(H_b, \ldots, H_b)$, so $x_t' = H x_t$.
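  • Token-wise Asymmetric INT4 Quantization: For each rotated vector $x_t'$, the scale $S_t$ and zero-point $Z_t$ are computed from that token's own value range. In the standard asymmetric min-max form,

$$S_t = \frac{\max_i x'_{t,i} - \min_i x'_{t,i}}{2^4 - 1}, \qquad Z_t = \min_i x'_{t,i}.$$

Each value is quantized as

$$q_{t,i} = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x'_{t,i} - Z_t}{S_t}\right),\ 0,\ 15\right),$$

with two 4-bit codes packed per byte in memory.

  • Dequantization: In the attention computation kernel,

$$\hat{x}_t = H^\top \left(S_t\, q_t + Z_t \mathbf{1}\right)$$

recovers the floating-point token vector up to the INT4 rounding error. Because $H$ is orthonormal, the rotation itself is inverted exactly and floating-point attention scores are preserved.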

This formulation ensures that quantization and dequantization are strictly local to each token, preserve the memory layout, and are amenable to full kernel fusion (Jia et al., 21 Apr 2026).
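The formulation above can be sketched in a few lines of pure Python. This is an illustrative reference under stated assumptions, not the authors' kernel: it assumes the Sylvester (natural-order) Hadamard matrix, min-max scale and zero-point, and round-to-nearest; the function names and the toy vector are hypothetical.

```python
import math


def rotate(x, b):
    """Orthonormal block-diagonal Hadamard rotation; b is a power of two dividing len(x)."""
    out = list(x)
    for start in range(0, len(out), b):
        h = 1
        while h < b:                       # fast Walsh-Hadamard butterfly
            for i in range(start, start + b, 2 * h):
                for j in range(i, i + h):
                    u, v = out[j], out[j + h]
                    out[j], out[j + h] = u + v, u - v
            h *= 2
    norm = 1.0 / math.sqrt(b)              # makes each block transform orthonormal
    return [v * norm for v in out]


def quantize(x_rot):
    """Token-wise asymmetric INT4: one (scale, zero_point) pair for the whole vector."""
    lo, hi = min(x_rot), max(x_rot)
    scale = (hi - lo) / 15 or 1.0          # fall back to 1.0 for a constant vector
    codes = [min(15, max(0, round((v - lo) / scale))) for v in x_rot]
    return codes, scale, lo


def dequantize(codes, scale, zero_point, b):
    """Rebuild the rotated vector, then undo the rotation."""
    x_rot_hat = [scale * q + zero_point for q in codes]
    # The normalised Sylvester matrix is symmetric and orthogonal, so it is its own inverse.
    return rotate(x_rot_hat, b)


d_h, b = 16, 8                              # toy sizes for illustration
x = [math.sin(1.3 * i) for i in range(d_h)]
x[5] = 6.0                                  # one outlier channel
codes, scale, zero_point = quantize(rotate(x, b))
x_hat = dequantize(codes, scale, zero_point, b)
l2_error = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, x_hat)))
print(codes)
print(f"scale={scale:.4f} zero_point={zero_point:.4f} l2_error={l2_error:.4f}")
```

Each rotated coordinate is off by at most half a quantization step, and the orthonormal inverse rotation leaves the Euclidean norm of that error unchanged, so the reconstruction error is bounded by $\sqrt{d_h}\, S_t / 2$.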

3. Role and Impact of Block-Diagonal Hadamard Rotation

Without rotation, channel outliers in $x_t$ induce large values in the scale $S_t$, deteriorating quantization fidelity by constraining most coordinates to a small subset of quantization bins. The block-diagonal Hadamard rotation uniformly redistributes per-block energy, reducing per-coordinate outlier impact by a factor of approximately $\sqrt{b}$ and minimizing dynamic range disparity.
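The effect can be checked numerically. The sketch below first rotates a vector containing a single spike, where every rotated coordinate ends up with magnitude equal to the spike divided by $\sqrt{b}$, and then compares the INT4 round-trip error with and without rotation on a synthetic token. The data (unit-variance Gaussian channels plus one outlier of 50) and all names are assumptions made for illustration.

```python
import math
import random


def hadamard_rotate(x, b):
    """Orthonormal block-diagonal Hadamard rotation with block size b (a power of two)."""
    out = list(x)
    for start in range(0, len(out), b):
        h = 1
        while h < b:
            for i in range(start, start + b, 2 * h):
                for j in range(i, i + h):
                    u, v = out[j], out[j + h]
                    out[j], out[j + h] = u + v, u - v
            h *= 2
    norm = 1.0 / math.sqrt(b)
    return [v * norm for v in out]


def int4_roundtrip(x):
    """Quantize to asymmetric INT4 with one scale/zero-point, then dequantize."""
    lo, hi = min(x), max(x)
    scale = (hi - lo) / 15 or 1.0
    return [scale * min(15, max(0, round((v - lo) / scale))) + lo for v in x]


def rms_error(x, y):
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(x, y)) / len(x))


b = 64

# 1) A lone spike of 8.0 is spread to magnitude 8 / sqrt(64) = 1 in every coordinate.
spike = [0.0] * b
spike[7] = 8.0
spread = hadamard_rotate(spike, b)

# 2) Synthetic token: ordinary channels plus one large outlier channel.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(b)]
x[7] = 50.0

plain_err = rms_error(x, int4_roundtrip(x))
rotated = hadamard_rotate(x, b)
# Rotating a second time inverts the transform (symmetric, orthonormal matrix).
bdr_err = rms_error(x, hadamard_rotate(int4_roundtrip(rotated), b))

print(f"max |x|  = {max(abs(v) for v in x):.2f}")
print(f"max |Hx| = {max(abs(v) for v in rotated):.2f}")
print(f"RMS error without rotation: {plain_err:.3f}")
print(f"RMS error with rotation   : {bdr_err:.3f}")
```

Without rotation the outlier stretches the scale so far that the ordinary channels collapse onto one or two codes; after rotation the range is several times smaller and all 16 codes are in use.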

Empirical results (Qwen3-4B, five reasoning/coding tasks, mean score):

| Method | Mean Accuracy | Drop from BF16 |
|---|---|---|
| BF16 (no quant.) | 75.64 | – |
| Naïve INT4 | 0.00 | −75.64 |
| BDR-16 | 54.83 | −20.81 |
| BDR-64 | 72.29 | −3.35 |
| BDR-128 | 73.11 | −2.53 |
| Hessian+BDR-128 | 65.52 | −10.12 |
| KMeans ($C=256$) | 71.64 | −4.00 |

Applying a block-diagonal rotation with $b = 64$ or $b = 128$ recovers nearly all BF16 accuracy. More complex approaches provide minimal further gains once block-diagonal rotation is in place (Jia et al., 21 Apr 2026).

4. System-Aware Implementation

Fused Rotate-Quantize Kernel

Prefill operations (writing to KV-cache) and decode operations (reading during attention) are executed within a single CUDA/Triton kernel that fuses rotation, scale/zero-point calculation, quantization, and (on decode) dequantization. This in-register processing eliminates extra memory traversal; rotations on streamed tiles are performed directly within the computational kernel.
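A real fused kernel would be written in CUDA or Triton; the pure-Python sketch below only mirrors its data flow for one head vector. The write path rotates, quantizes, and packs in one pass, and the read path dequantizes nibble by nibble inside the dot product. Two choices here are assumptions rather than details from the source: the low-nibble-first packing order, and rotating the query once (valid because $\langle Hq, Hk \rangle = \langle q, k \rangle$ for orthonormal $H$) instead of un-rotating each stored key. All function names are hypothetical.

```python
import math


def rotate(x, b):
    """Orthonormal block-diagonal Hadamard rotation (block size b, a power of two)."""
    out = list(x)
    for start in range(0, len(out), b):
        h = 1
        while h < b:
            for i in range(start, start + b, 2 * h):
                for j in range(i, i + h):
                    u, v = out[j], out[j + h]
                    out[j], out[j + h] = u + v, u - v
            h *= 2
    norm = 1.0 / math.sqrt(b)
    return [v * norm for v in out]


def write_token(x, b):
    """Write path: rotate, quantize, and pack one head vector into len(x) // 2 bytes."""
    r = rotate(x, b)
    lo, hi = min(r), max(r)
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in r]
    # Assumed layout: even index in the low nibble, odd index in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def attention_score(query, packed, scale, zero_point, b):
    """Read path: dot product with a stored key, dequantizing inside the loop."""
    q_rot = rotate(query, b)               # rotate the query once per decode step
    score = 0.0
    for i, byte in enumerate(packed):
        score += q_rot[2 * i] * (scale * (byte & 0x0F) + zero_point)
        score += q_rot[2 * i + 1] * (scale * (byte >> 4) + zero_point)
    return score


d_h, b = 16, 8                              # toy sizes for illustration
key = [math.sin(i) for i in range(d_h)]
query = [math.cos(0.5 * i) for i in range(d_h)]

packed, scale, zero_point = write_token(key, b)
exact = sum(q * k for q, k in zip(query, key))
approx = attention_score(query, packed, scale, zero_point, b)
print(f"{len(packed)} bytes stored, exact={exact:.4f}, approx={approx:.4f}")
```

No rotated or dequantized copy of the key is ever stored; the only persistent data are the packed bytes and the per-token scale and zero-point, which is the property a fused kernel exploits to avoid an extra memory pass.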

Memory Layout and End-to-End Throughput

Token-wise INT4 maintains the same token-major, page-aligned buffer structure as BF16 but with roughly 4× smaller pages, so it introduces no fragmentation and requires no page-table modification. Kernel profiling for a Qwen3-32B decode step (2×H100, batch size 32):

| Kernel | Runtime (µs) |
|---|---|
| Plain INT4 total decode | 533 |
| INT4 + fused rotate | 530 |
| INT4 + unfused rotate (extra) | 541 |

End-to-end throughput (2×H100, 32 concurrent requests, ctx=8192, gen=1024):

| Method | TPS/GPU | Acc (4B) | Acc (8B) |
|---|---|---|---|
| BF16 | 1,030 | 75.64 | 70.84 |
| INT4 | 1,217 | 0 | 0 |
| BDR-INT4 | 1,242 | 73.78 | 69.86 |

Fused rotation achieves throughput parity with naïve INT4, with accuracy approaching that of non-quantized BF16, and substantially exceeds BF16 throughput (Jia et al., 21 Apr 2026).
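The relative figures behind these statements follow directly from the reported numbers. The short calculation below copies the values from the kernel-profiling and end-to-end results above and derives the unfused-rotation penalty, the throughput gain over BF16, and the fraction of BF16 accuracy retained; the variable names are illustrative.

```python
# Values copied from the kernel-profiling and end-to-end results reported above.
decode_us = {"plain": 533, "fused": 530, "unfused": 541}
tps = {"BF16": 1030, "INT4": 1217, "BDR-INT4": 1242}
acc_4b = {"BF16": 75.64, "BDR-INT4": 73.78}
acc_8b = {"BF16": 70.84, "BDR-INT4": 69.86}

unfused_penalty = (decode_us["unfused"] - decode_us["plain"]) / decode_us["plain"]
speedup_vs_bf16 = tps["BDR-INT4"] / tps["BF16"] - 1
retained_4b = acc_4b["BDR-INT4"] / acc_4b["BF16"]
retained_8b = acc_8b["BDR-INT4"] / acc_8b["BF16"]

print(f"unfused rotation penalty : {unfused_penalty:.1%}")
print(f"BDR-INT4 vs BF16 TPS     : +{speedup_vs_bf16:.1%}")
print(f"accuracy retained (4B)   : {retained_4b:.1%}")
print(f"accuracy retained (8B)   : {retained_8b:.1%}")
```

The unfused penalty of about 1.5% falls inside the 1–3% range quoted for extra memory passes, and the retained-accuracy ratios in this table are both above 97%.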

5. Empirical Results Across Architectures and Workloads

Evaluations on Qwen3-4B, Qwen3-8B, Qwen3-32B, and GLM-4.7 (358B) across GPQA, HumanEval, LiveCodeBench, AIME25, and MATH500 show that the BDR-INT4 approach generalizes across model sizes and tasks:

  • With BDR-128, accuracy drop vs. BF16 is –2.5 points (4B), –0.9 points (8B), and negligible on GLM-4.7.
  • Hessian+BDR-128 results in a –10 point drop (4B).
  • KMeans (C=256) achieves a –4 point drop (4B), remaining inferior to BDR.

Throughput scaling experiments indicate BDR-INT4 tracks or marginally exceeds plain INT4 throughput over varying batch sizes and concurrency, outperforming BF16 by 10–40% at high concurrency. Under memory pressure (long context, high concurrency), "system TPS" reflects a tangible 20–40% advantage for (BDR-)INT4 over BF16, even as per-request TPS may give misleading impressions due to buffer limitations (Jia et al., 21 Apr 2026).

6. Practical Guidelines and Limitations

Token-wise quantization is currently the only viable low-bit scheme compatible with serving constraints such as paged memory layouts and fused kernel execution. A block-diagonal Hadamard rotation (block size ≈ 64–128) before asymmetric INT4 quantization effectively removes outlier-induced quantization errors, recovering over 97% of BF16 accuracy for fragile models. Implementation requires a single fused CUDA/Triton kernel encompassing rotation, quantization, and memory write, preserving throughput without measurable overhead.

Complex schemes—vector quantization, Hessian-aware methods—introduce additional calibration, irregular memory access, or complicate kernel fusion, while offering minimal incremental benefit once block-diagonal rotation is applied. Consequently, the system-aware methodology—token-wise asymmetric INT4 quantization with block-diagonal Hadamard rotation, sometimes referred to as "SAW-INT4"—constitutes the optimal trade-off in accuracy, efficiency, and deployability for production LLM serving (Jia et al., 21 Apr 2026).
