IsoQuant: 4D Quaternion Orthogonalizer
- IsoQuant is a hardware-aligned blockwise rotation framework leveraging quaternion algebra and isoclinic rotations to decorrelate key-value vectors in LLMs.
- It partitions feature vectors into 4D blocks and applies closed-form SO(4) rotations via quaternion sandwich maps, optimizing SIMD and CUDA implementations.
- Benchmarks show IsoQuant achieves up to 6x kernel speedups with reduced arithmetic and memory overhead while matching or improving reconstruction MSE compared to 3D methods.
IsoQuant is a hardware-aligned blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of the special orthogonal group $SO(4)$. It is designed as a compute- and memory-efficient stage-1 orthogonalizer for low-bit vector quantization in LLM key-value (KV) cache compression. Two costs motivate it: the prohibitive arithmetic and storage of dense random orthogonal transforms, and the poor hardware alignment of 3D blockwise Clifford rotor methods such as RotorQuant. IsoQuant partitions the feature vector into 4D blocks and rotates each block with a closed-form quaternionic "sandwich" map, decorrelating the data while remaining strictly hardware-compatible and delivering kernel-level speedups (Ji, 30 Mar 2026).
1. Theoretical Foundations: Isoclinic Rotations and Quaternions
Orthogonal transforms are essential for decorrelating KV vectors before scalar quantization in low-bit LLM inference pipelines. Dense random orthogonal mappings, as used in TurboQuant, offer robust decorrelation but incur $O(d^2)$ arithmetic and storage, which conflicts with autoregressive decoding constraints. RotorQuant reduces this to $O(d)$ cost with blockwise 3D Clifford rotors, but its 3D partitioning tiles the feature dimension incompletely and offers only three local mixing degrees of freedom per block.
IsoQuant adopts a blockwise 4D structure, leveraging the isoclinic (double SU(2)) decomposition of $SO(4)$, represented via quaternion algebra. Every block $v = (v_0, v_1, v_2, v_3) \in \mathbb{R}^4$ is mapped to a quaternion $v_0 + v_1 i + v_2 j + v_3 k$, and each rotation is parameterized by a pair of unit quaternions $(q_L, q_R)$. Crucially, this provides six real degrees of rotational freedom within each block, maximizing feature spreading, and 4D blocks tile the power-of-two head/channel sizes ubiquitous in LLM architectures without remainder.
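The degree-of-freedom count can be checked with a standard fact about rotation groups (general background, not a result specific to the source):

$$
SO(4) \;\cong\; \bigl(SU(2) \times SU(2)\bigr) / \{\pm(1, 1)\}, \qquad \dim SO(4) = \binom{4}{2} = 6 = 3 + 3 .
$$

Each unit quaternion contributes three free parameters, so a pair contributes six, whereas a rotation of a 3D block has only $\dim SO(3) = 3$.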
2. Closed-Form SO(4) Transform via Quaternion Algebra
The core IsoQuant transform for a 4D block $v \in \mathbb{R}^4$, identified with the quaternion $v_0 + v_1 i + v_2 j + v_3 k$, is

$$T(v) = q_L \, v \, \overline{q_R},$$

where $q_L$ and $q_R$ are unit quaternions and $\overline{q_R}$ is the conjugate of $q_R$. This operation is an orthogonal map on $\mathbb{R}^4$ (i.e., in $SO(4)$), with inverse

$$T^{-1}(y) = \overline{q_L} \, y \, q_R .$$

This covers every rotation in $SO(4)$ (modulo the $\pm$ double-cover ambiguity: $(q_L, q_R)$ and $(-q_L, -q_R)$ give the same map), enabling maximal local mixing of block features. Instead of constructing blockwise $4 \times 4$ matrices, the transform is evaluated as two sequential quaternion multiplications, which lends itself to optimized SIMD and CUDA kernels.
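The sandwich map and its inverse can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's kernel: the `(w, x, y, z)` tuple layout and the names `qmul`, `qconj`, `rotate`, and `unrotate` are assumptions made for this example.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    """Quaternion conjugate: negate the three imaginary components."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def qnormalize(q):
    """Scale a nonzero 4-vector to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def rotate(v, q_l, q_r):
    """Forward map: v -> q_l * v * conj(q_r)."""
    return qmul(qmul(q_l, v), qconj(q_r))

def unrotate(y, q_l, q_r):
    """Inverse map: y -> conj(q_l) * y * q_r (valid for unit q_l, q_r)."""
    return qmul(qmul(qconj(q_l), y), q_r)

# Illustrative values only; any nonzero 4-vectors work as raw parameters.
q_l = qnormalize((1.0, 2.0, -1.0, 0.5))
q_r = qnormalize((0.3, -0.7, 0.2, 1.1))
v = (0.5, -1.5, 2.0, 0.25)

y = rotate(v, q_l, q_r)
v_back = unrotate(y, q_l, q_r)
```

Because both quaternions have unit norm, `y` has the same Euclidean length as `v`, and `v_back` recovers `v` up to floating-point error. Negating both `q_l` and `q_r` leaves `y` unchanged, which is the double-cover ambiguity in concrete form.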
3. IsoQuant Algorithmic Variants
IsoQuant comprises several blockwise transform variants, trading off expressivity and compute:
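- IsoQuant-Full realizes the complete $SO(4)$ rotation per 4D block. Each block maintains two parameter vectors $p_L, p_R \in \mathbb{R}^4$, normalized to unit quaternions $q_L = p_L / \|p_L\|$ and $q_R = p_R / \|p_R\|$. The forward pass on a block $v$ is $y = q_L \, v \, \overline{q_R}$, followed by quantization $\hat{y} = Q_b(y)$ ($b$ bits per coordinate), with recovery by the inverse map $\hat{v} = \overline{q_L} \, \hat{y} \, q_R$.
- IsoQuant-Fast restricts each block to a single isoclinic factor ($q_L$ only). The transform reduces to $y = q_L \, v$, a left-isoclinic ($SU(2)$ subgroup) mapping, halving parameter count and arithmetic at a modest cost in local mixing.
- IsoQuant-2D serves as a minimal baseline, partitioning into 2D blocks $(x_{2i}, x_{2i+1})$, each subject to a planar rotation parameterized by an angle $\theta_i$: $(x_{2i}, x_{2i+1}) \mapsto (x_{2i} \cos\theta_i - x_{2i+1} \sin\theta_i,\; x_{2i} \sin\theta_i + x_{2i+1} \cos\theta_i)$.
A summary table of arithmetic and parameter counts at $d = 128$ is given below: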
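The three variants can be sketched as rotate, quantize, inverse-rotate round trips on a single block. This is a hedged illustration: the uniform quantizer `quantize` (clamping to `[-1, 1]`) is a hypothetical stand-in, since the source does not specify the scalar quantizer, and all function names are invented for this example.

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def unit(p):
    """Normalize a raw parameter vector to a unit quaternion."""
    n = math.sqrt(sum(c * c for c in p))
    return tuple(c / n for c in p)

def quantize(x, bits, lo=-1.0, hi=1.0):
    """Hypothetical uniform quantizer: clamp to [lo, hi], snap to 2**bits levels."""
    step = (hi - lo) / ((1 << bits) - 1)
    return tuple(lo + round((min(max(c, lo), hi) - lo) / step) * step for c in x)

def full_roundtrip(v, q_l, q_r, bits):
    """Full variant: y = q_l v conj(q_r), quantize, then invert."""
    y_hat = quantize(qmul(qmul(q_l, v), qconj(q_r)), bits)
    return qmul(qmul(qconj(q_l), y_hat), q_r)

def fast_roundtrip(v, q_l, bits):
    """Fast variant: y = q_l v (left-isoclinic factor only)."""
    y_hat = quantize(qmul(q_l, v), bits)
    return qmul(qconj(q_l), y_hat)

def planar_roundtrip(pair, theta, bits):
    """2D variant: rotate a pair by theta, quantize, rotate back by -theta."""
    c, s = math.cos(theta), math.sin(theta)
    x0, x1 = pair
    y0, y1 = quantize((c * x0 - s * x1, s * x0 + c * x1), bits)
    return (c * y0 + s * y1, -s * y0 + c * y1)

bits = 4
step = 2.0 / ((1 << bits) - 1)
q_l = unit((1.0, 2.0, -1.0, 0.5))
q_r = unit((0.3, -0.7, 0.2, 1.1))
v = (0.3, -0.2, 0.5, 0.1)   # norm < 1, so no coordinate is clamped
pair = (0.4, -0.3)

v_full = full_roundtrip(v, q_l, q_r, bits)
v_fast = fast_roundtrip(v, q_l, bits)
pair_back = planar_roundtrip(pair, 0.7, bits)
```

Since every rotation here is orthogonal, the reconstruction error equals the quantization error in the rotated frame: at most `step / 2` per coordinate when nothing is clamped.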
| Variant | FMAs (d=128) | Parameter Count (d=128) |
|---|---|---|
| IsoQuant-Full | 1,024 | 256 |
| IsoQuant-Fast | 512 | 128 |
| IsoQuant-2D | 256 | 128 |
| RotorQuant (3D) | 2,408 | 172 |
| TurboQuant (Dense) | 16,384 | 16,384 |
IsoQuant avoids the 3D block "remainder"/tail and suboptimal register packing that affect RotorQuant, with all variants retaining $O(d)$ scaling but substantially reduced constants.
4. Computational Complexity and CUDA Kernel Performance
The quaternionic kernel structure of IsoQuant is conducive to register-efficient, fused CUDA implementation. Each quaternion multiply comprises 16 scalar multiplies and 12 adds, counted here as 16 fused multiply-adds (FMAs). For $d = 128$ (i.e., 32 blocks):
- IsoQuant-Full: $32 \times 2 \times 16 = 1{,}024$ FMAs per forward pass
- IsoQuant-Fast: $32 \times 16 = 512$ FMAs
- IsoQuant-2D: $256$ FMAs
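The counts above follow from simple per-block arithmetic, sketched below. The 16 FMAs per quaternion multiply come from the text; the per-block costs for the 2D variant (4 FMAs and 2 stored values per pair) are assumptions chosen because they reproduce the tabulated totals, and the function names are illustrative.

```python
FMAS_PER_QUAT_MUL = 16  # figure stated in the text

def isoquant_full(d):
    """Two quaternion multiplies and two 4-component quaternions per 4D block."""
    blocks = d // 4
    return {"fmas": blocks * 2 * FMAS_PER_QUAT_MUL, "params": blocks * 8}

def isoquant_fast(d):
    """One quaternion multiply and one quaternion per 4D block."""
    blocks = d // 4
    return {"fmas": blocks * FMAS_PER_QUAT_MUL, "params": blocks * 4}

def isoquant_2d(d):
    """Assumption: a 2x2 rotation costs 4 FMAs and stores 2 values per 2D block."""
    blocks = d // 2
    return {"fmas": blocks * 4, "params": blocks * 2}

def dense(d):
    """Dense d x d orthogonal matrix: d*d FMAs and d*d stored entries."""
    return {"fmas": d * d, "params": d * d}

d = 128
counts = {
    "IsoQuant-Full": isoquant_full(d),
    "IsoQuant-Fast": isoquant_fast(d),
    "IsoQuant-2D": isoquant_2d(d),
    "Dense": dense(d),
}
for name, c in counts.items():
    print(f"{name:14s} fmas={c['fmas']:6d} params={c['params']:6d}")
```

The blockwise variants grow linearly in `d`, while the dense transform grows quadratically, so the gap widens at larger head dimensions.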
Benchmarked on RTX 4090 hardware in batch-8192 scenarios across multiple head dimensions and bit widths, in both fp16 and fp32, IsoQuant demonstrates kernel-level speedups over RotorQuant on average, peaking at about 6x in select configurations.
5. Quantization Fidelity and Reconstruction MSE
In all tested synthetic-vector settings, the reconstruction MSE of IsoQuant-Full, IsoQuant-Fast, and IsoQuant-2D matches or marginally improves on that of RotorQuant. Replacing 3D rotors with 4D isoclinic rotations causes no observed loss of quantization fidelity, which supports the decorrelation efficacy of the quaternionic blockwise approach.
6. Hardware Alignment, Memory, and Systems Integration
IsoQuant's strict 4D block structure is inherently aligned with power-of-two head dimensions ($d = 64, 128, 256, \ldots$), avoiding the residual tail of 3D blocking (two leftover coordinates at $d = 128$) and its related inefficiencies. Each quaternion block maps directly onto a float4/int32 SIMD lane, which simplifies vectorized loads, stores, and register management in both CPU SIMD and GPU warp execution contexts. Fusing all blockwise operations (rotation, quantization, inverse transform) within registers minimizes memory traffic, further accelerating kernel throughput.
Parameter storage stays small thanks to the 4D block size and quaternion parameterization. For $d = 128$: IsoQuant-Full (256 parameters), IsoQuant-Fast (128), IsoQuant-2D (128), RotorQuant (172), TurboQuant (16,384). IsoQuant-Full stores more parameters than RotorQuant, while IsoQuant-Fast and IsoQuant-2D store fewer; all IsoQuant variants are 64x to 128x smaller than dense TurboQuant. This suits the constrained memory budgets of on-device LLM inference.
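The tiling argument is easy to verify directly. The sketch below uses a hypothetical helper, `split_blocks`, to partition a vector into fixed-size blocks and report any leftover tail.

```python
def split_blocks(x, block):
    """Split x into consecutive blocks of size `block`; return (blocks, tail)."""
    full = len(x) - len(x) % block
    blocks = [tuple(x[i:i + block]) for i in range(0, full, block)]
    tail = tuple(x[full:])
    return blocks, tail

# Tail length left over by 3D and 4D blocking at common head dimensions.
tails = {}
for d in (64, 128, 256):
    x = list(range(d))
    _, tail3 = split_blocks(x, 3)
    _, tail4 = split_blocks(x, 4)
    tails[d] = (len(tail3), len(tail4))
    print(f"d={d}: 3D tail={len(tail3)}, 4D tail={len(tail4)}")

blocks4, tail4 = split_blocks(list(range(128)), 4)
blocks3, tail3 = split_blocks(list(range(128)), 3)
```

No power of two is divisible by 3, so 3D blocking always leaves one or two coordinates over, while 4D blocking leaves none for any dimension that is a multiple of 4.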
IsoQuant slots cleanly into the stage-1 decorrelation step of LLM KV-cache compression pipelines and remains compatible with stage-2 schemes (e.g., QJL residual correction). By reducing the overhead of the orthogonalization stage, it facilitates more efficient inference as LLM context lengths scale.
7. Summary and Outlook
IsoQuant replaces blockwise 3D Clifford rotors with 4D isoclinic rotations parameterized by quaternion pairs, providing a mathematically concise and hardware-optimized alternative for stage-1 vector quantization transforms in KV caches of LLMs. The framework encompasses full $SO(4)$ rotations (IsoQuant-Full), a reduced-cost $SU(2)$ slice (IsoQuant-Fast), and a lightweight 2D fallback (IsoQuant-2D), each offering linear arithmetic cost with substantially reduced per-block constants relative to RotorQuant. Kernel-level benchmarks demonstrate speedups of up to about 6x over RotorQuant with no loss in reconstruction accuracy in synthetic settings. Full end-to-end KV-cache evaluation is identified as ongoing work, but current evidence supports IsoQuant as a practical and theoretically grounded component for scalable, low-bit LLM inference (Ji, 30 Mar 2026).