
IsoQuant: 4D Quaternion Orthogonalizer

Updated 28 May 2026
  • IsoQuant is a hardware-aligned blockwise rotation framework leveraging quaternion algebra and isoclinic rotations to decorrelate key-value vectors in LLMs.
  • It partitions feature vectors into 4D blocks and applies closed-form SO(4) rotations via quaternion sandwich maps, optimizing SIMD and CUDA implementations.
  • Benchmarks show IsoQuant achieves up to 6x kernel speedups with reduced arithmetic and memory overhead while matching or improving reconstruction MSE compared to 3D methods.

IsoQuant is a hardware-aligned blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of the special orthogonal group $SO(4)$, designed as a computationally and memory-efficient stage-1 orthogonalizer for low-bit vector quantization in LLM key-value (KV) cache compression. It is motivated by two problems: the prohibitive $O(d^2)$ arithmetic and storage costs of dense random orthogonal transforms, and the suboptimal hardware alignment of 3D blockwise Clifford rotor methods such as RotorQuant. IsoQuant partitions the feature vector into 4D blocks, applies $SO(4)$ rotations via closed-form quaternionic "sandwich" maps, and achieves efficient data decorrelation with strict hardware compatibility and kernel-level speedups (Ji, 30 Mar 2026).

1. Theoretical Foundations: Isoclinic Rotations and Quaternions

Orthogonal transforms are essential for decorrelating KV vectors before scalar quantization in low-bit LLM inference pipelines. Dense $d \times d$ random orthogonal mappings, as used in TurboQuant, offer robust decorrelation but incur $O(d^2)$ arithmetic and storage, conflicting with autoregressive decoding constraints. RotorQuant reduces the cost to $O(d)$ with blockwise 3D Clifford rotors, but its 3D partitioning leads to incomplete tiling and only three local mixing degrees of freedom.

IsoQuant adopts a blockwise 4D structure, leveraging the isoclinic (double $SU(2)$) decomposition of $SO(4)$, represented via quaternionic algebra. Every $v \in \mathbb{R}^4$ is mapped to a quaternion $v = x_0 + x_1 i + x_2 j + x_3 k$, and each $SO(4)$ rotation is parameterized by a pair of unit quaternions $(q_L, q_R)$. Crucially, this provides six real degrees of rotational freedom within each block (three per unit quaternion), maximizing feature spreading and aligning exactly with the power-of-two head/channel sizes ubiquitous in LLM architectures.
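As a minimal sketch of the two isoclinic factors, the snippet below treats a 4D block as a quaternion and checks that left and right multiplication by unit quaternions each preserve the Euclidean norm and commute with one another. The helper names (`qmul`, `unit`, `norm`), the `(w, x, y, z)` tuple layout, and the sample values are illustrative assumptions, not taken from the IsoQuant paper.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

def norm(q):
    """Euclidean norm of the 4 components."""
    return math.sqrt(sum(c * c for c in q))

def unit(q):
    """Scale a nonzero quaternion to unit length."""
    n = norm(q)
    return tuple(c / n for c in q)

# Hypothetical sample values, chosen only for illustration.
q_left = unit((1.0, 2.0, -0.5, 0.25))
q_right = unit((0.3, -0.7, 1.1, 0.9))
v = (0.3, -1.2, 0.7, 2.0)  # one 4D block, read as x0 + x1 i + x2 j + x3 k

left_then_right = qmul(qmul(q_left, v), q_right)
right_then_left = qmul(q_left, qmul(v, q_right))

print("norm before:", norm(v))
print("norm after left factor:", norm(qmul(q_left, v)))
print("norm after right factor:", norm(qmul(v, q_right)))
```

Because quaternion multiplication is associative, the order in which the left and right factors are applied does not matter, which is what lets the pair act as two independent three-parameter rotations.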

2. Closed-Form $SO(4)$ Transform via Quaternion Algebra

The core IsoQuant transform for a 4D block $v$ is:

$$T(v) = q_L \, v \, \bar{q}_R$$

where $q_L$ and $q_R$ are unit quaternions and $\bar{q}_R$ is the conjugate of $q_R$. This operation is an orthogonal map on $\mathbb{R}^4$ (i.e., in $SO(4)$), with inverse

$$v = \bar{q}_L \, T(v) \, q_R$$

This encapsulates every rotation in $SO(4)$ (modulo the sign ambiguity of the double cover, since negating both unit quaternions yields the same rotation), enabling maximal local mixing of block features. Unlike constructing blockwise $4 \times 4$ matrices, this formulation is amenable to efficient implementation as sequential quaternion multiplications, facilitating optimized SIMD and CUDA kernels.
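The sandwich map and its inverse can be sketched in a few lines. The example below applies the map to one block, undoes it, and also assembles the equivalent $4 \times 4$ matrix column by column to confirm numerically that it is orthogonal. Function names (`sandwich`, `sandwich_inverse`) and the sample quaternions are assumptions made for illustration, and the left/conjugated-right ordering follows the usual sandwich convention rather than a confirmed detail of the reference implementation.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

def qconj(q):
    """Quaternion conjugate: negate the imaginary part."""
    return (q[0], -q[1], -q[2], -q[3])

def unit(q):
    """Scale a nonzero quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def sandwich(q_left, v, q_right):
    """Forward map: q_left * v * conj(q_right)."""
    return qmul(qmul(q_left, v), qconj(q_right))

def sandwich_inverse(q_left, y, q_right):
    """Inverse map: conj(q_left) * y * q_right (valid for unit quaternions)."""
    return qmul(qmul(qconj(q_left), y), q_right)

# Hypothetical sample values, chosen only for illustration.
q_left = unit((0.9, 0.1, -0.3, 0.2))
q_right = unit((0.2, 0.8, 0.4, -0.1))
v = (1.0, -2.0, 0.5, 3.0)

y = sandwich(q_left, v, q_right)
v_back = sandwich_inverse(q_left, y, q_right)

# Columns of the equivalent 4x4 matrix are the images of the basis vectors.
basis = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0),
         (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0)]
cols = [sandwich(q_left, e, q_right) for e in basis]
# Gram matrix of the columns; it should be close to the identity.
gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

print("rotated block:", y)
print("recovered block:", v_back)
```

In a fused kernel the matrix would never be materialized; it is built here only to check the orthogonality claim.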

3. IsoQuant Algorithmic Variants

IsoQuant comprises several blockwise transform variants, trading off expressivity and compute:

  • IsoQuant-Full realizes the complete $SO(4)$ rotation per 4D block. Each block maintains two parameter vectors in $\mathbb{R}^4$, normalized to unit quaternions $q_L, q_R$. The forward pass on a block $v$ is $T(v) = q_L \, v \, \bar{q}_R$, followed by scalar quantization ($b$ bits/coordinate), with recovery by the inverse map.
  • IsoQuant-Fast restricts each block to a single isoclinic factor ($q_L$ only). The transform reduces to $T(v) = q_L \, v$, a left-isoclinic ($SU(2)$ subgroup) mapping, halving parameter count and arithmetic at a modest cost in local mixing.
  • IsoQuant-2D serves as a minimal baseline, partitioning into 2D blocks $(x_1, x_2)$, each subject to a planar rotation parameterized by an angle $\theta$: $(x_1, x_2) \mapsto (x_1 \cos\theta - x_2 \sin\theta,\ x_1 \sin\theta + x_2 \cos\theta)$.
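The three variants above can be sketched as blockwise loops over a $d$-dimensional vector. This is an illustrative pure-Python sketch, not the fused kernel: the function names (`rotate_full`, `rotate_fast`, `rotate_2d`), the random parameter initialization, and the fixed seed are all assumptions.

```python
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

def qconj(q):
    return (q[0], -q[1], -q[2], -q[3])

def unit(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def rotate_full(x, q_left, q_right):
    """Full variant: q_L * v * conj(q_R) on every 4D block."""
    out = []
    for i in range(0, len(x), 4):
        b = i // 4
        out.extend(qmul(qmul(q_left[b], tuple(x[i:i + 4])), qconj(q_right[b])))
    return out

def rotate_fast(x, q_left):
    """Fast variant: left-isoclinic factor only, q_L * v."""
    out = []
    for i in range(0, len(x), 4):
        out.extend(qmul(q_left[i // 4], tuple(x[i:i + 4])))
    return out

def rotate_2d(x, thetas):
    """2D variant: one planar rotation per pair of coordinates."""
    out = []
    for i in range(0, len(x), 2):
        c, s = math.cos(thetas[i // 2]), math.sin(thetas[i // 2])
        out.extend((c * x[i] - s * x[i + 1], s * x[i] + c * x[i + 1]))
    return out

def l2(v):
    return math.sqrt(sum(c * c for c in v))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
d = 128
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
q_left = [unit(tuple(rng.gauss(0.0, 1.0) for _ in range(4))) for _ in range(d // 4)]
q_right = [unit(tuple(rng.gauss(0.0, 1.0) for _ in range(4))) for _ in range(d // 4)]
thetas = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d // 2)]

y_full = rotate_full(x, q_left, q_right)
y_fast = rotate_fast(x, q_left)
y_2d = rotate_2d(x, thetas)
print(l2(x), l2(y_full), l2(y_fast), l2(y_2d))
```

All three outputs have the same Euclidean norm as the input, as expected of orthogonal transforms; the variants differ only in how much mixing happens inside each block.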

A summary table of arithmetic and parameter counts at $d = 128$ is given below:

Variant               FMAs (d=128)    Parameter Count (d=128)
IsoQuant-Full         1,024           256
IsoQuant-Fast         512             128
IsoQuant-2D           256             128
RotorQuant (3D)       2,408           172
TurboQuant (Dense)    16,384          16,384

IsoQuant avoids the 3D block "remainder"/tail and suboptimal register packing that affect RotorQuant, with all variants retaining $O(d)$ scaling but substantially reduced constants.

4. Computational Complexity and CUDA Kernel Performance

The quaternionic kernel structure of IsoQuant is conducive to register-efficient, fused CUDA implementation. Each quaternion multiply comprises 16 scalar multiplies and 12 adds, counted as 16 fused multiply-adds (FMAs). For $d = 128$ (i.e., 32 blocks of four coordinates, or 64 blocks of two for IsoQuant-2D):

  • IsoQuant-Full: $32 \times 2 \times 16 = 1{,}024$ FMAs per forward pass
  • IsoQuant-Fast: $32 \times 16 = 512$ FMAs
  • IsoQuant-2D: $64 \times 4 = 256$ FMAs
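The per-variant counts follow directly from the block counts and the 16-FMA cost of a quaternion multiply. The helper below is a small sketch of that arithmetic for any $d$ divisible by 4; the function name `fma_counts` is hypothetical, and RotorQuant is omitted because the source gives only its total (2,408 at $d = 128$), not a formula.

```python
QUAT_MUL_FMAS = 16   # one quaternion multiply, counted as 16 FMAs
PLANAR_ROT_FMAS = 4  # one 2x2 rotation applied to a pair of coordinates

def fma_counts(d):
    """Forward-pass FMA counts per variant for a d-dimensional vector."""
    if d % 4 != 0:
        raise ValueError("d must be divisible by 4 for 4D blocking")
    blocks_4d = d // 4
    blocks_2d = d // 2
    return {
        "IsoQuant-Full": blocks_4d * 2 * QUAT_MUL_FMAS,  # two multiplies per block
        "IsoQuant-Fast": blocks_4d * QUAT_MUL_FMAS,      # one multiply per block
        "IsoQuant-2D": blocks_2d * PLANAR_ROT_FMAS,
        "Dense d x d": d * d,                            # full matrix-vector product
    }

for dim in (64, 128, 256):
    print(dim, fma_counts(dim))
```

The blockwise variants grow linearly in $d$, while the dense baseline grows quadratically.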

Benchmarked on RTX 4090 hardware in batch-8192 scenarios across multiple vector dimensions $d$, multiple bit widths, and fp16/fp32 dtypes, IsoQuant demonstrates kernel-level speedups over RotorQuant on average, with observed peaks of up to $6\times$ in select configurations.

5. Quantization Fidelity and Reconstruction MSE

In all tested synthetic-vector settings, the reconstruction MSE of IsoQuant-Full, IsoQuant-Fast, and IsoQuant-2D matches or marginally improves on that of RotorQuant. No loss of quantization fidelity is observed when replacing 3D rotors with 4D isoclinic rotations, supporting the decorrelation efficacy of the quaternionic blockwise approach.
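To make the reconstruction-MSE measurement concrete, the sketch below rotates a synthetic Gaussian vector blockwise, quantizes each coordinate, applies the inverse rotation, and reports the mean squared error. The quantizer is an assumption: a plain uniform scalar quantizer over $[-c, c]$ with $c$ the largest absolute coordinate. The source does not specify the quantizer or the synthetic data distribution, so the numbers this produces are not comparable to the reported benchmarks.

```python
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

def qconj(q):
    return (q[0], -q[1], -q[2], -q[3])

def unit(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def forward(x, q_left, q_right):
    """Blockwise q_L * v * conj(q_R)."""
    out = []
    for i in range(0, len(x), 4):
        b = i // 4
        out.extend(qmul(qmul(q_left[b], tuple(x[i:i + 4])), qconj(q_right[b])))
    return out

def inverse(y, q_left, q_right):
    """Blockwise conj(q_L) * y * q_R."""
    out = []
    for i in range(0, len(y), 4):
        b = i // 4
        out.extend(qmul(qmul(qconj(q_left[b]), tuple(y[i:i + 4])), q_right[b]))
    return out

def quantize_dequantize(y, bits):
    """Assumed uniform quantizer: 2**bits levels spread evenly over [-c, c]."""
    levels = 2 ** bits
    c = max(abs(t) for t in y)
    step = 2.0 * c / (levels - 1)
    return [round((t + c) / step) * step - c for t in y]

def roundtrip_mse(x, q_left, q_right, bits):
    y_hat = quantize_dequantize(forward(x, q_left, q_right), bits)
    x_hat = inverse(y_hat, q_left, q_right)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
d = 128
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
q_left = [unit(tuple(rng.gauss(0.0, 1.0) for _ in range(4))) for _ in range(d // 4)]
q_right = [unit(tuple(rng.gauss(0.0, 1.0) for _ in range(4))) for _ in range(d // 4)]

for bits in (2, 3, 4):
    print(bits, "bits -> MSE", roundtrip_mse(x, q_left, q_right, bits))
```

Because the rotation is orthogonal, the inverse map does not amplify quantization error: the MSE measured on the recovered vector equals the MSE introduced in the rotated domain.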

6. Hardware Alignment, Memory, and Systems Integration

IsoQuant's strict 4D block structure is inherently aligned with power-of-two head dimensions ($d = 64, 128, 256, \ldots$), avoiding the residual 2D tail and related inefficiencies of 3D blocking. Each float4/int32 SIMD lane maps directly to a quaternion block, optimizing vectorized loads, stores, and register management in both CPU SIMD and GPU warp execution contexts. Fusing all blockwise operations (rotations, quantization, inverse transform) within registers minimizes memory traffic, further accelerating kernel throughput.
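The tiling argument is simple integer arithmetic, sketched below with a hypothetical helper `tiling`. A power of two is never divisible by 3, so 3D blocking always leaves leftover coordinates, whereas any $d \geq 4$ that is a power of two splits evenly into 4D blocks.

```python
def tiling(d, block):
    """Return (number of full blocks, leftover coordinates) for a block size."""
    return divmod(d, block)

for d in (64, 128, 256):
    full_3d, tail_3d = tiling(d, 3)
    full_4d, tail_4d = tiling(d, 4)
    print(f"d={d}: 3D -> {full_3d} blocks + {tail_3d} leftover, "
          f"4D -> {full_4d} blocks + {tail_4d} leftover")
```

At $d = 128$ this gives 42 full 3D blocks with 2 leftover coordinates, versus exactly 32 4D blocks.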

Parameter storage is reduced in IsoQuant due to block size and quaternion parameterization. For $d = 128$: IsoQuant-Full (256 parameters), IsoQuant-Fast (128), IsoQuant-2D (128), RotorQuant (172), TurboQuant (16,384). This aligns with constrained memory budgets in on-device LLM inference.

IsoQuant slots cleanly into the stage-1 decorrelation step of LLM KV-cache compression pipelines and remains compatible with stage-2 schemes (e.g., QJL residual correction). By reducing the overhead of the orthogonalization stage, it facilitates more efficient inference as LLM context lengths scale.

7. Summary and Outlook

IsoQuant replaces blockwise 3D Clifford rotors with 4D isoclinic rotations parameterized by quaternion pairs, providing a mathematically concise and hardware-optimized alternative for stage-1 vector quantization transforms in KV caches of LLMs. The framework encompasses full $SO(4)$ rotations (IsoQuant-Full), a reduced-cost left-isoclinic slice (IsoQuant-Fast), and a lightweight 2D fallback, each offering linear arithmetic cost with substantially reduced per-block constants relative to RotorQuant. Kernel-level benchmarks demonstrate speedups of up to $6\times$ over RotorQuant with no loss in reconstruction accuracy in synthetic settings. Full end-to-end KV-cache evaluation is identified as ongoing work, but current evidence supports IsoQuant as a practical and theoretically grounded component for scalable, low-bit LLM inference (Ji, 30 Mar 2026).

References (1)
