IsoQuant: 4D Quaternion Orthogonalizer

Updated 28 May 2026

IsoQuant is a hardware-aligned blockwise rotation framework leveraging quaternion algebra and isoclinic rotations to decorrelate key-value vectors in LLMs.
It partitions feature vectors into 4D blocks and applies closed-form SO(4) rotations via quaternion sandwich maps, optimizing SIMD and CUDA implementations.
Benchmarks show IsoQuant achieves up to 6x kernel speedups with reduced arithmetic and memory overhead while matching or improving reconstruction MSE compared to 3D methods.

IsoQuant is a hardware-aligned blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of the special orthogonal group $SO(4)$, designed as a computationally and memory-efficient stage-1 orthogonalizer for low-bit vector quantization in LLM key-value (KV) cache compression. Dense random orthogonal transforms incur prohibitive $O(d^2)$ arithmetic and storage costs, and 3D blockwise Clifford rotor methods such as RotorQuant align poorly with hardware. IsoQuant instead partitions the feature vector space into 4D blocks, applies $SO(4)$ rotations via closed-form quaternionic "sandwich" maps, and achieves efficient data decorrelation with strict hardware compatibility and kernel-level speedups (Ji, 30 Mar 2026).

1. Theoretical Foundations: Isoclinic Rotations and Quaternions

Orthogonal transforms are essential in decorrelating KV vectors before scalar quantization in low-bit LLM inference pipelines. Dense $d \times d$ random orthogonal mappings, as used in TurboQuant, offer robust decorrelation but incur $O(d^2)$ arithmetic and storage, conflicting with autoregressive decoding constraints. RotorQuant transitions to $O(d)$ cost by blockwise 3D Clifford rotors, but its 3D partitioning leads to incomplete tiling and only three local mixing degrees of freedom.

IsoQuant adopts a blockwise 4D structure, leveraging the isoclinic (double $SU(2)$) decomposition of $SO(4)$, represented via quaternionic algebra. Every $v\in\mathbb{R}^4$ is mapped to a quaternion $v = x_0 + x_1 i + x_2 j + x_3 k$, and each $SO(4)$ rotation is parameterized by a pair of unit quaternions $(q_L, q_R)$. Crucially, this provides six real degrees of rotational freedom within each block, maximizing feature spreading and aligning exactly with the power-of-two head/channel sizes ubiquitous in LLM architectures.
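The quaternion embedding above rests on the Hamilton product. The following is a minimal sketch, assuming a `(w, x, y, z)` component order for $x_0 + x_1 i + x_2 j + x_3 k$; the name `qmul` is illustrative and does not come from the source. It checks the defining relations $ij = k$, $ji = -k$, and $i^2 = -1$.

```python
def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

# Basis quaternions i, j, k in (w, x, y, z) order.
I = (0.0, 1.0, 0.0, 0.0)
J = (0.0, 0.0, 1.0, 0.0)
K = (0.0, 0.0, 0.0, 1.0)

ij = qmul(I, J)  # i * j
ji = qmul(J, I)  # j * i (the product is not commutative)
ii = qmul(I, I)  # i * i
```

Each unit quaternion has three free parameters, so a pair of them accounts for the six rotational degrees of freedom per block.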

2. Closed-Form $SO(4)$ Transform via Quaternion Algebra

The core IsoQuant transform for a 4D block $v \in \mathbb{R}^4$ is: $T(v) = q_L \, v \, \overline{q_R}$ where $q_L$ and $q_R$ are unit quaternions and $\overline{q_R}$ is the conjugate of $q_R$. This operation is an orthogonal map on $\mathbb{R}^4$ (i.e., in $SO(4)$), with inverse

$T^{-1}(y) = \overline{q_L} \, y \, q_R$

This encapsulates every rotation in $SO(4)$ (modulo the $(q_L, q_R) \sim (-q_L, -q_R)$ double-cover ambiguity), enabling maximal local mixing of block features. Unlike constructing blockwise $4 \times 4$ matrices, this formulation is amenable to efficient implementation as sequential quaternion multiplications, facilitating optimized SIMD and CUDA kernels.
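As an illustration, the sandwich map and its inverse can be sketched in a few lines of plain Python, assuming the form $T(v) = q_L v \overline{q_R}$ with inverse $\overline{q_L}\, y\, q_R$. The function names (`iso_forward`, `iso_inverse`) and the random test data are hypothetical; this is not the source's kernel.

```python
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

def qconj(q):
    """Quaternion conjugate: negate the three imaginary components."""
    return (q[0], -q[1], -q[2], -q[3])

def unit(q):
    """Scale a nonzero 4-vector to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def iso_forward(v, q_l, q_r):
    """Sandwich map: q_l * v * conj(q_r)."""
    return qmul(qmul(q_l, v), qconj(q_r))

def iso_inverse(y, q_l, q_r):
    """Inverse sandwich map: conj(q_l) * y * q_r."""
    return qmul(qmul(qconj(q_l), y), q_r)

rng = random.Random(0)
q_l = unit(tuple(rng.gauss(0.0, 1.0) for _ in range(4)))
q_r = unit(tuple(rng.gauss(0.0, 1.0) for _ in range(4)))

v = (1.0, -2.0, 0.5, 3.0)
y = iso_forward(v, q_l, q_r)
v_back = iso_inverse(y, q_l, q_r)

# Negating both quaternions yields the same rotation (double cover).
y_neg = iso_forward(v, tuple(-c for c in q_l), tuple(-c for c in q_r))

norm_v = math.sqrt(sum(c * c for c in v))
norm_y = math.sqrt(sum(c * c for c in y))
```

The round trip recovers `v` up to floating-point error, and `norm_y` matches `norm_v`, as expected of an orthogonal map.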

3. IsoQuant Algorithmic Variants

IsoQuant comprises several blockwise transform variants, trading off expressivity and compute:

IsoQuant-Full realizes the complete $SO(4)$ rotation per 4D block. Each block maintains two parameter vectors in $\mathbb{R}^4$, normalized to unit quaternions $q_L, q_R$. The forward pass on a block $v$ is $y = q_L v \overline{q_R}$, followed by scalar quantization of $y$ ($b$ bits/coordinate), with recovery by the inverse map.
IsoQuant-Fast restricts each block to a single isoclinic factor ($q_L$ only). The transform reduces to $T(v) = q_L v$, a left-isoclinic ($SU(2)$ subgroup) mapping, halving parameter count and arithmetic at a modest cost in local mixing.
IsoQuant-2D serves as a minimal baseline, partitioning into 2D blocks $(x_0, x_1)$, each subject to a planar rotation parameterized by an angle $\theta$: $(x_0, x_1) \mapsto (x_0 \cos\theta - x_1 \sin\theta,\ x_0 \sin\theta + x_1 \cos\theta)$.
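The two reduced variants can be sketched as blockwise loops over a feature vector. This is a minimal illustration, assuming the left-isoclinic form $q_L v$ for IsoQuant-Fast and a standard planar rotation for IsoQuant-2D; `fast_forward`, `fast_inverse`, and `planar_forward` are hypothetical names, not the source's API.

```python
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

def unit(q):
    """Scale a nonzero vector to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def fast_forward(x, quats):
    """Left-isoclinic variant: one quaternion product q * v per 4D block."""
    out = []
    for i, q in enumerate(quats):
        out.extend(qmul(q, tuple(x[4 * i:4 * i + 4])))
    return out

def fast_inverse(y, quats):
    """Undo fast_forward by left-multiplying each block with conj(q)."""
    out = []
    for i, q in enumerate(quats):
        q_bar = (q[0], -q[1], -q[2], -q[3])
        out.extend(qmul(q_bar, tuple(y[4 * i:4 * i + 4])))
    return out

def planar_forward(x, thetas):
    """2D variant: rotate each consecutive coordinate pair by its own angle."""
    out = []
    for i, t in enumerate(thetas):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(t), math.sin(t)
        out.extend((c * a - s * b, s * a + c * b))
    return out

rng = random.Random(1)
d = 8
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
quats = [unit(tuple(rng.gauss(0.0, 1.0) for _ in range(4))) for _ in range(d // 4)]

y_fast = fast_forward(x, quats)
x_back = fast_inverse(y_fast, quats)
quarter_turn = planar_forward([1.0, 0.0], [math.pi / 2])
```

Both maps are orthogonal per block, so `y_fast` has the same Euclidean norm as `x`, and `fast_inverse` restores the input up to rounding.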

A summary table of arithmetic and parameter counts at $d = 128$ is given below:

| Variant | FMAs ($d=128$) | Parameter Count ($d=128$) |
|---|---|---|
| IsoQuant-Full | 1,024 | 256 |
| IsoQuant-Fast | 512 | 128 |
| IsoQuant-2D | 256 | 128 |
| RotorQuant (3D) | 2,408 | 172 |
| TurboQuant (Dense) | 16,384 | 16,384 |

IsoQuant avoids the 3D block "remainder"/tail and suboptimal register packing that affect RotorQuant, with all variants retaining $O(d)$ scaling but substantially reduced constants.

4. Computational Complexity and CUDA Kernel Performance

The quaternionic kernel structure of IsoQuant is conducive to register-efficient, fused CUDA implementation. Each quaternion multiply comprises 16 scalar multiplies and 12 adds, counted here as 16 fused multiply-adds (FMAs). For $d = 128$ (i.e., 32 blocks):

IsoQuant-Full: $32 \times 2 \times 16 = 1{,}024$ FMAs per forward pass
IsoQuant-Fast: $32 \times 16 = 512$ FMAs
IsoQuant-2D: $64 \times 4 = 256$ FMAs
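The per-variant counts can be reproduced with a short calculation. The sketch below assumes 16 FMAs per quaternion product (as stated above) and 4 FMAs per planar rotation; the latter is inferred from the 256-FMA figure over 64 blocks rather than stated in the source. The helper `fma_counts` is a hypothetical name.

```python
FMAS_PER_QUAT_PRODUCT = 16   # stated in the text
FMAS_PER_PLANAR_ROTATION = 4  # assumption inferred from 256 FMAs / 64 blocks

def fma_counts(d):
    """Forward-pass FMA counts for a feature dimension d divisible by 4."""
    if d % 4 != 0:
        raise ValueError("d must be a multiple of 4 for 4D blocking")
    blocks_4d = d // 4
    blocks_2d = d // 2
    return {
        "IsoQuant-Full": blocks_4d * 2 * FMAS_PER_QUAT_PRODUCT,  # two products
        "IsoQuant-Fast": blocks_4d * FMAS_PER_QUAT_PRODUCT,      # one product
        "IsoQuant-2D": blocks_2d * FMAS_PER_PLANAR_ROTATION,
        "Dense": d * d,                                          # d x d matvec
    }

counts = fma_counts(128)
```

Doubling `d` doubles the three blockwise counts and quadruples the dense one, which is the $O(d)$ versus $O(d^2)$ gap in concrete terms.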

Benchmarked on RTX 4090 hardware in batch-8192 scenarios across $d \in \{128, 256, 512\}$, $b \in \{2, 3, 4\}$ bit widths, and fp16/fp32 dtypes, IsoQuant demonstrates kernel-level speedups over RotorQuant of approximately $4.5\times$–$4.7\times$ on average, with observed peaks above $6\times$ in select fp16 and fp32 configurations.

5. Quantization Fidelity and Reconstruction MSE

In all $18$ tested synthetic-vector settings, the reconstruction MSE of IsoQuant-Full, IsoQuant-Fast, and IsoQuant-2D matches or marginally improves on that of RotorQuant. Quantization fidelity does not degrade when 3D rotors are replaced with 4D isoclinic rotations, which supports the decorrelation efficacy of the quaternionic blockwise approach.
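To make the measured quantity concrete, the following sketch runs a rotate, quantize, inverse-rotate round trip on one synthetic unit vector and reports its MSE. The uniform quantizer on $[-1, 1]$ is a stand-in assumption, since the quantizer used in the source's experiments is not described here, and all names are hypothetical. The sketch does not reproduce any reported result.

```python
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )

def qconj(q):
    return (q[0], -q[1], -q[2], -q[3])

def unit(q):
    """Scale a nonzero vector of any length to unit norm."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def quantize(y, bits):
    """Uniform mid-rise quantizer on [-1, 1]; returns the reconstructed value.
    Stand-in assumption, not the source's quantizer."""
    levels = 1 << bits
    step = 2.0 / levels
    idx = max(0, min(levels - 1, int(math.floor((y + 1.0) / step))))
    return -1.0 + (idx + 0.5) * step

def roundtrip_mse(x, quats, bits):
    """Rotate each 4D block, quantize, rotate back, and return the MSE."""
    sq_err = 0.0
    for i, (q_l, q_r) in enumerate(quats):
        v = tuple(x[4 * i:4 * i + 4])
        y = qmul(qmul(q_l, v), qconj(q_r))           # forward sandwich map
        y_hat = tuple(quantize(c, bits) for c in y)  # per-coordinate quantizer
        v_hat = qmul(qmul(qconj(q_l), y_hat), q_r)   # inverse sandwich map
        sq_err += sum((a - b) ** 2 for a, b in zip(v, v_hat))
    return sq_err / len(x)

rng = random.Random(0)
d, bits = 16, 4
x = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
quats = [
    (unit([rng.gauss(0.0, 1.0) for _ in range(4)]),
     unit([rng.gauss(0.0, 1.0) for _ in range(4)]))
    for _ in range(d // 4)
]
mse = roundtrip_mse(x, quats, bits)
step = 2.0 / (1 << bits)
```

Because the rotation is orthogonal, the error introduced by the quantizer keeps its norm under the inverse map, so `mse` cannot exceed `(step / 2) ** 2` for this quantizer.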

6. Hardware Alignment, Memory, and Systems Integration

IsoQuant's strict 4D block structure is inherently aligned with power-of-two head dimensions ($d =$ 64, 128, 256, ...), avoiding the leftover tail and related inefficiencies of 3D blocking (for example, $128 = 42 \cdot 3 + 2$ leaves two coordinates outside any full 3D block). Each float4/int32 SIMD lane maps directly to a quaternion block, optimizing vectorized loads, stores, and register management in both CPU SIMD and GPU warp execution contexts. Fusing all blockwise operations (rotations, quantization, inverse transform) within registers minimizes memory traffic, further accelerating kernel throughput.
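The alignment argument comes down to integer division. The short sketch below (the variable name `layouts` is illustrative) tabulates how many full blocks and leftover coordinates 3D and 4D blocking produce for common head dimensions.

```python
# For each head dimension, record (full_blocks, leftover_coordinates)
# under 3D and 4D blocking.
layouts = {
    d: {"3D": divmod(d, 3), "4D": divmod(d, 4)}
    for d in (64, 128, 256)
}

for d, layout in layouts.items():
    full3, tail3 = layout["3D"]
    full4, tail4 = layout["4D"]
    print(f"d={d}: 3D -> {full3} blocks + tail {tail3}; "
          f"4D -> {full4} blocks + tail {tail4}")
```

A power of two is never divisible by 3, so 3D blocking always leaves a tail of one or two coordinates, while 4D blocking leaves none for any $d \geq 4$ that is a power of two.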

Parameter storage is reduced in IsoQuant due to block size and quaternion parameterization. For $d = 128$: IsoQuant-Full (256 parameters), IsoQuant-Fast (128), IsoQuant-2D (128), RotorQuant (172), TurboQuant (16,384). This aligns with constrained memory budgets in on-device LLM inference.
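These parameter counts follow from the per-block storage. The sketch below assumes eight values per block for IsoQuant-Full (two quaternions), four for IsoQuant-Fast (one quaternion), and two per 2D block; the last figure is inferred from 128 parameters over 64 blocks and is not stated in the source. `param_counts` is a hypothetical helper.

```python
def param_counts(d):
    """Stored rotation parameters for a feature dimension d divisible by 4."""
    if d % 4 != 0:
        raise ValueError("d must be a multiple of 4 for 4D blocking")
    return {
        "IsoQuant-Full": (d // 4) * 8,  # two quaternions per 4D block
        "IsoQuant-Fast": (d // 4) * 4,  # one quaternion per 4D block
        "IsoQuant-2D": (d // 2) * 2,    # assumption: two values per 2D block
        "Dense": d * d,                 # full d x d orthogonal matrix
    }

params = param_counts(128)
# Ratio of dense storage to IsoQuant-Full storage at d = 128.
dense_to_full_ratio = params["Dense"] // params["IsoQuant-Full"]
```

At $d = 128$ the dense matrix needs 64 times the storage of IsoQuant-Full, and that ratio grows linearly with $d$.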

IsoQuant slots cleanly into the stage-1 decorrelation step of LLM KV-cache compression pipelines and remains compatible with stage-2 schemes (e.g., QJL residual correction). By reducing the overhead of the orthogonalization stage, it facilitates more efficient inference as LLM context lengths scale.

7. Summary and Outlook

IsoQuant replaces blockwise 3D Clifford rotors with 4D isoclinic rotations parameterized by quaternion pairs, providing a mathematically concise and hardware-optimized alternative for stage-1 vector quantization transforms in KV caches of LLMs. The framework encompasses full $SO(4)$ rotations (IsoQuant-Full), a reduced-cost $SU(2)$ slice (IsoQuant-Fast), and a lightweight 2D fallback, each offering linear arithmetic cost with substantially reduced per-block constants relative to RotorQuant. Kernel-level benchmarks demonstrate $4.5\times$–$4.7\times$ average speedup and no loss in reconstruction accuracy in synthetic settings. Full end-to-end KV-cache evaluation is identified as ongoing work, but current evidence supports IsoQuant as a practical and theoretically grounded component for scalable, low-bit LLM inference (Ji, 30 Mar 2026).

References (1)

IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression (2026)
