Block-wise Triplet Quantization
- Block-wise triplet quantization is a method of compressing high-dimensional key vectors by partitioning rotated data into three-dimensional triplets for joint quantization.
- It employs octahedral mapping to parametrize triplet directions and uses scalar quantization for norms, achieving a non-uniform bit allocation that minimizes mean squared error.
- Empirical results demonstrate lower MSE and improved performance over per-coordinate methods across tasks such as language modeling, video, and audio compression.
Block-wise triplet quantization refers to a joint vector quantization technique for compressing high-dimensional key vectors, principally in attention-based sequence models, that operates by dividing the rotated and normalized feature space into contiguous three-dimensional blocks (“triplets”). Each triplet is then jointly quantized by parametrizing its direction using an octahedral map and its norm as a scalar, optimizing for squared error under data-oblivious distributions derived from random orthogonal rotation. This method achieves higher fidelity than per-coordinate scalar quantization, particularly in low bitwidth regimes, as established in the OCTOPUS codec (Boss et al., 20 May 2026).
1. Rotation Preconditioning and Marginalization
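The first step separates the Euclidean norm of the original key vector $k \in \mathbb{R}^d$ from its direction: $r = \|k\|_2$, $\hat{k} = k / r$. A randomized, structured orthogonal rotation is constructed as $R = H D$, where $H$ is the normalized Walsh–Hadamard matrix and $D = \operatorname{diag}(s)$ for a sign-flip vector $s \in \{-1, +1\}^d$ sampled per attention head. This rotation preserves norms ($\|R x\|_2 = \|x\|_2$) and can be evaluated in $O(d \log d)$ time. The coordinates of the rotated vector $y = R\hat{k}$ inherit known symmetric-Beta marginal distributions:
$$f_d(x) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)}\,(1 - x^2)^{(d-3)/2}, \qquad x \in [-1, 1].$$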
This preconditioning both homogenizes the variance across coordinates and induces analytically tractable marginals necessary for subsequent quantization.
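The preconditioning step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the rotation is applied as a sign flip followed by a normalized fast Walsh–Hadamard transform and that the key dimension is a power of two; the function names `fwht`, `precondition`, and `undo_precondition` are hypothetical and do not come from the source.

```python
import math
import random

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def precondition(k, signs):
    """Return (r, y): the key norm r and the rotated unit vector y = H (s * k / r)."""
    r = math.sqrt(sum(v * v for v in k))
    if r == 0.0:
        raise ValueError("zero vector has no direction")
    return r, fwht([s * v / r for s, v in zip(signs, k)])

def undo_precondition(r, y, signs):
    """Invert precondition: the normalized H is its own inverse, and s * s = 1."""
    return [r * s * v for s, v in zip(signs, fwht(y))]

# Usage: rotate a random 8-dimensional key and recover it.
rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]
key = [rng.gauss(0.0, 1.0) for _ in range(8)]
r, y = precondition(key, signs)
key_back = undo_precondition(r, y, signs)
```

The butterfly loop performs $d \log_2 d$ additions, which is where the quasi-linear cost comes from; no $d \times d$ matrix is ever formed.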
2. Octahedral Parametrization of Triplets
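The rotated vector $y$ is partitioned into $m$ contiguous triplets $t_j = (y_{3j-2}, y_{3j-1}, y_{3j})$, $j = 1, \dots, m$. Each triplet is further decomposed into its Euclidean norm $\rho_j = \|t_j\|_2$ and its unit direction $u_j = t_j / \rho_j$. The unit direction $u = (u_1, u_2, u_3)$ is mapped to the square $[-1, 1]^2$ via a piecewise-linear, equal-area octahedral projection:
- Compute $p_1 = u_1 / \|u\|_1$ and $p_2 = u_2 / \|u\|_1$ for $\|u\|_1 = |u_1| + |u_2| + |u_3|$.
- If $u_3 \ge 0$, set $(a, b) = (p_1, p_2)$; otherwise, “fold” the lower hemisphere using $(a, b) = \big((1 - |p_2|)\operatorname{sgn}(p_1),\ (1 - |p_1|)\operatorname{sgn}(p_2)\big)$.
The inverse map reconstructs $u$ from $(a, b)$ by “unfolding” and normalizing. This parametrization enables efficient and uniform quantization of directions on the unit sphere $S^2$ with minimal distortion.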
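The forward and inverse maps can be illustrated as follows. This sketch assumes the common L1-normalized octahedral encoding with a fold of the lower hemisphere; the source's exact variant (which it describes as equal-area) may differ, and the names `oct_encode` and `oct_decode` are hypothetical.

```python
import math

def _sign(v):
    """Sign with sign(0) = +1, so points on the fold lines map consistently."""
    return 1.0 if v >= 0.0 else -1.0

def oct_encode(u):
    """Map a nonzero 3-vector (only its direction matters) to (a, b) in [-1, 1]^2."""
    x, y, z = u
    l1 = abs(x) + abs(y) + abs(z)
    if l1 == 0.0:
        raise ValueError("zero vector has no direction")
    px, py = x / l1, y / l1
    if z >= 0.0:
        return px, py
    # Lower hemisphere: reflect across the diamond |a| + |b| = 1.
    return (1.0 - abs(py)) * _sign(px), (1.0 - abs(px)) * _sign(py)

def oct_decode(a, b):
    """Inverse map: unfold the lower hemisphere, then renormalize to unit length."""
    z = 1.0 - abs(a) - abs(b)
    if z >= 0.0:
        x, y = a, b
    else:
        x = (1.0 - abs(b)) * _sign(a)
        y = (1.0 - abs(a)) * _sign(b)
    n = math.sqrt(x * x + y * y + z * z)
    return x / n, y / n, z / n

# Usage: a direction in the lower hemisphere survives the round trip.
a, b = oct_encode((0.0, 0.6, -0.8))
u_back = oct_decode(a, b)
```

Because both parameters live on the same bounded interval, a single scalar codebook can serve `a` and `b`.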
3. Joint Lloyd–Max Quantization and Bit Allocation
OCTOPUS employs separate 1D Lloyd–Max quantizers for the two octahedral direction parameters and the triplet norm. Codebooks are defined as:
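- $\mathcal{C}_{\mathrm{dir}}$ (shared for the two direction parameters $a$ and $b$), trained on their marginal distribution,
- $\mathcal{C}_{\mathrm{norm}}$ for the triplet norm $\rho$, trained against the distribution of that norm.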
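A scalar Lloyd–Max codebook of this kind can be fitted by alternating nearest-centroid assignment and centroid averaging. The sketch below is illustrative only: it trains on Gaussian samples as a stand-in, since the actual direction and norm marginals are not reproduced here, and the names `lloyd_max` and `quantize` are hypothetical.

```python
import bisect
import random

def lloyd_max(samples, levels, iters=50):
    """Fit `levels` centroids to 1-D samples with Lloyd iterations."""
    xs = sorted(samples)
    n = len(xs)
    # Initialise at evenly spaced sample quantiles.
    centroids = [xs[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        # Decision boundaries sit at midpoints between consecutive centroids.
        bounds = [(lo + hi) / 2 for lo, hi in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        cell = 0
        for x in xs:  # xs is sorted, so the cell index only moves forward
            while cell < levels - 1 and x > bounds[cell]:
                cell += 1
            sums[cell] += x
            counts[cell] += 1
        # Move each centroid to the mean of its cell; keep empty cells unchanged.
        centroids = [s / c if c else old
                     for s, c, old in zip(sums, counts, centroids)]
    return centroids

def quantize(x, centroids):
    """Return the index of the cell containing x (midpoint decision rule)."""
    bounds = [(lo + hi) / 2 for lo, hi in zip(centroids, centroids[1:])]
    return bisect.bisect_left(bounds, x)

# Usage: a 2-bit (4-level) codebook for a Gaussian stand-in source.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]
codebook = lloyd_max(samples, levels=4)
index = quantize(0.3, codebook)
```

Since the codebooks depend only on the assumed marginals and not on the data being compressed, they can be trained once and shipped as small lookup tables.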
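Quantization indices are determined by the midpoints between consecutive centroids, so each value maps to the index of its nearest centroid. For a triplet $t = \rho u$ reconstructed as $\hat{t} = \hat{\rho}\hat{u}$, the squared-error distortion is approximately
$$\|t - \hat{t}\|_2^2 \approx (\rho - \hat{\rho})^2 + \rho^2\,\|u - \hat{u}\|_2^2.$$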
Bit allocation between direction and norm is optimized under an MSE criterion. With $b_d$ bits per direction coordinate and $b_n$ bits for the norm, the optimal gap between $b_d$ and $b_n$ is determined by the direction and norm variances $\sigma_d^2$ and $\sigma_n^2$. Empirically, the optimal allocation is strictly non-uniform and dimension-dependent.
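A standard high-rate argument gives some intuition for why the allocation comes out non-uniform. The derivation below is a sketch under stated assumptions, not the source's exact expression: it models each scalar quantizer's distortion as proportional to its source variance times $2^{-2b}$, writes $b_d$ for the bits per direction coordinate, $b_n$ for the norm bits, and $\sigma_d^2$, $\sigma_n^2$ for the direction and norm variances, and introduces hypothetical shape constants $c_d$, $c_n$ and a hypothetical per-triplet budget $B$.

```latex
\[
\begin{aligned}
D(b_d, b_n) &\approx 2\,c_d\,\sigma_d^2\,2^{-2 b_d} + c_n\,\sigma_n^2\,2^{-2 b_n},
  \qquad \text{subject to } 2 b_d + b_n = B, \\
\text{stationarity:}\quad c_d\,\sigma_d^2\,2^{-2 b_d} &= c_n\,\sigma_n^2\,2^{-2 b_n}, \\
b_d - b_n &= \tfrac{1}{2}\,\log_2 \frac{c_d\,\sigma_d^2}{c_n\,\sigma_n^2}.
\end{aligned}
\]
```

Under this model the gap vanishes only when the two weighted variances coincide, so equal bit widths for direction and norm are generally suboptimal.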
4. Quantization Error Analysis
The expected total MSE over all triplets follows from Panter–Dite high-rate quantization theory, using the sphere prior induced by the random rotation. The error per triplet is bounded by the sum of the norm quantization error and the direction quantization error, with the direction term scaled by the squared triplet norm.
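The split of a triplet's error into a norm term and a direction term can be checked numerically. The sketch below relies on the exact identity $\|\rho u - \hat{\rho}\hat{u}\|^2 = (\rho - \hat{\rho})^2 + \rho\hat{\rho}\,\|u - \hat{u}\|^2$, which holds for any two nonzero vectors; the function name `split_error` is hypothetical.

```python
import math

def split_error(t, t_hat):
    """Return (total, norm_term, direction_term) for a triplet and its reconstruction.

    Both inputs must be nonzero; total equals norm_term + direction_term exactly.
    """
    rho = math.sqrt(sum(v * v for v in t))
    rho_hat = math.sqrt(sum(v * v for v in t_hat))
    u = [v / rho for v in t]
    u_hat = [v / rho_hat for v in t_hat]
    total = sum((a - b) ** 2 for a, b in zip(t, t_hat))
    norm_term = (rho - rho_hat) ** 2
    direction_term = rho * rho_hat * sum((a - b) ** 2 for a, b in zip(u, u_hat))
    return total, norm_term, direction_term

# Usage: a slightly perturbed reconstruction of one triplet.
total, norm_term, direction_term = split_error((0.30, -0.10, 0.20),
                                               (0.28, -0.12, 0.23))
```

When the norm is quantized accurately, $\rho\hat{\rho} \approx \rho^2$, so the direction error is effectively weighted by the squared triplet norm.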
5. Encoder/Decoder Workflow and Fused Implementation
The encoding process proceeds per-key as follows:
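- Compute the norm $r = \|k\|_2$, the unit vector $\hat{k} = k / r$, and the rotated vector $y = R\hat{k}$.
- For each triplet $t_j$ of $y$: compute $\rho_j = \|t_j\|_2$, map $t_j / \rho_j$ to $(a_j, b_j)$, quantize $(a_j, b_j)$ to indices in the direction codebook $\mathcal{C}_{\mathrm{dir}}$, and quantize $\rho_j$ to an index in the norm codebook $\mathcal{C}_{\mathrm{norm}}$. Optionally, a local search over index neighbors refines the indices.
- Pack all direction and norm indices, plus an fp32 value for $r$.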
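The packing step can be sketched with the standard library alone. The layout here (a little-endian fp32 norm followed by LSB-first index fields) and the bit widths in the usage example are assumptions chosen for illustration, not the format used by the source; `pack_key` and `unpack_key` are hypothetical names.

```python
import struct

def pack_key(r, triplet_indices, b_dir, b_norm):
    """Pack an fp32 norm followed by per-triplet (i_a, i_b, i_norm) index fields."""
    acc = 0
    nbits = 0
    for i_a, i_b, i_norm in triplet_indices:
        for value, width in ((i_a, b_dir), (i_b, b_dir), (i_norm, b_norm)):
            if not 0 <= value < (1 << width):
                raise ValueError("index does not fit in its bit width")
            acc |= value << nbits  # append the field above the bits already used
            nbits += width
    payload = acc.to_bytes((nbits + 7) // 8, "little")
    return struct.pack("<f", r) + payload

def unpack_key(blob, n_triplets, b_dir, b_norm):
    """Inverse of pack_key: return (r, [(i_a, i_b, i_norm), ...])."""
    (r,) = struct.unpack_from("<f", blob, 0)
    acc = int.from_bytes(blob[4:], "little")
    out = []
    for _ in range(n_triplets):
        fields = []
        for width in (b_dir, b_dir, b_norm):
            fields.append(acc & ((1 << width) - 1))
            acc >>= width
        out.append(tuple(fields))
    return r, out

# Usage: three triplets with 3-bit direction fields and a 2-bit norm field.
indices = [(5, 0, 3), (7, 2, 1), (1, 6, 0)]
blob = pack_key(1.5, indices, b_dir=3, b_norm=2)
r_back, indices_back = unpack_key(blob, 3, b_dir=3, b_norm=2)
```

With these illustrative widths each triplet costs 8 bits, so the three triplets occupy 3 bytes after the 4-byte norm.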
The decoder (implemented as a fused split-K flash kernel) performs bit-unpacking, centroid lookup, the octahedral inverse, triplet reconstruction, and the inverse Walsh–Hadamard transform, then rescales by the key norm $r$ (deferred until inside the attention dot product). The full uncompressed key is never materialized; all operations remain in registers, minimizing memory bandwidth.
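Deferring the norm works because the dot product is linear: scaling the score once by the stored norm gives the same result as scaling every reconstructed coordinate. The plain-Python sketch below illustrates only this algebra, not the fused kernel; it assumes the rotation is a sign flip followed by a normalized Walsh–Hadamard transform, takes the already dequantized rotated vector `y_hat` as input, and uses the hypothetical names `fwht` and `score_with_deferred_norm`.

```python
import math

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def score_with_deferred_norm(q, y_hat, signs, r):
    """Attention logit q . k, with k = r * (D H y_hat) and r applied last."""
    # Undo the rotation: H is its own inverse, then flip the signs back.
    k_dir = [s * v for s, v in zip(signs, fwht(y_hat))]
    # One scalar multiply replaces a per-coordinate rescale of the key.
    return r * sum(a * b for a, b in zip(q, k_dir))

# Usage: with an unquantized y the score matches the plain dot product.
key = [0.5, -1.0, 2.0, 0.25]
query = [1.0, 0.5, -0.5, 2.0]
signs = [1.0, -1.0, -1.0, 1.0]
r = math.sqrt(sum(v * v for v in key))
y = fwht([s * v / r for s, v in zip(signs, key)])
score = score_with_deferred_norm(query, y, signs, r)
```

In this example `y` is exact, so `score` reproduces the plain dot product of `query` and `key` up to rounding; with quantized inputs the same code path yields the approximate logit.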
A fused Triton kernel implementation reconstructs keys on the fly with no increased decode-time bandwidth or latency compared to per-coordinate dequantization, as only the packed bitstreams and small centroid tables are required.
6. Empirical Results and Comparative Performance
On synthetic Gaussian keys, the per-vector MSE of block-wise triplet quantization is lower than that of per-coordinate TurboQuant at 4 bits per coordinate and lower than that of PolarQuant at 2 bits. Adding a 1-bit QJL residual (OCTOPUS-QJL) yields a lower inner-product error than TurboQuant-QJL.
For long-context LLMs (Qwen2.5-7B on WikiText-2/C4) at 4 bits per coordinate, the perplexity (PPL) gap to fp16 is +2.7% for block-wise triplet quantization, compared with +3.1%, +4.4%, and +8.0% for TurboQuant-MSE, PolarQuant, and TurboQuant-QJL, respectively. At 2 bits, the PPL gap is +34.7% (versus +63%, +187%, and +772% for the same baselines), and retrieval recall remains robust, whereas competitors fail.
For autoregressive video (Wan-1.3B DiT) at 2 bits per coordinate, the worst-case LPIPS of block-wise triplet quantization stays below that of TurboQuant-QJL, whose score is indicative of visual noise. In next-scale autoregressive audio, OCTOPUS retains usable log-spectral distance (LSD) and SNR at 2 bits, while baselines degrade in LSD and fall to negative SNR.
Block-wise triplet quantization strictly outperforms prior per-coordinate codecs (TurboQuant, PolarQuant) in the extreme-compression regime (2–3 bits per coordinate), retaining a (smaller) lead at 4 bits. The addition of QJL further improves unbiased inner-product estimation at a modest bit overhead.
7. Summary and Implications
Block-wise triplet quantization, as formalized in OCTOPUS, generalizes rotation–Lloyd–Max codecs by partitioning the rotated feature space into three-dimensional blocks and encoding each block's norm and direction jointly via octahedral mapping. This yields a dimension-dependent, strictly non-uniform bit allocation between the two direction parameters and the norm, which is found to be MSE-optimal and data-oblivious. The fused implementation, notably using Triton kernels, provides high computational efficiency with negligible extra peak memory and achieves higher compression ratios than baselines. Across modalities and tasks, block-wise triplet quantization is strictly superior in low-bit regimes and never worse than existing rotation-based quantizers at high bit widths (Boss et al., 20 May 2026).