
Block-wise Triplet Quantization

Updated 26 May 2026
  • Block-wise triplet quantization is a method of compressing high-dimensional key vectors by partitioning rotated data into three-dimensional triplets for joint quantization.
  • It employs octahedral mapping to parametrize triplet directions and uses scalar quantization for norms, achieving a non-uniform bit allocation that minimizes mean squared error.
  • Empirical results demonstrate lower MSE and improved performance over per-coordinate methods across tasks such as language modeling, video, and audio compression.

Block-wise triplet quantization refers to a joint vector quantization technique for compressing high-dimensional key vectors, principally in attention-based sequence models, that operates by dividing the rotated and normalized feature space into contiguous three-dimensional blocks (“triplets”). Each triplet is then jointly quantized by parametrizing its direction using an octahedral map and its norm as a scalar, optimizing for squared error under data-oblivious distributions derived from random orthogonal rotation. This method achieves higher fidelity than per-coordinate scalar quantization, particularly in low bitwidth regimes, as established in the OCTOPUS codec (Boss et al., 20 May 2026).

1. Rotation Preconditioning and Marginalization

The initial step is to transform the original key vector $k \in \mathbb{R}^d$ by separating its Euclidean norm and normalizing it to unit length: $y = \|k\|_2$, $\tilde{u} = k/y \in S^{d-1}$. A randomized, structured orthogonal rotation is constructed as $R = H\,\mathrm{diag}(s)$, where $H$ is the normalized Walsh–Hadamard matrix and $s \in \{\pm1\}^d$ is a sign-flip vector sampled per attention head. This rotation satisfies $R^\top R = I$ and can be evaluated in $O(d\log d)$ time. The rotated coordinates $u = R\tilde{u}$ inherit known symmetric-Beta marginal distributions:

$$f(u_j) \propto (1-u_j^2)^{(d-3)/2}, \quad u_j \in [-1,1].$$

This preconditioning both homogenizes the variance across coordinates and induces analytically tractable marginals necessary for subsequent quantization.
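The preconditioning step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming $d$ is a power of two; the function names `fwht`, `precondition`, and `unrotate` are hypothetical and not taken from the OCTOPUS code.

```python
import math
import random

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes H orthogonal
    return [v * scale for v in x]

def precondition(k, signs):
    """Return (y, u) with y = ||k||_2 and u = H diag(signs) (k / y)."""
    y = math.sqrt(sum(v * v for v in k))
    if y == 0.0:
        raise ValueError("a zero key has no direction")
    u = fwht([s * v / y for s, v in zip(signs, k)])
    return y, u

def unrotate(u, signs):
    """Apply R^T = diag(signs) H; H is symmetric and orthogonal."""
    return [s * v for s, v in zip(signs, fwht(u))]

rng = random.Random(0)
d = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # one sign vector per head
k = [rng.gauss(0.0, 1.0) for _ in range(d)]
y, u = precondition(k, signs)
back = unrotate(u, signs)  # recovers k / y up to rounding
```

Because the normalized Hadamard matrix is its own inverse, the same butterfly routine serves both the encoder and the decoder; only the order of the sign flip changes.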

2. Octahedral Parametrization of Triplets

The rotated vector $u$ is partitioned into contiguous triplets of coordinates. Each triplet is further decomposed into its Euclidean norm and its unit direction on $S^2$. The unit direction is mapped to a square via a piecewise-linear, equal-area octahedral projection:

  • Compute the two planar coordinates of the direction after projecting it onto the octahedron.
  • If the direction lies in the upper hemisphere, keep these coordinates as they are; else, “fold” the lower hemisphere into the outer corners of the square.

The inverse map reconstructs the unit direction from the two square coordinates by “unfolding” and normalizing. This parametrization enables efficient and uniform quantization of $S^2$ directions with minimal distortion.
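As an illustration of the fold and unfold steps, the sketch below implements the standard octahedral encoding used in computer graphics, which is piecewise linear before normalization. It is an assumption that this matches the map in the source; an exactly equal-area variant would differ in detail. The names `oct_encode` and `oct_decode` are hypothetical.

```python
import math
import random

def _sign(x):
    # Sign with sign(0) = +1 so the fold never collapses to zero.
    return 1.0 if x >= 0.0 else -1.0

def oct_encode(v):
    """Map a unit 3-vector to (a, b) in the square [-1, 1]^2."""
    x, y, z = v
    s = abs(x) + abs(y) + abs(z)   # L1 norm: projects onto the octahedron
    a, b = x / s, y / s
    if z < 0.0:                    # fold the lower hemisphere outward
        a, b = (1.0 - abs(b)) * _sign(a), (1.0 - abs(a)) * _sign(b)
    return a, b

def oct_decode(a, b):
    """Inverse map: unfold, then renormalize to unit length."""
    z = 1.0 - abs(a) - abs(b)
    if z < 0.0:                    # point came from the lower hemisphere
        a, b = (1.0 - abs(b)) * _sign(a), (1.0 - abs(a)) * _sign(b)
    n = math.sqrt(a * a + b * b + z * z)
    return (a / n, b / n, z / n)

# Round-trip check on random unit directions.
rng = random.Random(1)
max_err = 0.0
in_square = True
for _ in range(1000):
    g = [rng.gauss(0.0, 1.0) for _ in range(3)]
    n = math.sqrt(sum(c * c for c in g))
    v = [c / n for c in g]
    a, b = oct_encode(v)
    in_square = in_square and abs(a) <= 1.0 and abs(b) <= 1.0
    w = oct_decode(a, b)
    max_err = max(max_err, max(abs(p - q) for p, q in zip(v, w)))
```

Without quantization the round trip is exact up to floating-point rounding, so all reconstruction error in a codec built on this map comes from quantizing the two square coordinates and the norm.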

3. Joint Lloyd–Max Quantization and Bit Allocation

OCTOPUS employs separate 1D Lloyd–Max quantizers for the two octahedral direction parameters and the triplet norm. Codebooks are defined as:

  • a direction codebook (shared by the two octahedral coordinates), trained on their marginal,
  • a norm codebook for the triplet norm, trained against its marginal distribution.

Quantization indices are determined by the midpoint boundaries between consecutive centroids. For a triplet, the squared-error distortion is approximated as the sum of a norm-quantization term and a direction-quantization term.
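A 1D Lloyd–Max codebook and its midpoint-boundary lookup can be sketched as follows. This is an empirical version fitted to samples, whereas the source trains against analytic marginals; the Gaussian test data and the names `lloyd_max` and `quantize` are assumptions made for illustration.

```python
import bisect
import random

def lloyd_max(samples, bits, iters=100):
    """Fit 2**bits centroids to 1D samples with Lloyd's algorithm."""
    xs = sorted(samples)
    levels = 2 ** bits
    # Initialise centroids at evenly spaced sample quantiles.
    centroids = [xs[(2 * i + 1) * len(xs) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        # Decision boundaries are midpoints between consecutive centroids.
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        for x in xs:
            j = bisect.bisect_right(bounds, x)
            sums[j] += x
            counts[j] += 1
        # Each centroid moves to the mean of its cell; empty cells keep theirs.
        centroids = [s / c if c else old
                     for s, c, old in zip(sums, counts, centroids)]
    return centroids

def quantize(x, centroids):
    """Return the index of the cell containing x."""
    bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return bisect.bisect_right(bounds, x)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(4000)]
codebook = lloyd_max(samples, bits=3)
mse = sum((x - codebook[quantize(x, codebook)]) ** 2 for x in samples) / len(samples)
```

Because the boundaries are midpoints of a sorted centroid list, encoding a value reduces to a binary search, and decoding is a table lookup.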

Bit allocation between direction and norm is optimized under an MSE criterion. Given a number of bits per direction coordinate and a number of bits for the norm, the triplet distortion is expressed as a function of both, and the optimal bit-gap between them follows from the direction and norm variances. Empirically, the optimum is a strictly non-uniform, dimension-dependent bit allocation.
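To make the idea of a bit-gap concrete, the following is a sketch under the textbook high-rate model, not necessarily the exact expression used in the source. The symbols $b_d$ (bits per direction coordinate), $b_r$ (bits for the norm), $B$ (total bits per triplet), and the shared constant $c$ are assumptions, and $\sigma_d^2$ is taken to already include any weighting by the triplet norm.

```latex
\begin{aligned}
D(b_d, b_r) &\approx 2\,c\,\sigma_d^2\,2^{-2 b_d} + c\,\sigma_r^2\,2^{-2 b_r},
  \qquad \text{subject to } 2 b_d + b_r = B, \\
\text{stationarity:}\quad
  \sigma_d^2\,2^{-2 b_d} &= \sigma_r^2\,2^{-2 b_r}, \\
b_d - b_r &= \tfrac{1}{2}\log_2\!\frac{\sigma_d^2}{\sigma_r^2}.
\end{aligned}
```

Under this model each of the three quantized scalars contributes the same distortion at the optimum, so the component with the larger variance receives more bits.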

4. Quantization Error Analysis

The expected total MSE aggregates the per-triplet distortion over all triplets of the rotated vector. The closed form follows directly from the Panter–Dite high-rate quantization theory, leveraging the rotation-sphere prior induced by $R$. The error per triplet is bounded by the sum of norm quantization error and direction quantization error, each weighted appropriately.
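One way to see the split into a norm term and a direction term is the exact identity below, written for a single triplet with norm $r$ and unit direction $v$, and their quantized counterparts $\hat{r}$ and $\hat{v}$ (also of unit length). These symbols are introduced here for illustration and are not taken from the source.

```latex
\begin{aligned}
\|r v - \hat{r}\hat{v}\|^2
  &= r^2 + \hat{r}^2 - 2 r \hat{r}\,\langle v, \hat{v}\rangle \\
  &= (r - \hat{r})^2 + 2 r \hat{r}\,\bigl(1 - \langle v, \hat{v}\rangle\bigr) \\
  &= (r - \hat{r})^2 + r\,\hat{r}\,\|v - \hat{v}\|^2 .
\end{aligned}
```

When $\hat{r} \approx r$, the direction error is weighted by roughly $r^2$, so triplets with small norm tolerate coarser direction quantization.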

5. Encoder/Decoder Workflow and Fused Implementation

The encoding process proceeds per-key as follows:

  1. Compute $y = \|k\|_2$, $\tilde{u} = k/y$, $u = R\tilde{u}$.
  2. For each triplet of $u$: compute its norm and unit direction, map the direction to its two octahedral coordinates, quantize these coordinates to indices in the direction codebook, and the norm to an index in the norm codebook. Optionally, a local search over index neighbors refines the indices.
  3. Pack all direction and norm indices, plus an fp32 value for $y$.

The decoder (implemented as a fused split-K flash kernel) performs bit-unpacking, centroid lookup, octahedral inverse, triplet reconstruction, and the inverse Walsh–Hadamard transform, then remultiplies by the key norm $y$ (delayed until inside the attention dot-product). The full uncompressed key is never materialized; all operations remain in registers, minimizing memory bandwidth.
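The encode and decode steps above can be sketched end to end in plain Python. Several simplifications are assumptions of this sketch rather than features of OCTOPUS: uniform grids stand in for the Lloyd–Max codebooks, the rotated vector is zero-padded to a multiple of three, the standard graphics octahedral map is used, and bit-packing, the local index search, and the fused kernel are omitted. All function names are hypothetical.

```python
import math
import random

def fwht(x):
    """Normalized Walsh-Hadamard transform (its own inverse); power-of-two length."""
    x, h = list(x), 1
    while h < len(x):
        for start in range(0, len(x), 2 * h):
            for i in range(start, start + h):
                x[i], x[i + h] = x[i] + x[i + h], x[i] - x[i + h]
        h *= 2
    return [v / math.sqrt(len(x)) for v in x]

def _sign(x):
    return 1.0 if x >= 0.0 else -1.0

def _fold(a, b):
    return (1.0 - abs(b)) * _sign(a), (1.0 - abs(a)) * _sign(b)

def oct_encode(v):
    s = abs(v[0]) + abs(v[1]) + abs(v[2])
    a, b = v[0] / s, v[1] / s
    return _fold(a, b) if v[2] < 0.0 else (a, b)

def oct_decode(a, b):
    z = 1.0 - abs(a) - abs(b)
    if z < 0.0:
        a, b = _fold(a, b)
    n = math.sqrt(a * a + b * b + z * z)
    return (a / n, b / n, z / n)

def grid(lo, hi, bits):
    """Uniform codebook of cell centres; a stand-in for a Lloyd-Max codebook."""
    step = (hi - lo) / 2 ** bits
    return [lo + (i + 0.5) * step for i in range(2 ** bits)]

def nearest(x, book):
    return min(range(len(book)), key=lambda i: abs(book[i] - x))

def encode(k, signs, dir_book, norm_book):
    y = math.sqrt(sum(v * v for v in k))             # step 1: norm,
    u = fwht([s * v / y for s, v in zip(signs, k)])  # normalize, rotate
    u += [0.0] * (-len(u) % 3)                       # assumed zero padding
    codes = []
    for i in range(0, len(u), 3):                    # step 2: per triplet
        t = u[i:i + 3]
        r = math.sqrt(sum(v * v for v in t))
        a, b = oct_encode([v / r for v in t]) if r > 0.0 else (0.0, 0.0)
        codes.append((nearest(a, dir_book), nearest(b, dir_book),
                      nearest(r, norm_book)))
    return y, codes                                  # step 3 would pack these

def decode(y, codes, signs, dir_book, norm_book):
    u = []
    for ia, ib, ir in codes:
        v = oct_decode(dir_book[ia], dir_book[ib])
        u.extend(norm_book[ir] * c for c in v)
    x = fwht(u[:len(signs)])                         # drop padding, invert H
    return [y * s * c for s, c in zip(signs, x)]     # undo signs, rescale by y

rng = random.Random(0)
d = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
dir_book = grid(-1.0, 1.0, bits=10)
norm_book = grid(0.0, 1.0, bits=10)
k = [rng.gauss(0.0, 1.0) for _ in range(d)]
y, codes = encode(k, signs, dir_book, norm_book)
k_hat = decode(y, codes, signs, dir_book, norm_book)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(k, k_hat))) / y
```

The 10-bit grids are chosen only so that the round trip is visibly accurate; at the 2 to 4 bits per coordinate discussed in this article, the choice of codebook and bit split dominates the error.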

A fused Triton kernel implementation reconstructs keys on the fly with no increased decode-time bandwidth or latency compared to per-coordinate dequantization, as only the packed bitstreams and small centroid tables are required.

6. Empirical Results and Comparative Performance

On synthetic Gaussian keys at 4 bits per coordinate, the per-vector MSE with block-wise triplet quantization is lower than that of per-coordinate TurboQuant, and at 2 bits it is lower than that of PolarQuant. The addition of a 1-bit QJL residual (OCTOPUS-QJL) yields a lower inner-product error than TurboQuant-QJL.

For long-context LLMs (Qwen2.5-7B on WikiText-2/C4), at 4 bits per coordinate, the perplexity (PPL) gap to fp16 is +2.7% for block-wise triplet quantization, compared to +3.1/4.4/8.0% for TurboQuant-MSE/PolarQuant/TurboQuant-QJL. At 2 bits, the PPL gap is +34.7% (vs. +63/187/772% for the same baselines), and retrieval recall remains robust, whereas the baselines fail.

For autoregressive video (Wan-1.3B DiT), at 2 bits per coordinate, worst-case LPIPS is lower with block-wise triplet quantization than with TurboQuant-QJL, whose worst-case value is indicative of visual noise. In next-scale autoregressive audio, OCTOPUS reports better LSD and SNR at 2 bits than the baselines, which degrade in LSD and fall to negative SNR.

Block-wise triplet quantization strictly outperforms prior per-coordinate codecs (TurboQuant, PolarQuant) in the extreme-compression regime (2–3 bits per coordinate), retaining a (smaller) lead at 4 bits. The addition of QJL further improves unbiased inner-product estimation at a modest bit overhead.

7. Summary and Implications

Block-wise triplet quantization, as formalized in OCTOPUS, generalizes rotation–Lloyd–Max codecs by partitioning the rotated feature space into three-dimensional blocks and encoding their norm and direction jointly via octahedral mapping. This achieves a dimension-dependent, strictly non-uniform bit allocation between each direction parameter and the norm, which is found to be MSE-optimal and data-oblivious. The fused implementation strategy, notably using Triton kernels, provides high computational efficiency with negligible extra peak memory and achieves higher compression ratios than baselines. Across the reported modalities and tasks, block-wise triplet quantization is superior in low-bit regimes and no worse than existing rotation-based quantizers at high bit widths (Boss et al., 20 May 2026).
