Pairwise Rotation Quantization (ParoQuant)

Updated 16 November 2025
  • Pairwise Rotation Quantization (ParoQuant) is a technique that leverages parameterized Givens rotations in high-dimensional spaces to decorrelate, rescale, and optimize vector and matrix quantization.
  • It utilizes block coordinate descent with analytic Givens updates to achieve differentiable, parallelizable rotation learning that minimizes quantization distortion and outlier effects.
  • ParoQuant is applied in trainable embedding indexes, transformer weight quantization, binary hashing for ANN, and key-value cache compression, consistently improving accuracy and computational efficiency.

Pairwise Rotation Quantization (ParoQuant) encompasses a family of techniques that utilize parameterized rotations—primarily Givens rotations—in high-dimensional spaces to decorrelate, rescale, and optimize the quantization of vectors and matrices. This approach has found application across trainable embedding indexes for retrieval, weight-only post-training quantization in transformer models, binary hashing for large-scale approximate nearest neighbor (ANN) search, and key-value cache compression in LLMs. At its core, ParoQuant leverages the structure of the special orthogonal group SO(n) and Lie group geometry to enable efficient, differentiable, and hardware-parallelizable transformations prior to quantization, substantially mitigating distortion and dynamic range issues introduced by outliers.

1. Mathematical Foundations: Givens Rotations and Parameterization of SO(n)

A Givens rotation is an elementary orthogonal transformation acting in a two-dimensional subspace (i,j) of ℝⁿ:

G_{i,j}(\theta) = I_n \quad \text{with the } (i,j) \text{ block } \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}

Any rotation matrix R ∈ SO(n) can be decomposed (non-uniquely) as a product of at most n(n-1)/2 Givens rotations (Jiang et al., 2022). This block structure admits selection of disjoint pairs of indices for parallel application:

R = \prod_{(i,j)\in S} G_{i,j}(\theta_{i,j})

Crucially, this parameterization supports efficient, differentiable coordinate-wise updates. Givens rotations preserve strict orthonormality and allow composition of sparse, commutative blocks, facilitating parallelization during quantization and optimization. Pairwise rotation also underlies sparse hashing transforms and the polar quantization of two-dimensional key groups after rotary position embedding (RoPE), further generalizing its impact (Ishikawa et al., 2015, Wu et al., 1 Feb 2025).
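
As a concrete illustration, the short NumPy sketch below (helper names are ours, not from any released implementation) builds a single Givens factor and applies a product of factors over disjoint index pairs, checking that the independent 2-D block updates reproduce multiplication by the full rotation matrix.

```python
import numpy as np

def givens_rotation(n, i, j, theta):
    """Return the n x n Givens rotation G_{i,j}(theta): identity except in the (i, j) plane."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

def apply_disjoint_rotations(X, pairs, thetas):
    """Apply prod_{(i,j) in S} G_{i,j}(theta_{ij}) to the rows of X.

    Because the index pairs are disjoint, the factors commute and each
    2-D block can be updated independently (and hence in parallel).
    """
    Y = X.copy()
    for (i, j), theta in zip(pairs, thetas):
        c, s = np.cos(theta), np.sin(theta)
        xi, xj = Y[:, i].copy(), Y[:, j].copy()
        Y[:, i] = c * xi + s * xj      # column i of G_{i,j}
        Y[:, j] = -s * xi + c * xj     # column j of G_{i,j}
    return Y

# Usage: rotate 4-dimensional vectors with two disjoint planes, (0, 1) and (2, 3).
X = np.random.default_rng(0).standard_normal((5, 4))
pairs, thetas = [(0, 1), (2, 3)], [0.3, -1.2]
Y = apply_disjoint_rotations(X, pairs, thetas)
R = givens_rotation(4, 0, 1, 0.3) @ givens_rotation(4, 2, 3, -1.2)
assert np.allclose(Y, X @ R)
```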

2. Quantization Distortion Objectives and Rotation Optimization

Quantization seeks to minimize the discrepancy between continuous vectors X and their quantized images φ(X). When a learnable rotation R precedes quantization, the transformed objective is:

L(X; R) = L_\text{ret}(T(X)) + \frac{1}{m} \|X R - \phi(X R)\|_F^2

where T(X) = φ(XR)R^T and L_ret denotes the retrieval loss (e.g., cross-entropy, hinge loss), while the second term quantifies the Euclidean distortion induced by PQ or scalar quantization (Jiang et al., 2022). For Pairwise Quantization (PairQ), loss functions may additionally target pairwise inner products or distances:

L_{sp} = \sum_{i,j} \left(q_i^T x_j - q_i^T \hat x_j\right)^2 = \sum_j (x_j - \hat x_j)^T G (x_j - \hat x_j)

where G = Σ_i q_i q_i^T acts as a symmetric metric, yielding a transformed distortion objective in the latent basis (Babenko et al., 2016). PolarQuant extends this formulation by representing pairs as (r, θ) in polar coordinates and performing quantization per block, dramatically reducing outlier impact (Wu et al., 1 Feb 2025).
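
To make the distortion term concrete, the toy sketch below evaluates (1/m)·||XR - φ(XR)||²_F with a simple per-column uniform quantizer standing in for φ; the quantizer, bit width, and data are illustrative stand-ins, not the PQ or polar codecs of the cited papers.

```python
import numpy as np

def uniform_quantize(Z, bits=4):
    """Toy per-column uniform quantizer standing in for phi: round onto a 2^bits-level grid."""
    lo, hi = Z.min(axis=0, keepdims=True), Z.max(axis=0, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    return np.round((Z - lo) / scale) * scale + lo

def rotated_distortion(X, R, bits=4):
    """The distortion term (1/m) * ||X R - phi(X R)||_F^2 from the objective above."""
    Z = X @ R
    return float(np.sum((Z - uniform_quantize(Z, bits)) ** 2) / X.shape[0])

# Comparing the identity with a single Givens rotation in the (0, 1) plane; the learning
# procedure of Section 3 searches over such rotations (and scalings) to shrink this term.
X = np.random.default_rng(0).standard_normal((1024, 8))
R = np.eye(8)
theta = 0.4
R[0, 0] = R[1, 1] = np.cos(theta)
R[0, 1], R[1, 0] = -np.sin(theta), np.sin(theta)
print("identity:", rotated_distortion(X, np.eye(8)), "  rotated:", rotated_distortion(X, R))
```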

3. Block Coordinate Descent Methods for Rotation Learning

Learning rotations within SO(n) involves optimizing over a nonlinear manifold. ParoQuant employs a Givens-based block coordinate descent (GCD) with an analytic update for each rotation block:

  • Compute the gradient G = ∂L/∂(XR)
  • Form the skew-symmetric matrix A = G^T R - R^T G
  • For each disjoint pair (i, j), compute the analytic gradient g_ij = A_ij/√2
  • Update R ← R ∏_{l=1}^{n/2} G_{i_l, j_l}(-λ g_{i_l, j_l})

No re-orthogonalization is required, as each update preserves the group structure (Jiang et al., 2022). Selection of pairs can be random (GCD-R), by descending |A_ij| (GCD-G), or by matching algorithms (GCD-S). For ParoQuant in PTQ, sequences of K sparse, independent rotations, each involving N ≈ g/2 channel pairs in blocks of size g, are jointly optimized with diagonal scaling to minimize post-quantization output error (Liang et al., 13 Nov 2025).
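
A minimal GCD-R iteration can be sketched as follows; it assumes G in the listing above denotes the Euclidean gradient with respect to R (recovered from ∂L/∂(XR) by the chain rule), and the toy objective, learning rate, and random pairing are illustrative choices rather than the papers' configurations.

```python
import numpy as np

def givens_factor(n, i, j, theta):
    """Elementary Givens rotation G_{i,j}(theta) as a dense n x n matrix."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

def gcd_r_step(X, R, grad_wrt_R, lr=0.1, rng=None):
    """One GCD-R iteration: random disjoint pairs, analytic per-pair Givens updates.

    grad_wrt_R(X, R) returns the Euclidean gradient dL/dR (n x n). Every factor keeps
    R exactly in SO(n), so no re-orthogonalization step is ever needed.
    """
    rng = rng or np.random.default_rng()
    n = R.shape[0]
    E = grad_wrt_R(X, R)
    A = E.T @ R - R.T @ E                      # skew-symmetric; A_ij = dL/dtheta for plane (i, j)
    perm = rng.permutation(n)
    for l in range(n // 2):                    # disjoint pairs -> commuting, parallelizable factors
        i, j = perm[2 * l], perm[2 * l + 1]
        g_ij = A[i, j] / np.sqrt(2.0)
        R = R @ givens_factor(n, i, j, -lr * g_ij)
    return R

# Toy objective: recover a hidden rotation R_true by minimizing (1/m) ||X R - X R_true||_F^2.
rng = np.random.default_rng(0)
n, m = 16, 512
X = rng.standard_normal((m, n))
R_true = np.eye(n)
for (i, j, t) in [(0, 1, 0.4), (2, 3, -0.7), (4, 9, 0.25)]:
    R_true = R_true @ givens_factor(n, i, j, t)
grad = lambda X, R: (2.0 / m) * X.T @ (X @ R - X @ R_true)

R = np.eye(n)
for _ in range(300):
    R = gcd_r_step(X, R, grad, lr=0.1, rng=rng)
print("final loss:", np.mean((X @ R - X @ R_true) ** 2))
print("orthogonality drift:", np.linalg.norm(R.T @ R - np.eye(n)))
```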

4. Integration into Retrieval Models and Weight-Only Quantization

In retrieval architectures, ParoQuant inserts the learned rotation before product quantization and inverts it after index retrieval, interleaving network parameter updates, PQ centroid updates, and SO(n) rotation steps in a single autodiff loop (Jiang et al., 2022). In weight-only PTQ for LLMs, ParoQuant applies invertible pairwise rotations and channel-wise scaling to weight blocks (a toy end-to-end sketch appears at the end of this section):

  1. Transform with T = R_K ⋯ R_1 · S
  2. Quantize weights with per-block linear quantizer
  3. At inference, invert rotations and scaling prior to GEMM

This co-design enables batching of rotations and scaling in a single, GPU-fused kernel, achieving runtime overhead below 10% and maximizing parallel efficiency (Liang et al., 13 Nov 2025).
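
The numbered pipeline can be exercised end to end with the toy sketch below; the rotations are drawn at random, the diagonal scale is a simple norm equalizer, and the group quantizer is a crude min-max codec, so this shows only the transform/quantize/invert mechanics rather than the optimized rotations or fused GPU kernel described in the paper.

```python
import numpy as np

def random_paired_rotation(n, rng):
    """One sparse rotation: disjoint Givens factors over a random pairing of the n channels."""
    R = np.eye(n)
    perm = rng.permutation(n)
    for l in range(n // 2):
        i, j = perm[2 * l], perm[2 * l + 1]
        theta = rng.uniform(-np.pi, np.pi)
        c, s = np.cos(theta), np.sin(theta)
        R[i, i] = R[j, j] = c
        R[i, j], R[j, i] = -s, s
    return R

def quantize_groups(W, bits=4, g=128):
    """Toy per-group min-max linear quantizer along the input dimension."""
    Wq = W.copy()
    for r0 in range(0, W.shape[0], g):
        blk = W[r0:r0 + g]
        lo, hi = blk.min(), blk.max()
        scale = (hi - lo) / (2 ** bits - 1) + 1e-12
        Wq[r0:r0 + g] = np.round((blk - lo) / scale) * scale + lo
    return Wq

rng = np.random.default_rng(0)
n_in, n_out, K = 256, 64, 4
W = rng.standard_normal((n_in, n_out))

# Step 1: build T = R_K ... R_1 * S (random rotations plus a norm-equalizing diagonal scale
# here; in ParoQuant both are optimized on calibration data rather than drawn at random).
S = np.diag(1.0 / (np.abs(W).mean(axis=1) + 1e-6))
T = np.eye(n_in)
for _ in range(K):
    T = random_paired_rotation(n_in, rng) @ T
T = T @ S

# Step 2: quantize the transformed weight with a per-block linear quantizer.
Wq = quantize_groups(T @ W)

# Step 3: at inference, invert the rotations and scaling before the GEMM
# (the structured inverse S^{-1} R_1^T ... R_K^T is cheap; np.linalg.inv is used for brevity).
x = rng.standard_normal((8, n_in))
y_ref = x @ W
y_hat = (x @ np.linalg.inv(T)) @ Wq
print("relative output error:", np.linalg.norm(y_hat - y_ref) / np.linalg.norm(y_ref))
```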

5. Dynamic Range Reduction and Outlier Suppression

A central motivation for ParoQuant is the mitigation of outliers, which in conventional quantization inflate dynamic ranges and degrade accuracy. Pairwise rotations redistribute magnitude, compressing outlier-heavy channels into the main bulk and narrowing per-block dynamic range by 20–50% (Liang et al., 13 Nov 2025). Channel-wise scaling further homogenizes row norms. In PolarQuant, the transformation to (r, θ) disperses outlier values in (x_1, x_2) into moderate radii and wrap-around angles, yielding well-behaved histograms suitable for aggressive quantization (Wu et al., 1 Feb 2025).
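
The effect can be seen directly on synthetic data: injecting a few large outliers into a key vector and regrouping it into 2-D pairs shows that the angle coordinate is always bounded in (-π, π], while only the radius retains a wide (and now non-negative) range. The numbers below are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.standard_normal(4096)
keys[rng.choice(4096, size=16, replace=False)] *= 50.0   # inject a few outlier channels

pairs = keys.reshape(-1, 2)                              # 2-D groups (e.g., RoPE key pairs)
r = np.hypot(pairs[:, 0], pairs[:, 1])
theta = np.arctan2(pairs[:, 1], pairs[:, 0])

print(f"cartesian range: [{pairs.min():.1f}, {pairs.max():.1f}]")  # wide, signed, outlier-dominated
print(f"radius range   : [{r.min():.1f}, {r.max():.1f}]")          # non-negative, half the spread
print(f"angle range    : [{theta.min():.2f}, {theta.max():.2f}]")  # always inside (-pi, pi]
```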

6. Empirical Performance and Computational Efficiency

Experimental results across domains consistently demonstrate ParoQuant's empirical benefits:

| Task | ParoQuant Metrics | Baseline Comparison |
| --- | --- | --- |
| SIFT1M retrieval (Jiang et al., 2022) | 5–10% lower distortion, ~0.2% p@100 gain | Comparable to OPQ-SVD |
| Reasoning LLMs (Liang et al., 13 Nov 2025) | 0.9% avg. accuracy drop vs. FP16; 2.4% better than AWQ | AWQ: 2.8% accuracy drop |
| Decoding throughput (Liang et al., 13 Nov 2025) | 1.6–2.6× speedup over QTIP, 10% overhead vs. AWQ | QTIP 15–30% slower |
| Key cache quantization (Wu et al., 1 Feb 2025) | KV cache at 4.16 bits/scalar, 1.27× faster dot-products vs. FP16 GEMM | KVQuant/KIVI slower |

For rotation learning, block GCD achieves per-iteration complexity O(n²) (fully parallelizable), and GCD-R runs tens of times faster than SVD- or Cayley-based alternatives on GPU (Jiang et al., 2022). In weight-only PTQ, ParoQuant's fused transform kernel outpaces Hadamard-based transforms by 3–5×, and calibration converges in roughly 200 steps on 128 samples (Liang et al., 13 Nov 2025). Pairwise Quantization (PairQ) is unbiased for scalar products and squared distances, surpassing OPQ in error reduction by 15–43% across varying byte budgets (Babenko et al., 2016).

7. Theoretical Guarantees and Implementation Considerations

Theoretical results ensure that block GCD yields sublinear O(1/k) convergence for geodesically convex objectives on SO(n) (Jiang et al., 2022). PairQ recasts complex pairwise objectives as Euclidean reconstruction in a transformed space, establishing zero bias for both scalar products and distances (Babenko et al., 2016). Sparse pairwise rotation hashing attains O(n log n) encoding cost, outperforming dense schemes in high-dimensional visual search (Ishikawa et al., 2015). Practical deployments benefit from automatic structure preservation in SO(n), compatibility with autodiff, and effective scaling to large blocks and GPU workloads.

Salient implementation guidelines include the choice of rotation block size (g = 128 for PTQ), the number of independent rotations (K = 8 for LLMs), selection of disjoint pairs for maximal parallelism, and calibration/fine-tuning schedules. For hashing, the factorization list is stored sparsely, each factor is applied as an in-place update, and encoding complexity remains O(n log n) due to the log-linear factor count. For quantization of LLM KV caches, encoding and decoding each 2-D block in (r, θ) and leveraging lookup-table inner products yields maximal efficiency with minimal downstream performance loss (Wu et al., 1 Feb 2025).
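
The role of lookup-table inner products can be illustrated with a hedged sketch: assuming keys are stored as a per-pair radius plus a low-bit angle code on a uniform grid (an illustrative layout, not necessarily the exact PolarQuant format), each query needs only one small table of q1·cos θ_c + q2·sin θ_c per 2-D block, and the dot product reduces to table lookups scaled by the stored radii.

```python
import numpy as np

THETA_BITS = 4
THETA_GRID = -np.pi + (np.arange(2 ** THETA_BITS) + 0.5) * (2 * np.pi / 2 ** THETA_BITS)

def encode_key_pairs(k):
    """Encode 2-D key groups as (radius, angle code); radii are kept at higher precision here."""
    pairs = k.reshape(-1, 2)
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    codes = np.round((theta + np.pi) / (2 * np.pi) * 2 ** THETA_BITS - 0.5)
    return r, np.clip(codes, 0, 2 ** THETA_BITS - 1).astype(int)

def lut_dot(q, r, codes):
    """Approximate q . k via per-query lookup tables: one row of q1*cos + q2*sin per 2-D block."""
    qp = q.reshape(-1, 2)
    tables = qp[:, :1] * np.cos(THETA_GRID) + qp[:, 1:] * np.sin(THETA_GRID)  # (n_blocks, 2^bits)
    return float(np.sum(r * tables[np.arange(len(codes)), codes]))

q = np.random.default_rng(1).standard_normal(128)
k = np.random.default_rng(2).standard_normal(128)
r, codes = encode_key_pairs(k)
print("exact dot:", float(q @ k), "  lut dot:", lut_dot(q, r, codes))
```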

Summary and Contextual Position

Pairwise Rotation Quantization, across variants such as ParoQuant, PolarQuant, sparse rotation hashing, and pairwise quantization, provides a mathematically principled, computationally efficient solution for high-quality quantization in the presence of outlier-induced dynamic range. Its foundations in Givens rotations and SO(n) enable differentiable, group-consistent transformations. Empirically, ParoQuant approaches or exceeds state-of-the-art retrieval, accuracy, and throughput metrics in ANN, LLM inference, and large-scale visual data search (Jiang et al., 2022, Liang et al., 13 Nov 2025, Wu et al., 1 Feb 2025, Babenko et al., 2016, Ishikawa et al., 2015). This broad applicability attests to its centrality in modern quantization and indexing pipelines.
