RaBitQ: Rotation-Based Quantization for ANN Search

Updated 4 July 2026

RaBitQ is a rotation-based quantization method for high-dimensional vectors, designed to compress data and accurately estimate inner products in ANN systems.
It uses random orthogonal rotations, bi-valued codebooks, and scalar corrections to achieve unbiased similarity estimates with sharp high-probability error bounds.
Extensions and integrations such as ExRaBitQ, GPU- and NPU-native IVF-RaBitQ systems, and RaBitQCache bring multi-bit codes, accelerator pipelines, and LLM inference into scope, with reported gains in speed, memory efficiency, and retrieval accuracy.

RaBitQ is a rotation-based quantization method for high-dimensional vectors, developed for approximate nearest neighbor search in Euclidean space and later generalized into a broader family of inner-product and distance sketches for retrieval systems, GPU/NPU ANN pipelines, and long-context LLM inference. In its original ANN formulation, RaBitQ quantizes a $D$-dimensional vector into a $D$-bit string, stores a small scalar correction term, and uses an unbiased estimator with a sharp high-probability error bound. Later work extends the construction to arbitrary $B$ bits per dimension, integrates it with IVF on GPUs and NPUs, and repurposes rotated binary quantization as an attention proxy in KV-cache sparsification (Gao et al., 2024, Gao et al., 2024, Shi et al., 27 Feb 2026, He et al., 15 May 2026, Li et al., 30 Jun 2026).

1. Origins and problem setting

RaBitQ was introduced for approximate nearest neighbor (ANN) query processing in high-dimensional Euclidean space, where compressing vectors and estimating distances quickly are central system concerns. The 2024 paper "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search" explicitly motivates the method by contrasting it with Product Quantization and related variants, which were described there as efficient in practice but lacking a theoretical error bound and sometimes failing disastrously on real-world datasets (Gao et al., 2024).

The original construction targets the setting in which raw vectors $o_r \in \mathbb{R}^D$ are centered around a reference point $c$, normalized, randomly rotated, and then mapped onto a bi-valued codebook. The objective is not reconstruction in the usual lossy-compression sense, but accurate estimation of the inner product between normalized residual directions, because the squared Euclidean distance after centering decomposes into norms and a cosine term. This design choice makes RaBitQ a retrieval-oriented quantizer rather than a generic reconstruction quantizer in the narrow sense used by some scalar or product codebooks (Gao et al., 2024).
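Writing $o = (o_r - c)/\|o_r - c\|$ and $q = (q_r - c)/\|q_r - c\|$ for the normalized residuals of a data vector and a query, the decomposition referred to here is $\|o_r - q_r\|^2 = \|o_r - c\|^2 + \|q_r - c\|^2 - 2\,\|o_r - c\|\,\|q_r - c\|\,\langle o, q \rangle$. The following minimal sketch checks this identity numerically; the function and variable names are illustrative and not taken from any RaBitQ implementation.

```python
import math
import random

def centered_distance_sq(o_r, q_r, c):
    """Rebuild ||o_r - q_r||^2 from two residual norms and one cosine term."""
    o_res = [a - b for a, b in zip(o_r, c)]
    q_res = [a - b for a, b in zip(q_r, c)]
    o_norm = math.sqrt(sum(x * x for x in o_res))
    q_norm = math.sqrt(sum(x * x for x in q_res))
    # Inner product <o, q> of the normalized residual directions.
    cos_term = sum(x * y for x, y in zip(o_res, q_res)) / (o_norm * q_norm)
    return o_norm ** 2 + q_norm ** 2 - 2.0 * o_norm * q_norm * cos_term

rng = random.Random(0)
D = 16
o_r = [rng.gauss(0.0, 1.0) for _ in range(D)]
q_r = [rng.gauss(0.0, 1.0) for _ in range(D)]
c = [rng.gauss(0.0, 1.0) for _ in range(D)]

direct = sum((a - b) ** 2 for a, b in zip(o_r, q_r))
rebuilt = centered_distance_sq(o_r, q_r, c)
```

Only the cosine term depends jointly on the data vector and the query, which is why the quantizer needs to estimate nothing beyond $\langle o, q \rangle$; the two norms can be computed once per vector.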

Subsequent work broadened both the algorithmic scope and the deployment regimes. Gao et al. introduced ExRaBitQ, an extension that supports arbitrary $B$ bits per dimension rather than only the original 1-bit regime while preserving the theoretical guarantees, and proved its asymptotic optimality for the space–error trade-off (Gao et al., 2024). Later system papers embedded RaBitQ in GPU-native IVF pipelines, in a heterogeneous NPU–CPU IVF-RaBitQ architecture, and in GPU-native graph ANN systems such as Jasper (Shi et al., 27 Feb 2026, He et al., 15 May 2026, McCoy et al., 11 Jan 2026). A separate line of work then adapted rotated binary quantization to KV-cache sparsification for long-context LLM inference under the name RaBitQCache (Li et al., 30 Jun 2026).

2. Core quantization constructions

In the original ANN formulation, the codebook is the hypercube

$C = \{\pm 1/\sqrt{D}\}^D,$

and a random orthogonal rotation $P \in \mathbb{R}^{D \times D}$ is sampled. For each normalized data vector $o$, RaBitQ finds the nearest codeword in the rotated codebook

$C_r = \{P x : x \in C\},$

equivalently by computing $P^{-1} o$ and taking the sign of each coordinate. The stored representation consists of the corresponding $D$-bit pattern and a scalar correction term such as $\langle \bar{o}, o \rangle$, where $\bar{o} = P\bar{x}$ denotes the selected codeword and $\bar{x} \in C$ its unrotated sign pattern (Gao et al., 2024).

At query time, the method rotates the normalized query $q$ into $q' = P^{-1} q$, evaluates the inner product between $q'$ and the stored codeword, and forms the estimator

$\widehat{\langle o, q \rangle} = \frac{\langle \bar{o}, q \rangle}{\langle \bar{o}, o \rangle} = \frac{\langle \bar{x}, q' \rangle}{\langle \bar{o}, o \rangle}.$

This estimator is then substituted into the centered Euclidean distance formula. The paper also gives a query-side randomized uniform scalar quantization scheme for $q'$, yielding a quantized query $\bar{q}$, so the inner product $\langle \bar{x}, \bar{q} \rangle$ can be computed with bitwise operations or SIMD-based operations (Gao et al., 2024).
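The 1-bit pipeline can be sketched end to end in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the estimator is the ratio $\langle \bar{o}, q \rangle / \langle \bar{o}, o \rangle$ with $\bar{o}$ the selected codeword, builds the rotation by Gram-Schmidt on Gaussian vectors, and quantizes the rotated query with randomized rounding. All function names and the 4-bit query width are assumptions made for the example.

```python
import math
import random

def random_rotation(d, rng):
    """Rows of a random orthogonal matrix (Gram-Schmidt on Gaussian vectors)."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def apply(rows, x):
    """Matrix-vector product; `rows` plays the role of P^{-1}."""
    return [sum(a * b for a, b in zip(r, x)) for r in rows]

def unit(x):
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def quantize_data(rows, o):
    """Index side: D sign bits plus the scalar <o_bar, o>."""
    o_rot = apply(rows, o)
    bits = [1 if v >= 0.0 else 0 for v in o_rot]
    # The codeword has entries sign(o_rot_i)/sqrt(D), so <o_bar, o> = sum|o_rot_i|/sqrt(D).
    corr = sum(abs(v) for v in o_rot) / math.sqrt(len(o))
    return bits, corr

def quantize_query(q_rot, n_bits, rng):
    """Randomized uniform scalar quantization of the rotated query."""
    lo, hi = min(q_rot), max(q_rot)
    top = 2 ** n_bits - 1
    step = (hi - lo) / top
    # floor(t + u) with u ~ U[0, 1) rounds t up or down without bias.
    codes = [min(top, int((v - lo) / step + rng.random())) for v in q_rot]
    return codes, lo, step

def estimate(bits, corr, codes, lo, step):
    """Estimate <o, q> from sign bits and integer query codes only."""
    d = len(bits)
    # With signs s_i = 2*b_i - 1 and q_i ~ lo + step*code_i:
    #   sum_i s_i*q_i = step*(2*sum_{b_i=1} code_i - sum_i code_i) + lo*(2*sum_i b_i - d)
    selected = sum(c for b, c in zip(bits, codes) if b)
    signed_codes = 2 * selected - sum(codes)
    signed_ones = 2 * sum(bits) - d
    dot = (step * signed_codes + lo * signed_ones) / math.sqrt(d)
    return dot / corr

rng = random.Random(7)
D = 64
P_inv = random_rotation(D, rng)
o = unit([rng.gauss(0.0, 1.0) for _ in range(D)])
q = unit([a + 0.1 * rng.gauss(0.0, 1.0) for a in o])  # a query correlated with o

bits, corr = quantize_data(P_inv, o)
codes, lo, step = quantize_query(apply(P_inv, q), 4, rng)
exact = sum(a * b for a, b in zip(o, q))
approx = estimate(bits, corr, codes, lo, step)
```

The data side never needs the floating-point codeword: the estimate uses only integer sums over the sign bits and query codes, followed by two scalar operations. In an optimized kernel the sum over selected codes would be evaluated with bitwise operations rather than a Python loop.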

Later comparative work presents a more generic RaBitQ formulation for inner-product sketching. There, a random orthonormal matrix $B$ 0 is applied once to every data vector, producing $B$ 1. With bit-width $B$ 2 and a scaling factor $B$ 3, the quantizer stores

$B$ 4

together with one extra scalar

$B$ 5

Given a query $B$ 6, the same rotation yields $B$ 7, and the decode-free estimator is

$B$ 8

This formulation emphasizes that RaBitQ requires no codebook lookup or floating-point decoding of $B$ 9; the implementation uses integer multiplies and adds plus two scalar multiplies (Gao et al., 21 Apr 2026).

The IVF-oriented 1-bit residual variant used in Ascend-RaBitQ is more specialized. For a vector $o_r \in \mathbb{R}^D$ 0 assigned to centroid $o_r \in \mathbb{R}^D$ 1, the method rotates both vectors, quantizes the residual by

$o_r \in \mathbb{R}^D$ 2

and precomputes two scalar constants: $o_r \in \mathbb{R}^D$ 3 The approximate distance for a query $o_r \in \mathbb{R}^D$ 4 in the same cluster is

$o_r \in \mathbb{R}^D$ 5

Using the identity between signed Hamming distance and inner product, the system reduces coarse ranking to one dense matrix–vector inner product plus a per-vector constant, which is the kernel mapped to NPU Cube Units (He et al., 15 May 2026).
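For sign vectors $a, b \in \{-1, +1\}^D$, the identity in question is $\langle a, b \rangle = D - 2\, d_H(a, b)$, where $d_H$ counts the coordinates in which the signs differ. The sketch below verifies it with XOR and popcount on packed bits. It illustrates the arithmetic only; how Ascend-RaBitQ maps the resulting product onto Cube Units is not shown, and the helper names are hypothetical.

```python
import random

def pack_signs(signs):
    """Pack a +/-1 vector into one integer, one bit per coordinate (1 means +1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def inner_product_from_bits(a_bits, b_bits, d):
    """<a, b> for a, b in {-1, +1}^d via XOR and popcount."""
    hamming = bin(a_bits ^ b_bits).count("1")
    return d - 2 * hamming

rng = random.Random(1)
D = 128
a = [rng.choice((-1, 1)) for _ in range(D)]
b = [rng.choice((-1, 1)) for _ in range(D)]

direct = sum(x * y for x, y in zip(a, b))
via_bits = inner_product_from_bits(pack_signs(a), pack_signs(b), D)
```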

3. Theoretical guarantees and rate–error trade-offs

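A defining property of RaBitQ is that the similarity estimator is unbiased. In the original 1-bit ANN paper, the estimator $\langle \bar{o}, q \rangle / \langle \bar{o}, o \rangle$, with $\bar{o}$ the selected codeword, satisfies

$\mathbb{E}\left[\frac{\langle \bar{o}, q \rangle}{\langle \bar{o}, o \rangle}\right] = \langle o, q \rangle,$

and the paper gives a high-probability additive error bound of the form

$\Pr\left[\left|\frac{\langle \bar{o}, q \rangle}{\langle \bar{o}, o \rangle} - \langle o, q \rangle\right| > \sqrt{\frac{1 - \langle \bar{o}, o \rangle^2}{\langle \bar{o}, o \rangle^2}} \cdot \frac{\epsilon_0}{\sqrt{D - 1}}\right] \le 2 e^{-c_0 \epsilon_0^2}$

for some absolute constant $c_0 > 0$. The resulting error scale is therefore $O(1/\sqrt{D})$, and this carries over to the induced squared-distance estimator after reinserting norms (Gao et al., 2024).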
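Both properties can be probed with a small Monte Carlo experiment. The sketch below is a simulation under stated assumptions, not a reproduction of the paper's experiments: instead of drawing a rotation matrix, it samples the rotated pair $(P^{-1} o, P^{-1} q)$ directly as a uniformly random unit vector and a second unit vector at a fixed inner product $\rho$, which has the same distribution for a uniformly random orthogonal $P$. The helper names and the choice $\rho = 0.5$ are illustrative.

```python
import math
import random

def unit(x):
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def sample_rotated_pair(d, rho, rng):
    """Draw (P^{-1} o, P^{-1} q) with <o, q> = rho for a uniformly random rotation."""
    o_rot = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    proj = sum(a * b for a, b in zip(g, o_rot))
    e = unit([a - proj * b for a, b in zip(g, o_rot)])  # unit vector orthogonal to o_rot
    s = math.sqrt(1.0 - rho * rho)
    return o_rot, [rho * a + s * b for a, b in zip(o_rot, e)]

def one_bit_estimate(o_rot, q_rot):
    """Ratio estimator with sign codes; the 1/sqrt(D) factors cancel."""
    num = sum(v if u >= 0.0 else -v for u, v in zip(o_rot, q_rot))
    den = sum(abs(u) for u in o_rot)
    return num / den

def error_stats(d, rho, trials, rng):
    """Empirical bias and standard deviation of the estimation error."""
    errs = [one_bit_estimate(*sample_rotated_pair(d, rho, rng)) - rho
            for _ in range(trials)]
    bias = sum(errs) / trials
    std = math.sqrt(sum((e - bias) ** 2 for e in errs) / trials)
    return bias, std

rng = random.Random(3)
rho = 0.5
bias_64, std_64 = error_stats(64, rho, 1000, rng)
bias_256, std_256 = error_stats(256, rho, 1000, rng)
```

If the $O(1/\sqrt{D})$ scale holds, quadrupling the dimension should roughly halve the standard deviation, while the empirical bias stays near zero at both dimensions.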

The later symmetric comparison with TurboQuant states the guarantees in a bit-complexity form. There, RaBitQ’s inner-product estimator is again described as unbiased,

$c$ 1

and is said to match the optimal sub-Gaussian tail behavior up to constants: $c$ 2 for some $c$ 3. Consequently, the required bit-width to ensure $c$ 4-error with confidence $c$ 5 is

$c$ 6

The same note contrasts this with TurboQuant’s variance-only control, which via Chebyshev’s inequality forces the suboptimal scaling $c$ 7 (Gao et al., 21 Apr 2026).

ExRaBitQ generalizes the original construction to arbitrary $B$ bits per dimension using a uniform integer grid

$G = \left\{ -\frac{2^B - 1}{2} + u : u = 0, 1, \ldots, 2^B - 1 \right\}^D,$

normalized and then rotated. The key result is a space–error optimality statement: when $B$ 0, choosing

$B$ 1

suffices to guarantee additive error at most $B$ 2 with probability at least $B$ 3, matching the relevant lower bound up to constants (Gao et al., 2024).

Theoretical comparisons after 2024 stress that RaBitQ’s advantages are criterion-dependent rather than absolute. "Block-Sphere Vector Quantization" states that EDEN and TurboQuant are favorable for MSE distortion, EDEN is also effective for expected inner-product distortion, and RaBitQ provides strong high-probability control (Ann et al., 19 May 2026). A plausible implication is that RaBitQ’s distinguishing theoretical feature is not universal dominance under every distortion objective, but unusually strong control in the high-probability regime most directly aligned with retrieval guarantees.

4. System realizations for ANN search

RaBitQ has been realized in several distinct ANN systems, each targeting a different system-level bottleneck.

System	Context	Reported result
ExRaBitQ	Arbitrary- $B$ 4 ANN with IVF and two-stage pruning	$B$ 5 bits $B$ 6 Recall $B$ 7; $B$ 8 bits $B$ 9; $C = \{\pm 1/\sqrt{D}\}^D,$ 0 bits $C = \{\pm 1/\sqrt{D}\}^D,$ 1 without re-ranking
Jasper	GPU-native Vamana with RaBitQ-compressed vectors	Up to $C = \{\pm 1/\sqrt{D}\}^D,$ 2 memory compression; up to $C = \{\pm 1/\sqrt{D}\}^D,$ 3 higher query throughput than CAGRA
IVF-RaBitQ (GPU)	GPU-native IVF integrated into NVIDIA cuVS Library	At Recall approximately equal to $C = \{\pm 1/\sqrt{D}\}^D,$ 4, $C = \{\pm 1/\sqrt{D}\}^D,$ 5 higher QPS than CAGRA; indices $C = \{\pm 1/\sqrt{D}\}^D,$ 6 faster to construct on average
Ascend-RaBitQ	Heterogeneous NPU–CPU optimized IVF-RaBitQ	$C = \{\pm 1/\sqrt{D}\}^D,$ 7 to $C = \{\pm 1/\sqrt{D}\}^D,$ 8 faster index construction than CPU baseline; up to $C = \{\pm 1/\sqrt{D}\}^D,$ 9 throughput improvement over the fastest CPU IVF-RaBitQ

ExRaBitQ retains the RaBitQ estimator structure while replacing the 1-bit hypercube with an arbitrary- $P \in \mathbb{R}^{D \times D}$ 0 grid codebook and a critical-scale enumeration algorithm. The quantizer searches only $P \in \mathbb{R}^{D \times D}$ 1 critical scales, with practical complexity $P \in \mathbb{R}^{D \times D}$ 2 for $P \in \mathbb{R}^{D \times D}$ 3 and $P \in \mathbb{R}^{D \times D}$ 4. At query time it uses a two-stage scheme: first the most significant bits are used for a cheap estimate and pruning, then the remaining bits refine surviving candidates. The same paper reports that ExRaBitQ(4) runs about $P \in \mathbb{R}^{D \times D}$ 5 faster than LVQ(4) at the same recall, and on MSMARCO-100M ExRaBitQ(5) achieves $P \in \mathbb{R}^{D \times D}$ 6 recall at about $P \in \mathbb{R}^{D \times D}$ 7 QPS using $P \in \mathbb{R}^{D \times D}$ 8 GB rather than raw $P \in \mathbb{R}^{D \times D}$ 9 GB (Gao et al., 2024).
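The arithmetic that makes a two-stage scheme possible is that an unsigned $B$-bit code splits exactly into its most significant bits and a remainder, so a dot product against the query splits the same way. The sketch below illustrates only that split. The function names, the integer-valued query, and the slack bound are illustrative assumptions; ExRaBitQ’s actual pruning relies on its own error bound rather than on the worst-case slack used here.

```python
import random

def split_codes(codes, total_bits, msb_bits):
    """Split unsigned codes into most significant bits and the remaining bits."""
    shift = total_bits - msb_bits
    high = [c >> shift for c in codes]
    low = [c & ((1 << shift) - 1) for c in codes]
    return high, low, shift

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(5)
D, B = 32, 4
codes = [rng.randrange(1 << B) for _ in range(D)]
query = [rng.randrange(-8, 8) for _ in range(D)]  # hypothetical integer query codes

high, low, shift = split_codes(codes, B, 1)

# Stage 1: cheap estimate from the most significant bit of each code.
coarse = dot(high, query) * (1 << shift)
# The ignored low bits can change the result by at most this much.
slack = ((1 << shift) - 1) * sum(abs(v) for v in query)

# Stage 2: refine surviving candidates with the remaining bits.
refined = coarse + dot(low, query)
full = dot(codes, query)
```

A candidate whose coarse score cannot reach the current threshold even after adding the slack can be discarded without reading its remaining bits, which is the role the first stage plays.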

Jasper incorporates RaBitQ into a GPU-native Vamana graph index. Its rationale is explicitly architectural: product quantization introduces random one-byte table lookups that interact poorly with GPU caches, whereas RaBitQ stores $o$ 0 bits plus two 32-bit metadata floats per vector in a sequential, coalesced bit-packed layout. Jasper reports up to $o$ 1 memory compression when $o$ 2, up to $o$ 3 higher query throughput than CAGRA, average construction $o$ 4 faster than CAGRA, and 19–131x faster queries than BANG; on the 960-dimensional Gist dataset, RaBitQ reaches $o$ 5 M queries/sec versus exact search at approximately $o$ 6 M queries/sec at Recall@50 $o$ 7 (McCoy et al., 11 Jan 2026).
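The per-vector record described here can be illustrated with a small serialization sketch. It assumes the two metadata floats are a correction term and a norm; the names `corr` and `norm` are hypothetical, since the text above does not say which scalars Jasper stores, and the byte order and field order are likewise choices made for the example.

```python
import struct

def pack_record(bits, corr, norm):
    """Serialize D sign bits followed by two float32 scalars."""
    d = len(bits)
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    payload = word.to_bytes((d + 7) // 8, "little")
    return payload + struct.pack("<ff", corr, norm)

def unpack_record(record, d):
    """Inverse of pack_record for a known dimension d."""
    n_bytes = (d + 7) // 8
    word = int.from_bytes(record[:n_bytes], "little")
    bits = [(word >> i) & 1 for i in range(d)]
    corr, norm = struct.unpack("<ff", record[n_bytes:])
    return bits, corr, norm

D = 960  # dimensionality of the Gist dataset mentioned above
bits = [1 if (i * 7) % 3 == 0 else 0 for i in range(D)]
record = pack_record(bits, 0.75, 1.25)
restored = unpack_record(record, D)
```

Each record occupies $D/8 + 8$ bytes and is read front to back, so neighboring threads can load neighboring bytes instead of performing scattered table lookups.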

The GPU-native IVF-RaBitQ system integrated into the NVIDIA cuVS Library couples balanced k-means, GPU-native RaBitQ quantization, and a fused cluster-local search kernel. The build pipeline quantizes one coarse cluster at a time; the search kernel combines 1-bit filtering, candidate selection, refined distance evaluation on ex-codes, and in-block top- $o$ 8 in one kernel. The paper reports that, at Recall approximately equal to $o$ 9, IVF-RaBitQ achieves $C_r = \{P x : x \in C\},$ 0 higher QPS than CAGRA and constructs indices $C_r = \{P x : x \in C\},$ 1 faster on average; compared to IVF-PQ, it delivers on average over $C_r = \{P x : x \in C\},$ 2 higher throughput while avoiding accessing the raw vectors for reranking (Shi et al., 27 Feb 2026).

Ascend-RaBitQ is the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search. Its pipeline is explicitly split into cluster probing on CPU, coarse ranking on NPU AI Core plus AI CPU, and fine re-ranking on host CPU. The architecture-native optimizations are fourfold: fused AIC-AIV operators for parallel distance computation, computation-flow restructuring exploiting rotation orthogonality, fine-grained index block-level load balancing that breaks query boundaries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask Top-k latency. On SIFT1B, the ablation reports cumulative speedups of $C_r = \{P x : x \in C\},$ 3 from load balancing, $C_r = \{P x : x \in C\},$ 4 after adding FastScan gather, $C_r = \{P x : x \in C\},$ 5 after AI Core↔AI CPU pipeline parallelism, and $C_r = \{P x : x \in C\},$ 6 after moving re-ranking to CPU. On multi-NPU SIFT1B runs, the coarse distance stage scales near linearly to 8 NPUs at approximately $C_r = \{P x : x \in C\},$ 7, while end-to-end speedup at 8 NPUs is approximately $C_r = \{P x : x \in C\},$ 8, bottlenecked by host CPU re-ranking (He et al., 15 May 2026).

5. Empirical comparisons, methodological disputes, and reproducibility

RaBitQ’s empirical position in the literature is strongest when comparisons are made under symmetric protocols. "Revisiting RaBitQ and TurboQuant" evaluates RaBitQ on DBpedia-Entities for inner-product error and on GloVe-200, OpenAI3-1536, and OpenAI3-3072 for quantization time and $C_r = \{P x : x \in C\},$ 9-NN recall, using an NVIDIA A100 GPU, a dual-socket Intel Xeon Gold 6418H CPU with 48 cores, the C++ RaBitQ implementation from VectorDB-NTU/RaBitQ-Library, and the official TurboQuant PyTorch code (Gao et al., 21 Apr 2026).

Under that setup, the note reports that across $D$ 00, RaBitQ maintains mean approximately $D$ 01, lower standard deviation, and lower max error for all $D$ 02 in the unbiased inner-product mode. For quantization time on 100,000 vectors at 4 bits, RaBitQ on A100 GPU is reported at $D$ 03 for $D$ 04 and $D$ 05 for $D$ 06, versus TurboQuant at $D$ 07 and $D$ 08, respectively, or approximately $D$ 09 slower. For Recall@1@ $D$ 10, averaged over 10 random rotations, RaBitQ is reported as consistently yielding higher recall across all $D$ 11, datasets, and bit-widths, with the largest gaps at $D$ 12 and small $D$ 13 (Gao et al., 21 Apr 2026).

The same note also documents explicit reproducibility issues. In private correspondence, the TurboQuant authors reportedly confirmed that the RaBitQ baseline had been run on a single-core CPU with multithreading disabled using a Python prototype, while TurboQuant itself had been run on an A100 GPU. The note concludes that the reported "6× memory reduction, 8× speedup, zero accuracy loss" claims are not reproducible under a fair comparison, and further states that its own reruns produced quantization times up to two orders of magnitude slower than reported and recall curves outside the published bands (Gao et al., 21 Apr 2026).

A separate misconception addressed by later theory is the idea that RaBitQ is uniformly superior across all distortion notions. "Block-Sphere Vector Quantization" places EDEN, TurboQuant, and RaBitQ in a unified comparison and states that the relative advantages are criterion-dependent rather than absolute: EDEN and TurboQuant are favorable for MSE distortion, EDEN is also effective for expected inner-product distortion, and RaBitQ provides strong high-probability control. This suggests that RaBitQ’s empirical and theoretical appeal is most coherent when the target objective is retrieval fidelity under high-probability guarantees rather than minimum reconstruction MSE alone (Ann et al., 19 May 2026).

6. Extensions beyond classical ANN

RaBitQ has also been extended into long-context LLM inference through RaBitQCache. In that framework, the dominant operation is computing $D$ 14 for a new query against a large cache of keys. The method centers and normalizes queries and keys by prefill centroids, applies a random orthogonal rotation $D$ 15, and quantizes keys to 1-bit sign codes

$D$ 16

For each key, the system stores the binary tensor $D$ 17 and a scalar correction

$D$ 18

The estimator

$D$ 19

is used as a proxy for cosine similarity, and the appendix theorem states

$D$ 20

with error $D$ 21 with high probability (Li et al., 30 Jun 2026).

A system consequence of this unbiased proxy is adaptive Top-$p$ retrieval rather than fixed-budget Top-$k$. RaBitQCache forms proxy softmax masses

$D$ 24

and selects the minimal prefix whose cumulative mass exceeds $D$ 25. The implementation uses asynchronous pipelined prefill, lazy decode-time updates, and an INT4 $D$ 26 1-bit GEMV kernel. Reported results include less than $D$ 27 prefill overhead over full FlashAttention, up to $D$ 28 decode speedup at 30K context, and $D$ 29 end-to-end acceleration on LongBench workloads with no loss in generation quality. On LongBench with LLaMA-8B at $D$ 30, the method visits only $D$ 31 of keys yet retains $D$ 32 of full-attention mass; on GSM8K it reports $D$ 33 versus $D$ 34 for full attention while recalling $D$ 35 of attention mass (Li et al., 30 Jun 2026).
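The selection rule of taking the smallest prefix whose cumulative proxy mass exceeds a threshold can be sketched as follows. This is a minimal illustration under assumptions: the scores stand in for proxy attention logits, any scaling or temperature is omitted, and the function name and threshold value are hypothetical rather than taken from RaBitQCache.

```python
import math

def adaptive_prefix(scores, threshold):
    """Indices of the smallest score-sorted prefix whose softmax mass exceeds threshold."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax weights
    total = sum(weights)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += weights[i] / total
        if mass > threshold:
            break
    return selected, mass

proxy_scores = [4.0, 0.1, 3.5, -1.0, 0.0, 2.0]
selected, mass = adaptive_prefix(proxy_scores, 0.95)
```

Unlike a fixed budget, the number of keys returned adapts to the score distribution: a peaked distribution yields a short prefix, while a flat one yields a longer prefix.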

RaBitQ also appears in broader comparisons of quantizers for embeddings and KV-cache compression. In "Block-Sphere Vector Quantization," RaBitQ is evaluated on embedding distortion, nearest-neighbor recall, and KV-cache quantization for Llama-3.1-8B-Instruct at 3.5 bits effective. The paper reports that on the "Needle-in-a-Haystack" benchmark, RaBitQ scores $D$ 36, compared with EDEN at $D$ 37 and TurboQuant at $D$ 38; on LongBench-E, the averages are $D$ 39 for RaBitQ, $D$ 40 for EDEN, $D$ 41 for TurboQuant, and $D$ 42 for full precision (Ann et al., 19 May 2026).

Taken together, these developments place RaBitQ as a family of rotation-based quantizers with three recurring characteristics: implicit or extremely compact codebooks, unbiased similarity estimation, and a close fit to hardware-friendly integer or bitwise kernels. The family now spans 1-bit ANN search, arbitrary-bit asymptotically optimal extensions, GPU- and NPU-native retrieval systems, and sparse-attention proxies for long-context inference. It should also be distinguished from the unrelated LLM-weight quantization framework "RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs," whose subject is residual binarization of model weights rather than vector search or similarity sketching (You et al., 5 Feb 2026).