1-Bit Quantized JL Transform

Updated 26 March 2026

The paper introduces a 1-bit Quantized JL transform that leverages a Gaussian JL projection followed by sign-bit quantization to create compact binary sketches preserving inner product structure.
It provides unbiased inner-product estimators with controlled variance and distortion, achieving near-optimal performance with minimal storage via a two-stage process.
Practical implementations demonstrate significant KV cache compression and accelerated inference through GPU-optimized batch processing and hybrid quantization strategies.

A 1-bit Quantized Johnson–Lindenstrauss (QJL) transform is a data-oblivious compression technique that creates ultra-compact binary sketches of high-dimensional vectors while preserving geometric structure for inner product estimation. QJL leverages a random Gaussian projection (Johnson–Lindenstrauss transform) followed by sign-bit (1-bit) quantization, allowing unbiased and low-variance estimation of inner products and $\ell_2$ norms. It achieves near-optimal distortion at minimal storage and computational cost, with particular efficacy in compressing neural network Key-Value (KV) caches and enabling efficient large-scale search and inference (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025).

1. Mathematical Foundations and Construction

QJL comprises a two-stage process. First, a Johnson–Lindenstrauss (JL) projection reduces a $d$-dimensional real vector $x \in \mathbb{R}^d$ to a sketch of dimension $m$ via a random matrix $S \in \mathbb{R}^{m \times d}$ with i.i.d. Gaussian entries $S_{ij} \sim \mathcal{N}(0,1)$. Second, each entry of the projected vector $z = Sx$ is quantized to its sign, giving $\tilde x = \mathrm{sign}(z) \in \{-1, +1\}^m$, and the norm $\nu(x) = \|x\|_2$ is stored separately in reduced precision.

The core sketch is thus:

$\tilde x = H_S(x) = \mathrm{sign}(Sx)$,
Storage per vector: $m$ bits for signs, plus a small fixed number of bits (e.g., 16) for the norm $\|x\|_2$.

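For downstream usage where inner products with a query $q \in \mathbb{R}^d$ must be estimated, QJL provides an asymmetric estimator, in which only the stored vector is quantized and the query is projected but kept in full precision:

$\widehat{\langle q, x \rangle} = \frac{\sqrt{\pi/2}}{m} \, \|x\|_2 \, \langle Sq, \mathrm{sign}(Sx) \rangle$

where $x$ is a compressed vector and $S$ is the (shared, fixed) projection matrix (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025).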
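The two stages and the asymmetric estimate can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' implementation: the function names (`gaussian_matrix`, `qjl_sketch`, `qjl_inner_product`), the dimensions, and the seeds are assumptions chosen for the example, and the norm is kept as a full-precision float instead of being stored in reduced precision.

```python
import math
import random

def gaussian_matrix(m, d, seed=0):
    """Draw an m x d matrix with i.i.d. N(0, 1) entries from a seeded PRNG."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def project(S, v):
    """Compute the matrix-vector product S v."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]

def qjl_sketch(S, x):
    """Return the sign bits of S x (as +1/-1) and the l2 norm of x."""
    signs = [1 if z >= 0 else -1 for z in project(S, x)]
    norm = math.sqrt(sum(x_j * x_j for x_j in x))
    return signs, norm

def qjl_inner_product(S, q, signs, norm):
    """Asymmetric estimate of <q, x>: the query is projected but not quantized."""
    m = len(S)
    Sq = project(S, q)
    return math.sqrt(math.pi / 2) / m * norm * sum(u * s for u, s in zip(Sq, signs))

# Illustrative sizes (assumptions, not values from the source).
d, m = 16, 4096
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
S = gaussian_matrix(m, d, seed=42)

signs, norm = qjl_sketch(S, x)
exact = sum(a * b for a, b in zip(q, x))
approx = qjl_inner_product(S, q, signs, norm)
print(exact, approx)
```

The printed values should be close but not identical; the gap shrinks as `m` grows.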

2. Theoretical Guarantees: Unbiasedness and Distortion

QJL provides unbiased estimators for bilinear forms:

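For any $q, x \in \mathbb{R}^d$, $\mathbb{E}\big[\widehat{\langle q, x \rangle}\big] = \langle q, x \rangle$, where $\widehat{\langle q, x \rangle}$ denotes the QJL estimate.
The distortion bound satisfies (with high probability)

$\big|\widehat{\langle q, x \rangle} - \langle q, x \rangle\big| \le \varepsilon \, \|q\|_2 \, \|x\|_2$

for $m = O(\varepsilon^{-2} \log n)$, where $n$ is the total number of stored keys.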

Variance is controlled by both projection dimension and vector norm:

For self-reconstruction (e.g., in TurboQuant's second stage), the variance of the estimated inner product is at most $\frac{\pi}{2m} \|y\|_2^2$ for each $y \in \mathbb{R}^d$ when $\|x\|_2 = 1$ (Zandieh et al., 28 Apr 2025).
This property extends to all queries by a union bound; relative error scales as $O(1/\sqrt{m})$, up to logarithmic factors.

The unbiasedness arises from the rotational invariance of Gaussian projections, with the expectation of sign-product coinciding with the true inner-product up to a known constant.
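The known constant can be recovered with a short calculation. The sketch below assumes the estimator has the form $\frac{\sqrt{\pi/2}}{m} \|x\|_2 \langle Sq, \mathrm{sign}(Sx) \rangle$ and writes $s \sim \mathcal{N}(0, I_d)$ for a single row of $S$; the decomposition $q = \alpha x + q_\perp$ is notation introduced here for illustration.

$$
\begin{aligned}
q &= \alpha x + q_\perp, \qquad \alpha = \frac{\langle q, x \rangle}{\|x\|_2^2}, \qquad \langle q_\perp, x \rangle = 0, \\
\mathbb{E}\big[\langle s, q \rangle \, \mathrm{sign}(\langle s, x \rangle)\big]
  &= \alpha \, \mathbb{E}\big[|\langle s, x \rangle|\big]
   + \mathbb{E}\big[\langle s, q_\perp \rangle\big] \, \mathbb{E}\big[\mathrm{sign}(\langle s, x \rangle)\big] \\
  &= \alpha \, \|x\|_2 \sqrt{2/\pi} + 0
   = \sqrt{2/\pi} \, \frac{\langle q, x \rangle}{\|x\|_2}.
\end{aligned}
$$

The second line uses two facts: $\langle s, x \rangle$ and $\langle s, q_\perp \rangle$ are jointly Gaussian and uncorrelated, hence independent, and $\mathbb{E}|g| = \sigma \sqrt{2/\pi}$ for $g \sim \mathcal{N}(0, \sigma^2)$. Averaging over the $m$ rows and multiplying by $\sqrt{\pi/2} \, \|x\|_2$ then gives exactly $\langle q, x \rangle$.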

3. Algorithmic Workflow and Implementation

The construction and application of the QJL transform can be summarized as follows:

Stage	Operation	Computational Cost
Sketch Construction	$z = Sx$, $\tilde x = \mathrm{sign}(z)$, $\nu = \|x\|_2$	$O(md)$ mul-add, $O(m)$ sign extraction
Query Processing (Inner Product)	$u = Sq$; $\frac{\sqrt{\pi/2}}{m} \, \nu \, \langle u, \tilde x \rangle$	$O(md)$ for $Sq$, $O(m)$ per key

CUDA implementations use memory packing for sign vectors and parallel reductions for batch processing. Shared memory caches reused operands (such as the projected query $Sq$) so that their cost is amortized across keys, resulting in high throughput: the projection is paid once per query, while each key adds only a small estimation cost (Zandieh et al., 2024).
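The memory-packing idea can be illustrated on the CPU. The following is a plain-Python sketch, not the CUDA kernel from the paper: the helper names `pack_signs` and `packed_dot`, the bit order (least significant bit first), and the toy inputs are all assumptions made for the example.

```python
def pack_signs(signs):
    """Pack +1/-1 values into bytes, 8 signs per byte (a set bit means +1)."""
    out = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def packed_dot(Sq, packed):
    """Compute <Sq, signs> directly from the packed bits, without unpacking."""
    total = 0.0
    for i, u in enumerate(Sq):
        bit = (packed[i // 8] >> (i % 8)) & 1
        total += u if bit else -u
    return total

# Toy sign vector and projected query (illustrative values only).
signs = [1, -1, -1, 1, 1, 1, -1, 1, -1, 1]
Sq = [0.5, 1.0, -2.0, 0.25, 3.0, -1.0, 0.0, 1.5, 2.0, -0.5]

packed = pack_signs(signs)
reference = sum(u * s for u, s in zip(Sq, signs))
print(len(packed), packed_dot(Sq, packed), reference)
```

Ten signs fit in two bytes here. A GPU kernel would typically process many packed words per thread and combine partial sums with a parallel reduction.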

Integration into inference pipelines typically involves:

Fix $S$ (with PRNG seed for reproducibility).
For each new vector $x$, store $(\mathrm{sign}(Sx), \|x\|_2)$.
At query time, compute $Sq$ once and reuse across all comparisons.
Proceed with inner product estimation, softmax, and value-weighted aggregation (e.g., in transformer attention) (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025).
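The steps above can be sketched end to end for a single attention query. This is a minimal illustration under stated assumptions: the sizes, seeds, and the `1/sqrt(d)` score scaling are choices made for the example, the values are left unquantized, and none of the names correspond to an API from the papers.

```python
import math
import random

def matvec(S, v):
    """Compute the matrix-vector product S v."""
    return [sum(a * b for a, b in zip(row, v)) for row in S]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d, m, n = 8, 512, 5  # illustrative sizes (assumptions)
rng = random.Random(0)

# Step 1: fix S from a seeded PRNG.
S = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

keys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Step 2: for each key, store only its sign bits and its norm.
cache = []
for k in keys:
    signs = [1 if z >= 0 else -1 for z in matvec(S, k)]
    cache.append((signs, math.sqrt(sum(c * c for c in k))))

# Step 3: project the query once and reuse it for every key.
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
Sq = matvec(S, q)

# Step 4: estimate scores, apply softmax, aggregate the values.
scale = math.sqrt(math.pi / 2) / m
scores = [
    scale * norm * sum(u * s for u, s in zip(Sq, signs)) / math.sqrt(d)
    for signs, norm in cache
]
weights = softmax(scores)
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
print(weights)
```

The $O(md)$ projection happens once in step 3; each additional key then costs only an $O(m)$ signed sum.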

4. Applications in Neural Network Quantization

QJL's primary application is the quantization of key/value caches in LLMs and attention-based networks:

For KV cache compression, QJL enables more than $5\times$ reduction in memory (3 bits vs 16 bits) while maintaining or improving accuracy on long-context benchmarks (Zandieh et al., 2024).
On datasets such as LongBench and models including longchat-7B, Llama-2, and Llama-3, the QJL sketch preserves F1 and accuracy within 0.1% of full-precision baseline at 3-bit quantization.
QJL accelerates token generation over exact computation for long sequences, with a speedup on the order of $2\times$, while traditional methods (e.g., KVQuant) may be slower due to added memory and access overhead (Zandieh et al., 2024).

TurboQuant incorporates QJL as a second-stage residual quantizer, following an MSE-optimal quantization step. MSE-optimal quantizers introduce bias in inner product estimation; the 1-bit QJL stage corrects for this by quantizing the residual to produce an unbiased estimator with provably bounded variance. The overall distortion rate matches the best achievable up to a small constant (a factor of approximately $2.7$), with distortion that decays exponentially in the bit budget (Zandieh et al., 28 Apr 2025).
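The residual-correction idea can be sketched as follows. Note the substitution: the first stage here is a plain uniform rounding grid (`coarse_quantize`, a hypothetical stand-in), not TurboQuant's MSE-optimal quantizer, so the example illustrates only how a 1-bit QJL estimate of the residual is added back to the coarse inner product.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def matvec(S, v):
    return [dot(row, v) for row in S]

def coarse_quantize(x, step=0.5):
    """Stand-in first stage (assumption): round each coordinate to a uniform grid."""
    return [step * round(c / step) for c in x]

def qjl_estimate(S, q, r):
    """1-bit QJL estimate of <q, r> from sign(S r) and the norm of r."""
    m = len(S)
    norm = math.sqrt(dot(r, r))
    signs = [1 if z >= 0 else -1 for z in matvec(S, r)]
    return math.sqrt(math.pi / 2) / m * norm * dot(matvec(S, q), signs)

d, m = 16, 2048  # illustrative sizes (assumptions)
rng = random.Random(3)
S = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]

x_hat = coarse_quantize(x)                    # stage 1: coarse reconstruction
residual = [a - b for a, b in zip(x, x_hat)]  # what stage 1 missed

exact = dot(q, x)
stage1_only = dot(q, x_hat)
two_stage = stage1_only + qjl_estimate(S, q, residual)  # stage 2 correction
print(exact, stage1_only, two_stage)
```

Because the QJL term is unbiased for $\langle q, x - \hat{x} \rangle$, the two-stage estimate is unbiased for $\langle q, x \rangle$ over the draw of $S$, although a single draw need not beat the stage-one estimate.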

5. Comparative Memory and Computational Efficiency

QJL eliminates the per-block memory overheads typical in traditional blockwise or per-channel quantization schemes, which require storing scales and zero-points. Depending on the block size, this overhead adds roughly 1 to 2 bits per quantized entry in classic quantization, whereas QJL stores only a single norm per vector, so its per-entry overhead vanishes as the sketch dimension $m$ grows. This directly contributes to the observed reduction of more than $5\times$ in KV cache memory footprint on practical workloads (Zandieh et al., 2024).
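A quick back-of-the-envelope calculation makes the overhead comparison concrete. The figures below are assumptions for illustration: a 16-bit scale and a 16-bit zero-point per block for the classic scheme, one 16-bit norm per vector for QJL, and block sizes and sketch dimensions picked arbitrarily.

```python
def blockwise_overhead_bits(block_size, scale_bits=16, zero_point_bits=16):
    """Extra bits per quantized entry when each block stores a scale and a zero-point."""
    return (scale_bits + zero_point_bits) / block_size

def qjl_overhead_bits(m, norm_bits=16):
    """Extra bits per stored sign bit when each vector stores a single norm."""
    return norm_bits / m

for block_size in (16, 32, 128):
    print("blockwise", block_size, blockwise_overhead_bits(block_size))

for m in (128, 256, 1024):
    print("qjl", m, qjl_overhead_bits(m))
```

Under these assumptions, blocks of 16 or 32 entries cost 2 or 1 extra bits per entry, while the QJL norm costs a fraction of a bit per sign that shrinks as $m$ grows.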

Computation is equally efficient: when many vectors are stored, QJL decouples query and storage complexity, concentrating most of the cost in a single JL projection per query. This makes it scalable for large inference batches or retrieval tasks.

6. Practical Considerations and Integration

Random Seed Management: For consistent reconstruction, $S$ must be fixed and shared, typically by storing a PRNG seed rather than $S$ itself.
Projection Dimension $m$: Chosen according to target error $\varepsilon$, largest number of stored vectors $n$, and application-specific tolerance. The chosen $m$ balances accuracy against memory.
CUDA Implementation: Bit-packing and shared-memory dot products enable GPU-parallel batch processing with low overhead.
Hybrid Quantization: For layers or channels with large dynamic ranges or outlier norms, it is practical to quantize only the top-$k$ channels with higher precision (e.g., 6 bits), while applying QJL to the remainder (Zandieh et al., 2024).
Pipeline Position: Within TurboQuant, QJL is used only for the residual of the initial MSE-optimal quantization, yielding unbiasedness in the final estimated inner product (Zandieh et al., 28 Apr 2025).
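Seed-based sharing of the projection matrix can be illustrated with Python's standard `random` module. This is a sketch under assumptions: `projection_from_seed` is a hypothetical helper, and the seed and sizes are arbitrary. A production system would need a generator whose output is stable across library versions and devices, which this example does not guarantee.

```python
import random

def projection_from_seed(seed, m, d):
    """Regenerate the m x d Gaussian matrix deterministically from a seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

# The writer (sketch construction) and the reader (query time) store only the seed.
S_writer = projection_from_seed(1234, 64, 8)
S_reader = projection_from_seed(1234, 64, 8)

# A different seed gives an unrelated matrix, so sketches would not be comparable.
S_other = projection_from_seed(1235, 64, 8)

print(S_writer == S_reader, S_writer == S_other)
```

Storing the seed costs a few bytes, whereas storing $S$ itself costs $m \times d$ floats.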

7. Empirical Results and Benchmarks

Experimental analysis confirms the theoretical guarantees:

Unbiasedness: Inner-product estimate histograms (TurboQuant) display zero bias across all bit widths. The variance closely matches the theoretically predicted scaling.
Quality Preservation: Benchmarks (NarrativeQA, Qasper, 2WikiMultiQA) demonstrate no loss—and occasionally improvement—in F1 compared to full-precision baseline at 3-bit quantization (Zandieh et al., 2024).
Speedup: Prompt encoding and quantization incur only on the order of 5% overhead compared to the no-quantization baseline, while overall inference for long contexts is faster than exact computation due to lower memory traffic and simplified arithmetic.
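The unbiasedness and variance behaviour can also be checked with a small simulation. The sketch below is a toy Monte Carlo experiment, not a reproduction of the papers' benchmarks: the dimensions, trial count, and seed are assumptions, and `qjl_estimate` draws a fresh projection for every trial. Since the estimate averages $m$ independent terms, its variance should shrink roughly in proportion to $1/m$.

```python
import math
import random
import statistics

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def qjl_estimate(rng, q, x, m):
    """One QJL estimate of <q, x> using m freshly drawn Gaussian rows."""
    norm_x = math.sqrt(dot(x, x))
    acc = 0.0
    for _ in range(m):
        row = [rng.gauss(0.0, 1.0) for _ in x]
        zq = dot(row, q)
        acc += zq if dot(row, x) >= 0 else -zq
    return math.sqrt(math.pi / 2) / m * norm_x * acc

rng = random.Random(7)
d, trials = 8, 400  # illustrative sizes (assumptions)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
exact = dot(q, x)

small = [qjl_estimate(rng, q, x, 32) for _ in range(trials)]
large = [qjl_estimate(rng, q, x, 128) for _ in range(trials)]

mean_large = statistics.mean(large)
ratio = statistics.variance(small) / statistics.variance(large)
print(exact, mean_large, ratio)
```

The empirical mean should sit close to the exact inner product, and quadrupling $m$ should cut the variance by a factor near four.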

8. Summary Table: Key Properties of QJL

Property	Description	Reference
Sketch type	1-bit sign of random Gaussian projection, with stored $\ell_2$ norm	(Zandieh et al., 2024)
Estimation property	Unbiased inner-product, distortion $\varepsilon$ with $m = O(\varepsilon^{-2} \log n)$	(Zandieh et al., 2024, Zandieh et al., 28 Apr 2025)
Memory overhead	None (vanishing per-entry as $m$ grows)	(Zandieh et al., 2024)
Dual-stage use (TurboQuant)	1-bit QJL on residual after $(b-1)$-bit MSE quantizer; unbiased, variance-bounded	(Zandieh et al., 28 Apr 2025)

In summary, the 1-bit Quantized JL transform offers a mathematically principled, hardware-friendly strategy for compressing vectors while preserving critical linear measurements, eliminating memory overheads of traditional quantization, and enabling accurate and efficient inference at scale (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025).

References (2)

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead (2024)

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (2025)
