1-Bit Quantized JL Transform

Updated 26 March 2026
  • The paper introduces a 1-bit Quantized JL transform that leverages a Gaussian JL projection followed by sign-bit quantization to create compact binary sketches preserving inner product structure.
  • It provides unbiased inner-product estimators with controlled variance and distortion, achieving near-optimal performance with minimal storage via a two-stage process.
  • Practical implementations demonstrate significant KV cache compression and accelerated inference through GPU-optimized batch processing and hybrid quantization strategies.

A 1-bit Quantized Johnson–Lindenstrauss (QJL) transform is a data-oblivious compression technique that creates ultra-compact binary sketches of high-dimensional vectors while preserving geometric structure for inner product estimation. QJL leverages a random Gaussian projection (Johnson–Lindenstrauss transform) followed by sign-bit (1-bit) quantization, allowing unbiased and low-variance estimation of inner products and $\ell_2$ norms. It achieves near-optimal distortion at minimal storage and computational cost, with particular efficacy in compressing neural network Key-Value (KV) caches and enabling efficient large-scale search and inference (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025).

1. Mathematical Foundations and Construction

QJL comprises a two-stage process. First, a Johnson–Lindenstrauss (JL) projection reduces a $d$-dimensional real vector $x \in \mathbb{R}^d$ to a sketch of dimension $m$ via a random matrix $S \in \mathbb{R}^{m \times d}$ with i.i.d. Gaussian entries $S_{ij} \sim \mathcal{N}(0,1)$. Second, each entry of the projected vector $z = Sx$ is quantized to its sign, $\tilde x = \mathrm{sign}(z) \in \{-1, +1\}^m$, and the norm $\nu(x) = \|x\|_2$ is stored separately in reduced precision.

The core sketch is thus:

  • $\tilde x = H_S(x) = \mathrm{sign}(Sx)$,
  • Storage per vector: $m$ bits for signs, plus $b$ bits (e.g. 16) for $\|x\|_2$.

For downstream usage where inner products with a query $q \in \mathbb{R}^d$ must be estimated, QJL provides an asymmetric estimator:

$$\widehat{\langle q, k \rangle} = \frac{\sqrt{\pi/2}}{m} \cdot \nu(k)\, \langle S q, H_S(k) \rangle,$$

where $k$ is a compressed vector and $S$ is the (shared, fixed) projection matrix (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025). The estimator is asymmetric because the query side $Sq$ is kept in full precision while only the key side is reduced to sign bits.
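The sketch and the asymmetric estimator above can be illustrated with a minimal NumPy sketch. The function names (`make_projection`, `qjl_sketch`, `qjl_inner_product`), the seeds, and the sizes `d = 64`, `m = 4096` are assumptions chosen for illustration; they do not come from the cited papers.

```python
import numpy as np

def make_projection(m, d, seed=0):
    """Draw the shared m x d Gaussian matrix S from a fixed seed (hypothetical helper)."""
    return np.random.default_rng(seed).standard_normal((m, d))

def qjl_sketch(S, x):
    """Return the sign vector in {-1, +1}^m and the l2 norm stored in float16."""
    z = S @ x
    signs = np.where(z >= 0, 1, -1).astype(np.int8)
    return signs, np.float16(np.linalg.norm(x))

def qjl_inner_product(S, q, signs, norm):
    """Asymmetric estimate: sqrt(pi/2)/m * nu(k) * <S q, sign(S k)>."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) / m * float(norm) * float((S @ q) @ signs)

# Illustrative sizes only; a large m keeps the single-draw error small.
d, m = 64, 4096
rng = np.random.default_rng(1)
q = rng.standard_normal(d)
k = rng.standard_normal(d)

S = make_projection(m, d)
signs, norm = qjl_sketch(S, k)          # what would be stored for the key
estimate = qjl_inner_product(S, q, signs, norm)
exact = float(q @ k)
print(f"exact={exact:.3f}  estimate={estimate:.3f}")
```

Only the key is quantized here; the query is projected in full precision, which is the asymmetry in the estimator.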

2. Theoretical Guarantees: Unbiasedness and Distortion

QJL provides unbiased estimators for bilinear forms:

  • For any $q, k \in \mathbb{R}^d$, $\mathbb{E}_S[\widehat{\langle q, k \rangle}] = \langle q, k \rangle$.
  • The distortion bound satisfies (with high probability)

$$|\widehat{\langle q, k \rangle} - \langle q, k \rangle| \le \varepsilon\,\|q\|_2\,\|k\|_2,$$

for $m = O(\varepsilon^{-2} \log n)$, where $n$ is the total number of stored keys.

Variance is controlled by both projection dimension and vector norm:

  • For self-reconstruction (e.g., in TurboQuant's second stage), $\operatorname{Var}[\langle y, \hat r \rangle] \le (\pi/(2d))\, \|y\|_2^2\, \|r\|_2^2$ for each $y \in \mathbb{R}^d$ when $d = m$ (Zandieh et al., 28 Apr 2025).
  • This property extends to all queries $q$ by a union bound; the error relative to $\|q\|_2 \|k\|_2$ scales as $O(1/\sqrt{m})$.
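A small Monte Carlo experiment can make the unbiasedness and variance statements concrete. The sketch below redraws $S$ independently in each trial and compares the empirical mean and variance of the estimator with the exact inner product and with the bound $(\pi/(2m))\,\|q\|_2^2\,\|k\|_2^2$. The sizes, seed, and trial count are arbitrary assumptions, and the norm is kept in full precision for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 32, 128, 1000            # illustrative sizes, not from the papers
q = rng.standard_normal(d)
k = rng.standard_normal(d)
exact = q @ k
norm_k = np.linalg.norm(k)

# One independent Gaussian projection per trial: shape (trials, m, d).
S = rng.standard_normal((trials, m, d))
signs = np.sign(S @ k)                  # (trials, m), entries in {-1, +1}
proj_q = S @ q                          # (trials, m), full-precision query side
estimates = np.sqrt(np.pi / 2) / m * norm_k * np.sum(proj_q * signs, axis=1)

empirical_mean = estimates.mean()
empirical_var = estimates.var()
variance_bound = np.pi / (2 * m) * np.linalg.norm(q) ** 2 * norm_k ** 2

print(f"exact inner product : {exact:.4f}")
print(f"mean of estimates   : {empirical_mean:.4f}")
print(f"empirical variance  : {empirical_var:.4f}")
print(f"variance bound      : {variance_bound:.4f}")
```

Up to sampling noise, the mean should sit near the exact value and the empirical variance should not exceed the bound.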

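The unbiasedness arises from the rotational invariance of Gaussian projections: for a single Gaussian row $s$ of $S$, $\mathbb{E}[\langle s, q \rangle\, \mathrm{sign}(\langle s, k \rangle)] = \sqrt{2/\pi}\, \langle q, k \rangle / \|k\|_2$, so rescaling by $\sqrt{\pi/2}\, \nu(k)$ recovers the true inner product in expectation.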
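The constant can be checked with a short standard calculation, sketched here under the assumption that $s \sim \mathcal{N}(0, I_d)$ is one row of $S$ and that $\nu(k) = \|k\|_2$ is stored exactly. Split $q$ into its component along $k$ and an orthogonal remainder $q_\perp$ (notation introduced here for the derivation). Since $\langle s, q_\perp \rangle$ and $\langle s, k \rangle$ are jointly Gaussian and uncorrelated, they are independent, and $\mathbb{E}\,|\langle s, k \rangle| = \sqrt{2/\pi}\,\|k\|_2$.

```latex
\begin{align*}
q &= \frac{\langle q, k\rangle}{\|k\|_2^2}\,k + q_\perp,
   \qquad \langle q_\perp, k\rangle = 0, \\
\mathbb{E}\big[\langle s, q\rangle\,\mathrm{sign}(\langle s, k\rangle)\big]
  &= \frac{\langle q, k\rangle}{\|k\|_2^2}\,\mathbb{E}\,\big|\langle s, k\rangle\big|
   + \mathbb{E}\big[\langle s, q_\perp\rangle\big]\,
     \mathbb{E}\big[\mathrm{sign}(\langle s, k\rangle)\big] \\
  &= \frac{\langle q, k\rangle}{\|k\|_2^2}\cdot\sqrt{\tfrac{2}{\pi}}\,\|k\|_2 + 0
   = \sqrt{\tfrac{2}{\pi}}\,\frac{\langle q, k\rangle}{\|k\|_2}.
\end{align*}
```

Averaging over the $m$ rows and multiplying by $\sqrt{\pi/2}\,\nu(k)$ then gives $\mathbb{E}_S[\widehat{\langle q, k \rangle}] = \langle q, k \rangle$.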

3. Algorithmic Workflow and Implementation

The construction and application of the QJL transform can be summarized as follows:

| Stage | Operation | Computational Cost |
|---|---|---|
| Sketch Construction | $z = Sx$, $\tilde x = \mathrm{sign}(z)$, $\nu(x)$ | $O(dm)$ multiply-adds, $O(m)$ sign extraction |
| Query Processing (Inner Product) | $u = Sq$; $\widehat{\langle q, k \rangle} = (\sqrt{\pi/2}/m)\, \nu(k)\, \langle u, \tilde k \rangle$ | $O(dm)$ for $u$, $O(m)$ per key |

CUDA implementations use memory packing for sign vectors and parallel reductions for batch processing. Shared memory caches $Sq$ for amortized cost, resulting in high throughput: e.g., for $d = 4096$, $m = 256$, projection latency is $\sim 0.2$ ms and per-key estimation is $\sim 0.1\,\mu$s/key (Zandieh et al., 2024).
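The memory-packing idea can be illustrated on the host side with NumPy. This is not the CUDA kernel from the cited work; it is a sketch that uses `np.packbits` and `np.unpackbits` to show how $m$ sign bits fit into $m/8$ bytes per key and how a single cached $u = Sq$ is reused across all keys. The helper names and sizes are assumptions.

```python
import numpy as np

def pack_signs(z):
    """Pack sign(z) along the last axis: bit 1 means z >= 0, bit 0 means z < 0."""
    return np.packbits(z >= 0, axis=-1)

def unpack_signs(packed, m):
    """Recover a {-1, +1} array with m entries per row from packed bytes."""
    bits = np.unpackbits(packed, axis=-1, count=m)
    return bits.astype(np.int8) * 2 - 1

rng = np.random.default_rng(0)
n, d, m = 1000, 128, 256                # illustrative sizes
S = rng.standard_normal((m, d))
keys = rng.standard_normal((n, d))

packed = pack_signs(keys @ S.T)         # (n, m // 8) uint8: 32 bytes per key here
norms = np.linalg.norm(keys, axis=1).astype(np.float16)

q = rng.standard_normal(d)
u = S @ q                               # projected once per query, reused for every key
sign_dots = unpack_signs(packed, m) @ u
scores = np.sqrt(np.pi / 2) / m * norms.astype(np.float64) * sign_dots

print(packed.shape, packed.dtype, scores.shape)
```

A GPU kernel would typically avoid the explicit unpacking step and operate on the packed words directly; the unpacking here only keeps the example short.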

Integration into inference pipelines typically involves:

  1. Fix $S$ (with PRNG seed for reproducibility).
  2. For each new vector $x$, store $(\tilde x, \nu(x))$.
  3. At query time, compute $Sq$ once and reuse across all comparisons.
  4. Proceed with inner product estimation, softmax, and value-weighted aggregation (e.g., in transformer attention) (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025).
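The four steps above can be sketched as a toy single-head key cache. The class name `QJLKeyCache`, the $1/\sqrt{d}$ score scaling, and the choice to keep values in full precision are simplifying assumptions for illustration, not details taken from the cited implementations.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

class QJLKeyCache:
    """Toy cache: keys stored as sign bits plus a float16 norm (hypothetical class)."""

    def __init__(self, d, m, seed=0):
        # Step 1: fix S from a PRNG seed so it can be regenerated later.
        self.S = np.random.default_rng(seed).standard_normal((m, d))
        self.m = m
        self.signs, self.norms, self.values = [], [], []

    def append(self, key, value):
        # Step 2: store (sign(S k), ||k||_2) for each new key.
        self.signs.append(np.where(self.S @ key >= 0, 1, -1).astype(np.int8))
        self.norms.append(np.float16(np.linalg.norm(key)))
        self.values.append(value)       # values kept in full precision here (assumption)

    def attend(self, q):
        u = self.S @ q                  # Step 3: project the query once
        signs = np.stack(self.signs)    # (n, m)
        norms = np.asarray(self.norms, dtype=np.float64)
        scores = np.sqrt(np.pi / 2) / self.m * norms * (signs @ u)
        # Step 4: softmax over estimated scores, then value-weighted aggregation.
        weights = softmax(scores / np.sqrt(q.shape[0]))
        return weights @ np.stack(self.values)

rng = np.random.default_rng(1)
d, d_v, m, n = 64, 16, 256, 50          # illustrative sizes
cache = QJLKeyCache(d, m, seed=42)
for _ in range(n):
    cache.append(rng.standard_normal(d), rng.standard_normal(d_v))
out = cache.attend(rng.standard_normal(d))
print(out.shape)
```

Because the cache is built from the seed alone, any process that knows `seed`, `m`, and `d` can reconstruct the same $S$ without storing the matrix.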

4. Applications in Neural Network Quantization

QJL's primary application is the quantization of key/value caches in LLMs and attention-based networks:

  • For KV cache compression, QJL enables more than $5\times$ reduction in memory (3 bits vs 16 bits) while maintaining or improving accuracy on long-context benchmarks (Zandieh et al., 2024).
  • On datasets such as LongBench and models including longchat-7B, Llama-2, and Llama-3, the QJL sketch preserves F1 and accuracy within 0.1% of full-precision baseline at 3-bit quantization.
  • QJL enables $2\times$ acceleration of token generation over exact computation for long sequences, while traditional methods (e.g., KVQuant) may be $2\times$ slower due to added memory and access overhead (Zandieh et al., 2024).

TurboQuant incorporates QJL as a second-stage residual quantizer, following an MSE-optimal quantization step. MSE-optimal quantizers introduce bias in inner product estimation; the 1-bit QJL stage corrects for this by quantizing the residual to produce an unbiased estimator with provably bounded variance. The overall distortion rate matches the best achievable up to a small constant (factor $\approx 2.7$), achieving $O(1/4^b)$ convergence in the bit-width $b$ (bits per coordinate) (Zandieh et al., 28 Apr 2025).
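The residual-correction idea can be sketched in a few lines. The first stage below is a plain uniform scalar quantizer used as a stand-in; it is not TurboQuant's MSE-optimal quantizer, and the function names are hypothetical. The point is only that adding a QJL estimate of $\langle q, r \rangle$ for the residual $r = x - \hat x$ makes the combined estimate unbiased over the draw of $S$, whatever the first stage is.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Stand-in first stage: uniform scalar quantizer on [min(x), max(x)].
    This is NOT TurboQuant's quantizer, only a simple placeholder."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    codes = np.round((x - lo) / (hi - lo) * levels)
    return lo + codes * (hi - lo) / levels

def two_stage_estimate(S, q, x, bits):
    """(bits - 1)-bit first stage plus a 1-bit QJL sketch of the residual."""
    x_hat = uniform_quantize(x, bits - 1)
    r = x - x_hat                                   # residual left by the first stage
    m = S.shape[0]
    residual_term = (np.sqrt(np.pi / 2) / m * np.linalg.norm(r)
                     * ((S @ q) @ np.sign(S @ r)))
    return q @ x_hat + residual_term

rng = np.random.default_rng(0)
d = 64                                              # m = d, as in the variance bound above
q, x = rng.standard_normal(d), rng.standard_normal(d)
trials = 2000
estimates = np.array([
    two_stage_estimate(rng.standard_normal((d, d)), q, x, bits=3)
    for _ in range(trials)
])
exact = q @ x
print(f"exact={exact:.4f}  mean estimate={estimates.mean():.4f}")
```

The variance of the correction term scales with $\|r\|_2^2$, so a better first stage directly reduces the noise of the final estimate.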

5. Comparative Memory and Computational Efficiency

QJL eliminates the per-block memory overheads typical in traditional blockwise or per-channel quantization schemes, which require storing scales and zero-points. For typical block sizes (e.g., $B = 32$), per-entry overhead in classic quantization is $\sim 0.5$ bits for $b_f = 8$, whereas QJL's per-entry overhead is $b_f/m$, vanishing as $m$ grows with $n$ ($m = O(\log n/\varepsilon^2)$). This directly contributes to the observed $5.3\times$ reduction in KV cache memory footprint on practical workloads (Zandieh et al., 2024).
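The overhead figures above follow from simple arithmetic, reproduced here as a sketch. It assumes two stored constants per block (one scale and one zero-point, each of $b_f$ bits) for the blockwise scheme and one stored norm of $b_f$ bits per QJL sketch; the value $m = 256$ is borrowed from the throughput example earlier and the function names are hypothetical.

```python
def blockwise_overhead_bits(block_size, bits_per_constant, constants_per_block=2):
    """Per-entry overhead of storing a scale and a zero-point for every block."""
    return constants_per_block * bits_per_constant / block_size

def qjl_overhead_bits(norm_bits, m):
    """Per-entry overhead of storing one norm alongside m sign bits."""
    return norm_bits / m

blockwise = blockwise_overhead_bits(block_size=32, bits_per_constant=8)  # 0.5
qjl = qjl_overhead_bits(norm_bits=8, m=256)                              # 0.03125
compression_ratio = 16 / 3               # 16-bit baseline vs 3 bits, about 5.33

print(blockwise, qjl, round(compression_ratio, 2))
```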

Computation is equally efficient: for $m \ll d$, QJL decouples query and storage complexity, focusing most cost on a single JL projection per query, making it scalable for large inference batches or retrieval tasks.

6. Practical Considerations and Integration

  • Random Seed Management: For consistent reconstruction, $S$ must be fixed and shared, typically by storing a PRNG seed rather than $S$ itself.
  • Projection Dimension $m$: Chosen according to target error $\varepsilon$, largest number of stored vectors $n$, and application-specific tolerance. $m \approx c\, \varepsilon^{-2} \log n$ balances accuracy and memory.
  • CUDA Implementation: Bit-packing and shared-memory dot products enable GPU-parallel batch processing with low overhead.
  • Hybrid Quantization: For layers or channels with large dynamic ranges or outlier norms, it is practical to quantize only the top-$r$ channels with higher precision (e.g., 6 bits), while applying QJL to the remainder (Zandieh et al., 2024).
  • Pipeline Position: Within TurboQuant, QJL is used only for the residual of the initial MSE-optimal quantization, yielding unbiasedness in the final estimated inner product (Zandieh et al., 28 Apr 2025).
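The seed-management and projection-dimension points can be sketched as two small helpers. The constant `c = 1.0` is a placeholder, since the source does not give a value, and rounding $m$ up to a multiple of 8 is an added convenience for byte-aligned bit-packing; both are assumptions, as are the helper names.

```python
import math
import numpy as np

def projection_dim(eps, n, c=1.0, multiple=8):
    """m ~ c * eps^-2 * log(n), rounded up to a multiple of `multiple`.
    c is an unspecified constant in the source; 1.0 is only a placeholder."""
    m = c * math.log(n) / eps ** 2
    return int(math.ceil(m / multiple) * multiple)

def projection_from_seed(seed, m, d):
    """Regenerate S deterministically, so only the seed has to be stored."""
    return np.random.default_rng(seed).standard_normal((m, d))

m = projection_dim(eps=0.1, n=10_000)
S_writer = projection_from_seed(seed=1234, m=m, d=128)   # used when building sketches
S_reader = projection_from_seed(seed=1234, m=m, d=128)   # regenerated at query time

print(m, np.array_equal(S_writer, S_reader))
```

In practice the constant would be tuned empirically against the accuracy tolerance of the application.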

7. Empirical Results and Benchmarks

Experimental analysis confirms the theoretical guarantees:

  • Unbiasedness: Inner-product estimate histograms (TurboQuant) display zero bias across all bit widths. The variance closely matches the predicted $\pi/(2d)$ scaling.
  • Quality Preservation: Benchmarks (NarrativeQA, Qasper, 2WikiMultiQA) demonstrate no loss—and occasionally improvement—in F1 compared to full-precision baseline at 3-bit quantization (Zandieh et al., 2024).
  • Speedup: Prompt encoding and quantization incur only $\sim 5\%$ overhead compared to the no-quantization baseline, while overall inference for long context is $2\times$ faster than exact computation due to lower memory traffic and simplified arithmetic.

8. Summary Table: Key Properties of QJL

| Property | Description | Reference |
|---|---|---|
| Sketch type | 1-bit sign of random Gaussian projection, with stored $\ell_2$ norm | (Zandieh et al., 2024) |
| Estimation property | Unbiased inner-product, distortion $\varepsilon$ with $m = O(\varepsilon^{-2} \log n)$ | (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025) |
| Memory overhead | None (vanishing per-entry as $n$ grows) | (Zandieh et al., 2024) |
| Dual-stage use (TurboQuant) | 1-bit QJL on residual after $(b-1)$-bit MSE quantizer; unbiased, variance-bounded | (Zandieh et al., 28 Apr 2025) |

In summary, the 1-bit Quantized JL transform offers a mathematically principled, hardware-friendly strategy for compressing vectors while preserving critical linear measurements, eliminating memory overheads of traditional quantization, and enabling accurate and efficient inference at scale (Zandieh et al., 2024, Zandieh et al., 28 Apr 2025).
