Two-Level LUT-based Dequantization
- Two-level LUT-based dequantization is a hierarchical method that employs staged lookup tables to reconstruct low-bit quantized signals with minimal distortion.
- It combines aggressive memory compression, ultra-low-latency logic, and high-fidelity recovery, with applications spanning ADC post-correction, quantized matrix multiplication, and LLM inference.
- The technique leverages optimized table construction, bit-masking indexing, and hierarchical quantization schemes to achieve significant gains in computational efficiency and numerical robustness.
Two-level LUT-based dequantization is a class of computational and hardware techniques for reconstructing (dequantizing) low-bit-width quantized data using a hierarchical or staged lookup-table (LUT) mechanism, in which each level transforms or refines the quantized representation to recover the original signal or weight values with minimal distortion and maximal efficiency. The paradigm is significant in fields ranging from ADC signal recovery (Kasher et al., 24 Jul 2025, Kasher et al., 24 Jul 2025) to low-bit matrix multiplication (Kaplan et al., 19 May 2025) and LLM inference acceleration on edge devices and NPUs (Wei et al., 14 Nov 2025, Nie et al., 22 Oct 2025). Two-level LUT architectures enable exponential compression of the memory footprint, shallow ultra-low-latency logic, and superior numerical performance relative to single-level or arithmetic dequantization, provided that the quantization models and table construction are properly optimized.
1. Mathematical Formulations and Core Principles
The two-level LUT-based dequantization approach decomposes the recovery (dequantization) process into discrete stages, each implemented via table lookup:
- Signal Model (ADC domain): An analog input $x[n]$ passes through a finite-resolution quantizer $Q(\cdot)$, resulting in a discrete output $y[n] = Q(x[n] + d[n])$, often with additive Gaussian dither $d[n]$ (Kasher et al., 24 Jul 2025). The MMSE estimator
$$\hat{x}[n] = \mathbb{E}\bigl[x[n] \mid y[n], y[n-1], \ldots, y[n-N+1]\bigr]$$
is computed offline for each LUT entry indexed by the quantized history (a Monte Carlo sketch of this construction follows the list below).
- Hierarchical Quantization (Matrix multiplication, LLM inference): A hierarchical quantizer splits the quantization into coarse/fine layers:
- Nested-lattice (matrix): Decompose a vector $x$ into layers $x_1, x_2$ (each drawn from a base lattice codebook $\mathcal{C}$ and a scaled copy $q\,\mathcal{C}$), and reconstruct with $\hat{x} = x_1 + q\,x_2$,
enabling direct LUT-based inner-product decoding with a table of size $|\mathcal{C}|^2$ instead of the exponentially larger table required by a single-level codebook of equal rate (Kaplan et al., 19 May 2025).
- Hierarchical Linear Quantization (HLQ, LLM inference):
  1. Coarse quantization: compute index $q_1$ under scale $s_1$ and zero-point $z_1$.
  2. Residual quantization: quantize the residual $r = w - s_1(q_1 - z_1)$ to index $q_2$ under scale $s_2$ and zero-point $z_2$.
  Reconstruction: $\hat{w} = s_1(q_1 - z_1) + s_2(q_2 - z_2)$ (a code sketch appears at the end of this section).
- Vectorized two-level LUT (NPUs): First, bit-plane repacking arranges quantized weights, then affine dequantization outputs scaled FP16 values in a single vectorized pass (Wei et al., 14 Nov 2025).
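To make the ADC-domain construction above concrete, the following is a minimal Monte Carlo sketch of offline MMSE LUT construction. The AR(1) signal model, dither level, and helper names (`quantize`, `HISTORY`) are illustrative assumptions, not details from the cited papers.

```python
# Minimal Monte Carlo sketch: build an MMSE LUT indexed by quantized history.
import numpy as np

rng = np.random.default_rng(0)
BITS, HISTORY = 3, 2                      # quantizer resolution, history length
LEVELS = 2 ** BITS

def quantize(x):
    """Uniform quantizer clipped to [-1, 1), returning integer codes."""
    return np.clip(np.floor((x + 1.0) * LEVELS / 2).astype(int), 0, LEVELS - 1)

# Simulate a correlated Gaussian signal with additive Gaussian dither.
n = 200_000
x = np.empty(n)
x[0] = rng.standard_normal() * 0.3
for t in range(1, n):                     # AR(1) process, correlation 0.9
    x[t] = 0.9 * x[t - 1] + 0.3 * rng.standard_normal()
codes = quantize(x + 0.05 * rng.standard_normal(n))

# LUT entry = E[x_t | last HISTORY quantizer outputs], estimated empirically.
idx = sum(codes[HISTORY - 1 - k : n - k] * LEVELS**k for k in range(HISTORY))
table_sum = np.bincount(idx, weights=x[HISTORY - 1:], minlength=LEVELS**HISTORY)
table_cnt = np.bincount(idx, minlength=LEVELS**HISTORY)
mmse_lut = np.divide(table_sum, np.maximum(table_cnt, 1))  # conditional means

# Online dequantization is then a single table lookup per sample.
print("LUT size:", mmse_lut.size, "entries")
```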
A common principle is that each stage—be it analog modeling, codebook selection, residual quantization, or bit manipulation—can be solved analytically and encoded in a LUT indexed by compact and optimized addresses.
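As a companion sketch for the HLQ bullet above, here is a minimal two-level hierarchical linear quantizer in the affine form reconstructed there. Bit widths, per-tensor (rather than per-group) scaling, and rounding choices are illustrative assumptions, not the exact HLQ recipe of (Nie et al., 22 Oct 2025).

```python
# Sketch of two-level (coarse + residual) hierarchical linear quantization.
import numpy as np

def quant_affine(w, bits):
    """Per-tensor affine quantization; returns codes, scale, zero-point."""
    lo, hi = w.min(), w.max()
    scale = float(hi - lo) / (2**bits - 1) or 1.0
    zp = int(np.round(-lo / scale))
    q = np.clip(np.round(w / scale) + zp, 0, 2**bits - 1).astype(np.int32)
    return q, scale, zp

def hlq_quantize(w, bits1=2, bits2=2):
    q1, s1, z1 = quant_affine(w, bits1)          # level 1: coarse
    resid = w - s1 * (q1 - z1)
    q2, s2, z2 = quant_affine(resid, bits2)      # level 2: residual
    return (q1, s1, z1), (q2, s2, z2)

def hlq_dequantize(level1, level2):
    (q1, s1, z1), (q2, s2, z2) = level1, level2
    return s1 * (q1 - z1) + s2 * (q2 - z2)       # sum of the two stages

w = np.random.default_rng(1).standard_normal(4096).astype(np.float32)
l1, l2 = hlq_quantize(w)
err2 = np.mean((w - hlq_dequantize(l1, l2)) ** 2)
err1 = np.mean((w - l1[1] * (l1[0] - l1[2])) ** 2)
print(f"coarse-only MSE {err1:.5f}  vs  two-level MSE {err2:.5f}")
```

The printout typically shows the residual level cutting the coarse-only MSE substantially, which is exactly the effect the hierarchical scheme exploits.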
2. Table Construction, Indexing, and Memory Reduction
Memory efficiency is achieved by compressing the table size through multi-level design and optimized indexing:
- Bit-Masking Indexing: Rather than indexing over the full $B$-bit history, select a subset of $b < B$ bit positions and form masked decimal indices from the retained bits. Greedy minimization of an analytic MSE proxy (Algorithm 1) yields $2^{b}$ LUT entries instead of $2^{B}$ (Kasher et al., 24 Jul 2025); a toy greedy search appears at the end of this section.
- High-Probability Indexing (HPI): Retain only the subset of indices whose occurrence probability under the signal model exceeds a threshold. Monte Carlo set-building (Algorithm 2) reduces table size by up to 10,000×, with only marginal MSE and SFDR loss (Kasher et al., 24 Jul 2025).
- Hierarchical Codebooks (Nested-Lattice, HLQ): In $M$-layer schemes, only the base-layer codebook needs fully precomputed products or reconstructions, reducing LUT storage exponentially relative to a single-level codebook of equal rate (Kaplan et al., 19 May 2025). HLQ uses two small LUTs, one per level, both of which fit within L1 cache for practical group sizes (Nie et al., 22 Oct 2025).
- NPU Tiling: NPUs hold the first-level repacking LUT and the second-level scaling LUT in on-chip memory (2 KB for a typical configuration), amortized over large numbers of MACs (Wei et al., 14 Nov 2025).
Efficiency trade-offs are navigated by moving along the three-way Pareto front among MSE, SFDR, and table size to meet the desired accuracy-versus-memory target (Kasher et al., 24 Jul 2025).
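The following toy sketch illustrates bit-masking indexing: a greedy search picks the history-bit subset whose conditional-mean LUT minimizes empirical MSE. The empirical score is a stand-in for the analytic MSE proxy of Algorithm 1, and all names are hypothetical.

```python
# Toy greedy bit-mask selection for compressed LUT indexing.
import numpy as np

def masked_index(words, mask_bits):
    """Pack the selected bit positions of each word into a compact index."""
    idx = np.zeros(len(words), dtype=np.int64)
    for j, b in enumerate(mask_bits):
        idx |= ((words >> b) & 1) << j
    return idx

def lut_mse(words, targets, mask_bits):
    """MSE of the best (conditional-mean) LUT for this bit subset."""
    idx = masked_index(words, mask_bits)
    size = 1 << len(mask_bits)
    means = np.bincount(idx, weights=targets, minlength=size)
    means /= np.maximum(np.bincount(idx, minlength=size), 1)
    return np.mean((targets - means[idx]) ** 2)

def greedy_mask(words, targets, total_bits, keep):
    """Greedily add the bit that most reduces MSE until `keep` bits chosen."""
    chosen = []
    for _ in range(keep):
        rest = [b for b in range(total_bits) if b not in chosen]
        chosen.append(min(rest, key=lambda b: lut_mse(words, targets, chosen + [b])))
    return sorted(chosen)

rng = np.random.default_rng(2)
targets = rng.standard_normal(100_000)
words = np.clip((targets + 4) * 32, 0, 255).astype(np.int64)  # 8-bit codes
print("selected bits:", greedy_mask(words, targets, total_bits=8, keep=4))
```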
3. Hardware and Real-Time Implementation
Two-level LUT designs directly enable ultra-low latency logic and efficient resource usage:
- Combinational Logic (ADC): The compressed LUT is implemented as two-level (sum-of-products) combinational logic. Only two gate delays separate input from output; flip-flops are eliminated, mitigating clock skew and supporting multi-GHz throughput (Kasher et al., 24 Jul 2025).
- NPUs (T-MAN): Vectorized LUT (VLUT16) instructions on HVX cores reconstruct k-bit weights from packed bit-planes, followed by affine scaling using T2, fully fusing dequantization and exposing no arithmetic or bit-fiddling in the compute path; a NumPy emulation of this flow appears at the end of this section. Prefill and decode are pipelined with DMA and matrix multiplication on HMX, hiding dequantization cost behind memory latency (Wei et al., 14 Nov 2025).
- CPUs (ELUTQ): LUTs are organized for cache-line-aligned loads (NEON); shuffle instructions fetch an entire LUT group in one go, eliminating scalar FP16 arithmetic per weight. Table quant-in-memory further interleaves int8 pairs for even higher efficiency (Nie et al., 22 Oct 2025).
Memory footprints can be reduced from multiple megabytes to hundreds of bytes (ADC post-correction) or a few kilobytes (LLM quantization), enabling deployment in tightly resource-constrained FPGA, ASIC, or edge SoC environments.
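Below is a NumPy emulation of the two-level vectorized path: level 1 rebuilds k-bit codes from packed bit-planes with byte-indexed table lookups, and level 2 applies a tiny affine LUT to emit FP16 values. Real NPU kernels express both stages with vector LUT instructions such as VLUT16; the packing layout, scale, and zero-point here are illustrative only.

```python
# Emulation of two-level LUT dequantization: bit-plane repack + affine scale.
import numpy as np

BITS = 2
codes = np.random.default_rng(3).integers(0, 1 << BITS, size=64, dtype=np.uint8)

# Pack each bit of the 2-bit codes into separate bit-plane bytes (8 codes/byte).
planes = np.packbits(
    ((codes[None, :] >> np.arange(BITS)[:, None]) & 1).astype(np.uint8), axis=1
)

# Level 1: a 256-entry per-byte LUT maps a packed byte to its 8 unpacked bits.
unpack_lut = np.unpackbits(np.arange(256, dtype=np.uint8).reshape(-1, 1), axis=1)
bits = np.stack([unpack_lut[p].reshape(-1) for p in planes])   # (BITS, 64)
recovered = (bits * (1 << np.arange(BITS))[:, None]).sum(0)    # k-bit codes

# Level 2: tiny affine LUT maps each code to scale * (code - zero_point), FP16.
scale, zp = 0.05, 1
dequant_lut = (scale * (np.arange(1 << BITS) - zp)).astype(np.float16)
weights_fp16 = dequant_lut[recovered]

assert np.array_equal(recovered, codes)      # lossless level-1 round trip
print(weights_fp16[:8])
```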
4. Quantitative Performance and Analytical Bounds
Comparative metrics illustrate the power and limitations of two-level LUT dequantization:
| Application Area | Key Metrics | Two-Level LUT Gain |
|---|---|---|
| ADC post-correction | SFDR, MSE | 19 dBc SFDR and 9 dB MSE gain (324 B RAM), output precision unchanged (Kasher et al., 24 Jul 2025) |
| Matrix multiplication | Mean-squared distortion | Exponentially smaller LUT than single-level at equal rate, with only a small distortion increase per bit (Kaplan et al., 19 May 2025) |
| LLM inference | Perplexity, throughput | 8% PPL reduction at 3-bit, 85% at 2-bit; 2.5–3.4× speedup at low bits, 25 tokens/s on Apple M2 (Nie et al., 22 Oct 2025); 1.4× prefill and 3.1× decode speedup, 84% energy saving on NPUs (Wei et al., 14 Nov 2025) |
Analytic bounds for hierarchical lattice codes imply that the two-level distortion is essentially optimal at moderate rates, with the loss relative to single-level codes vanishing as the rate grows (Kaplan et al., 19 May 2025).
In LLM inference, HLQ-GPTQ reduces PPL at both 2- and 3-bit quantization by margins (an 85% gain at 2-bit) unattainable by standard uniform quantizers, while practical runtimes reach 25 tokens/s on edge CPUs with multi-fold speedups versus prior art (Nie et al., 22 Oct 2025, Wei et al., 14 Nov 2025).
5. Dithering, Spectral Shaping, and Numerical Robustness
Dithering at various stages is used to flatten quantization error spectra and suppress spurious tones:
- Post-Quantization Digital Dithering: Three architectures are employed (Kasher et al., 24 Jul 2025); a toy comparison appears after this list:
- Intra-table: One dither per entry, lowest memory.
- Inter-table: Multiple parallel tables, randomly select a dithered value per index.
- Post-table: Add dither at runtime to full-precision estimate, ideal for spur suppression.
- All methods penalize MSE by a known 3 dB but afford large (19 dBc) SFDR gains, especially in wideband and harmonic-rich contexts.
- Recovery under Interference: Bayesian MMSE, ML, and MAP estimators can be applied to tables indexed by parametric histories of quantizer outputs; tractable analytic approximations for BPSK and LFM signals allow real-time error correction and robust interference cancellation (Kasher et al., 24 Jul 2025).
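A small sketch contrasting the three dither placements described above; the base table, dither scale, and index stream are toy stand-ins rather than values from the cited work. "Intra-table" bakes one fixed dither sample into each entry, "inter-table" picks among pre-dithered table copies, and "post-table" adds fresh dither at runtime.

```python
# Toy comparison of intra-table, inter-table, and post-table dithering.
import numpy as np

rng = np.random.default_rng(4)
base_lut = np.linspace(-1, 1, 16)          # full-precision estimates per index
sigma = 0.01                               # dither standard deviation
idx = rng.integers(0, 16, size=8)          # incoming quantized indices

intra_lut = base_lut + sigma * rng.standard_normal(16)        # one dither/entry
inter_luts = base_lut + sigma * rng.standard_normal((4, 16))  # parallel tables

y_intra = intra_lut[idx]
y_inter = inter_luts[rng.integers(0, 4, size=idx.size), idx]  # random table pick
y_post = base_lut[idx] + sigma * rng.standard_normal(idx.size)  # runtime dither
print(y_intra, y_inter, y_post, sep="\n")
```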
Signal reconstruction remains correct so long as input priors are known; nonstationary or mis-modeled inputs degrade performance in the MMSE design stage. This is an inherent limitation for model-driven rather than data-driven table construction.
6. Practical Considerations and Application Domains
Two-level LUT dequantization is suitable for several high-impact applications:
- Direct-RF ADC Correction: Real-time post-quantization correction for wideband direct-RF receivers and spectrum analyzers; a LUT occupying about 1 kB of RAM operates at two-gate logic depth with correspondingly low latency (Kasher et al., 24 Jul 2025, Kasher et al., 24 Jul 2025).
- Matrix Multiplication on CPUs: Enables high-rate, low-distortion quantized GEMM for ML workloads within L1 cache capacity. Four LUT accesses per product are amortized across vector tiles; the distortion penalty is negligible at target rates (Kaplan et al., 19 May 2025). A toy decoding sketch appears at the end of this section.
- Low-bit LLMs on Edge Devices: Hierarchical quantization (HLQ) and bit-serial LUT kernels permit 2–3-bit LLM inference at high throughput and accuracy without costly dequant arithmetic. Empirically, HLQ outperforms uniform quantization in both perplexity and speed, enabling broader deployment of large models on consumer CPUs (Nie et al., 22 Oct 2025).
- End-to-End Quantization on NPUs: Two-level VLUT kernels eliminate all on-line arithmetic for dequantization. Unified tiling and data layout enable efficient concurrent execution of DMA, vector, and matrix engines; the method achieves strict accuracy improvements versus hardware-constrained baselines (Wei et al., 14 Nov 2025).
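To illustrate the LUT-based inner-product decoding mentioned in the matrix-multiplication item above, here is a toy sketch: pairwise inner products between codebook vectors are precomputed so that each sub-vector product costs one table read. The random codebook, dimensions, and single-level encoding are illustrative simplifications of the nested-lattice scheme.

```python
# Toy LUT-based inner-product decoding for quantized GEMM.
import numpy as np

rng = np.random.default_rng(5)
d, K = 4, 16                                   # sub-vector dim, codebook size
codebook = rng.standard_normal((K, d))
ip_lut = codebook @ codebook.T                 # K*K precomputed inner products

def encode(v):
    """Nearest-codeword index for each d-dim sub-vector of v."""
    sub = v.reshape(-1, d)
    return np.argmin(((sub[:, None, :] - codebook) ** 2).sum(-1), axis=1)

a, b = rng.standard_normal(64), rng.standard_normal(64)
ia, ib = encode(a), encode(b)
approx = ip_lut[ia, ib].sum()                  # inner product via lookups only
print(f"exact {a @ b:+.3f}  approx {approx:+.3f}")
```

With a well-designed (nested-lattice) codebook and proper scaling, the lookup-only estimate tracks the exact product far more closely than this random-codebook toy suggests.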
7. Limitations, Trade-Offs, and Outlook
Performance is bounded by the validity of the parametric signal model in signal recovery, the quantization rate and group size chosen in ML inference, and the ability to fit LUTs into hardware caches or tightly coupled memory (TCM). While memory reductions of up to 10,000× are observed via index compression, shrinking the index budget or relaxing the retained-probability threshold too far degrades accuracy; managing this Pareto frontier is crucial.
In settings where model parameters or interference statistics are unknown or time-varying, analytic estimation may be suboptimal, and adaptive schemes or hybrid LUT/neural approaches may be required. Nevertheless, two-level LUT designs, especially when combined with optimized indexing, dithering, and robust quantization (nested-lattice, HLQ), set the benchmark for ultra-efficient post-correction and quantized computation in bandwidth- and resource-constrained domains.