Unified Table-Lookup Abstraction

Updated 17 November 2025
  • Unified Table-Lookup Abstraction is a design paradigm that recasts computations as indexed table lookups, enabling precise control over space, time, and accuracy trade-offs.
  • It employs parameterization (e.g., tunable λ, γ, bit-width, and centroid counts) to replace traditional arithmetic and control operations with precomputed, hardware-aware table accesses.
  • The abstraction supports diverse applications—from quantum lookup oracles and low-bit neural inference on CPUs/FPGAs/NPUs to efficient static data retrieval—yielding significant energy, speed, and resource optimizations.

The unified table-lookup abstraction is a design paradigm that reinterprets various data access and computation tasks as table lookups rather than traditional arithmetic or control operations. Across quantum computing, neural network inference (CPU/FPGA/NPU), and compressed static data retrieval, this abstraction enables implementations that optimize for gate count, energy, memory, or latency by explicitly trading classic computational primitives for precomputed, indexed table accesses. Recent research has systematically unified and parameterized these approaches, demonstrating that table-based formulations can interpolate between classic baselines and uncover new, sublinear scaling regimes in their respective metrics.

1. General Principles of the Unified Table-Lookup Abstraction

The core idea is to characterize a broad class of data-intensive or computation-intensive primitives as lookups into preconstructed tables, which may be realized in quantum circuits, classical memory, or hardware-mapped logic. The unification emerges via:

  • Parameterization: Introducing tunable parameters (block size, codebook size, bit-plane grouping, etc.) that control space-time-accuracy tradeoffs and interpolate between previous specialized lookup schemes.
  • Functional Reinterpretation: Replacing explicit compute or control (e.g., multiplications, branching, routing) by the application of indexed table reads, often exploiting structure in the data (quantization, redundancy, static keys).
  • Hardware-Aware Decomposition: Tailoring table layout and access patterns to hardware constraints (locality, bandwidth, parallelism, resource sharing).

This approach is instantiated in quantum lookup oracles (Zhu et al., 26 Jun 2024), low-bit neural network inference on CPUs (Wei et al., 25 Jun 2024), NPUs (Wei et al., 14 Nov 2025), FPGAs (Gerlinghoff et al., 18 Mar 2024), compressed static retrieval (Coleman et al., 2023), and neural network approximation via centroid-based tables (Tang et al., 2023).
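As a minimal illustration of the functional-reinterpretation principle (a hypothetical example, not drawn from any of the cited systems), the following Python sketch replaces per-element quantized multiplies with reads from a precomputed product table:

```python
import numpy as np

# Hypothetical sketch: replace 4-bit x 4-bit multiplies with a precomputed
# product table. Every (a, w) product is computed once, then reused as an
# indexed lookup -- trading arithmetic for memory, the core of the abstraction.
BITS = 4
LEVELS = 1 << BITS                      # 16 quantization levels

# Precompute all LEVELS x LEVELS products once (the space cost).
PRODUCT_TABLE = np.fromfunction(lambda a, w: a * w, (LEVELS, LEVELS), dtype=np.int32)

def lut_dot(a_q: np.ndarray, w_q: np.ndarray) -> int:
    """Dot product of two 4-bit-quantized vectors via table lookups only."""
    return int(PRODUCT_TABLE[a_q, w_q].sum())

a_q = np.array([3, 15, 7, 0], dtype=np.int64)
w_q = np.array([2, 1, 9, 5], dtype=np.int64)
assert lut_dot(a_q, w_q) == int(np.dot(a_q, w_q))
```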

2. Quantum Table Lookups: Parameterized Circuits for Oracle Implementation

Zhu–Sundaram–Low (Zhu et al., 26 Jun 2024) define a parameterized quantum circuit architecture for table-lookup oracles, crucial for quantum machine learning and simulation. Let the memory have $N = 2^n$ entries. The abstraction introduces tunable parameters $\lambda, \gamma$ controlling the circuit's space (ancilla qubit count), time (T-count), and error resilience. The circuit consists of three conceptual stages:

  1. Address Routing (Stage I):
    • $d = \log_2(N/\lambda)$ linear routers partition memory into $N/\lambda$ blocks.
    • A CSWAP-tree of depth $d'$ further routes the active block to one of $2^{d'}$ subblocks of size $\gamma$.
  2. Scatter + Data Fan-In (Stage II):
    • For each block index $i$, a control qubit $q_i$ is prepared, routed, and fanned out via a CNOT-tree to apply controlled-NOTs from selected memory cells.
  3. Route-Out (Stage III):
    • The resulting $\gamma$ qubits are routed to the output bus via another CSWAP-tree.

The resource requirements, trading among qubit width $Q = O(\log(N/\lambda) + \lambda)$, T-count $T = O(N/\gamma + (N/\lambda)d + \lambda)$, and infidelity $E$ due to various error sources, are explicitly parameterized. By choosing $(\lambda, \gamma)$, the circuit recovers QRAM (all-to-all, $Q = O(N)$), QROM ($\log$-width), SELECT–SWAP ($\sqrt{N}$-width), or new sublinear-in-all-metrics regimes (e.g., $Q = O(N^{1/2})$, $T = O(N^{3/4})$, $E = O(N^{3/4}\,\mathrm{polylog}\,N)$).
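These closed-form scalings make the trade-off space easy to explore numerically. The Python sketch below is a hypothetical estimator that plugs a chosen $(\lambda, \gamma)$ into the leading-order expressions above, ignoring constant factors and error terms:

```python
import math

def lookup_resources(N: int, lam: float, gamma: float) -> dict:
    """Leading-order resource estimates for the parameterized lookup circuit.

    Hypothetical estimator based on the asymptotic expressions
    Q = O(log(N/lam) + lam) and T = O(N/gamma + (N/lam)*d + lam),
    with d = log2(N/lam); constants and error terms are ignored.
    """
    d = math.log2(N / lam)
    Q = d + lam                           # ancilla qubit width
    T = N / gamma + (N / lam) * d + lam   # T-count
    return {"Q": Q, "T": T}

N = 2 ** 20
# Sublinear-in-all-metrics point: (lam, gamma) = (N^(1/2), N^(1/4)).
print(lookup_resources(N, lam=N ** 0.5, gamma=N ** 0.25))
# SELECT-SWAP-like width: lam = gamma = N^(1/2) gives Q, T ~ sqrt(N) up to logs.
print(lookup_resources(N, lam=N ** 0.5, gamma=N ** 0.5))
```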

Crucially, all scaling regimes can be achieved with only 2D nearest-neighbor connectivity by recursive H-tree layouts and teleportation-based SWAPs. This unifies and generalizes existing quantum lookup schemes within a single architecture.

3. Unified Table-Lookup in Low-Bit Neural Network Inference

Table-lookup abstractions in neural network inference replace or approximate large-scale matrix multiplications with structured, indexed access to precomputed tables.

3.1. CPU/Edge: Bit-Serial mpGEMM via T-MAC

T-MAC (Wei et al., 25 Jun 2024), for low-bit LLMs on CPUs, decomposes a quantized weight matrix $W \in \{0, \dots, 2^b - 1\}^{M \times K}$ into $b$ one-bit planes $W^i$. Each mixed-precision matmul $A \times W$ is expressed as:

$$A \times W = \sum_{i=0}^{b-1} 2^i \,(A \times W^i)$$

Each term uses a precomputed table $T_i$ of partial products, indexed by $g$-bit packed weight patterns, eliminating arithmetic multiplications (see the sketch after this list):

  • Table size per plane: $2^g \times N_{tn}$; the per-plane footprint is fixed, so total table size scales linearly with $b$.
  • Efficient CPU implementations leverage SIMD table-lookup instructions (NEON, AVX2).
  • Resource footprint and throughput scale with $b$, with empirical results showing up to 4× throughput and 70% energy reduction over previous CPU LLM systems.
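A minimal NumPy sketch of this bit-serial decomposition follows. It is an illustrative re-implementation under assumed shapes and parameters, not the T-MAC kernel itself (which packs indices and uses SIMD gather instructions):

```python
import numpy as np

def lut_mpgemm(A, W_q, b=2, g=4):
    """Illustrative bit-serial mpGEMM: A (float, length K) x W_q (uint, M x K).

    Each one-bit plane W^i is processed in groups of g weights: for every
    g-length activation segment, the 2^g possible partial sums are
    precomputed once, so each weight group costs a single table read.
    """
    K = A.shape[0]
    assert K % g == 0
    out = np.zeros(W_q.shape[0])
    for i in range(b):                        # iterate over one-bit planes
        plane = (W_q >> i) & 1                # M x K binary plane
        for seg in range(K // g):
            a_seg = A[seg * g:(seg + 1) * g]
            # Precompute all 2^g partial dot products for this segment.
            patterns = (np.arange(1 << g)[:, None] >> np.arange(g)) & 1
            table = patterns @ a_seg          # shape (2^g,)
            # Pack each row's g plane bits into a table index.
            idx = (plane[:, seg * g:(seg + 1) * g] << np.arange(g)).sum(axis=1)
            out += (1 << i) * table[idx]      # bit-plane shift-accumulate
    return out

A = np.random.randn(8)
W_q = np.random.randint(0, 4, size=(3, 8), dtype=np.uint8)  # 2-bit weights
assert np.allclose(lut_mpgemm(A, W_q), A @ W_q.T.astype(float))
```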

The abstraction generalizes across bit-widths and hardware targets, supporting kernel synthesis for TVM/XLA and enabling embedded accelerators to eliminate multipliers altogether.

3.2. NPU: Fused Table-Based Dequantization in T-MAN

T-MAN (Wei et al., 14 Nov 2025) extends the abstraction for NPUs, subsuming unsupported operations (e.g., affine dequantization) into two-level LUT pipelines:

  • Level 1 (Repacking LUT $\mathcal{R}$): Converts packed binary bit-planes into parallel quantized integers.
  • Level 2 (Conversion LUT $\mathcal{C}$): Applies scale and zero-point affine dequantization. The process is entirely LUT-driven:

$$x_{0:G-1} = \Bigl[\,\mathcal{C}\bigl[\textstyle\sum_{b=0}^{B-1} \mathcal{R}[\beta_{j,b}] \ll b\bigr]\Bigr]_{j=0}^{G-1}$$

Unified tile layouts and concurrency-aware tiling allow a single table arrangement to serve both prefill (GEMM) and decoding (GEMV), with up to 3.1× decoding and 1.4× prefill speedups, and an 84% energy reduction relative to prior NPU methods.
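The two-level pipeline can be mimicked in a few lines of Python. The sketch below uses hypothetical table contents (a trivial identity repacking table and a dense affine conversion table) and is illustrative only; the actual kernels run on NPU vector units:

```python
import numpy as np

def two_level_lut_dequant(bit_planes, scale, zero_point):
    """Sketch of two-level LUT dequantization.

    bit_planes: uint8 array of shape (B, G) -- bit b of each of G weights.
    Level 1 (R): a repacking table turning packed plane bits into integer
    lanes; here a trivial identity table over {0, 1} for clarity.
    Level 2 (C): a conversion table applying the affine map scale*(q - zp),
    indexed by the reassembled B-bit integer. All work is table reads plus
    shifts, matching the formula above.
    """
    B, G = bit_planes.shape
    R = np.arange(2, dtype=np.int32)                 # level-1 repacking LUT
    C = scale * (np.arange(1 << B) - zero_point)     # level-2 conversion LUT
    q = np.zeros(G, dtype=np.int32)
    for b in range(B):
        q |= R[bit_planes[b]] << b                   # rebuild quantized ints
    return C[q]                                      # affine dequantization

planes = np.array([[1, 0, 1, 1], [0, 1, 1, 0]], dtype=np.uint8)  # B=2, G=4
print(two_level_lut_dequant(planes, scale=0.5, zero_point=2))
# q = [1, 2, 3, 1] -> 0.5 * (q - 2) = [-0.5, 0.0, 0.5, -0.5]
```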

3.3. FPGA: Bit-Serial LUT-MAC via TLMAC

TLMAC (Gerlinghoff et al., 18 Mar 2024) synthesizes multiplies and accumulates entirely as lookups into FPGA LUTs:

  • Activations are expanded bit-serially; each group of $G \leq 6$ bits (matching LUT6 primitives) addresses a table whose contents encode all possible dot products with the fixed weights.
  • Parameterized place-and-route algorithms cluster and map unique weight groups, minimizing LUT and routing footprint via clustering and simulated annealing.
  • Scales to a full 3-bit ImageNet ResNet-18 with 71.94% Top-1 accuracy and 12× less logic area than prior binary methods, eliminating explicit weight movement from memory; a simplified software model follows.
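The following Python sketch mirrors only the addressing scheme (the real design is synthesized hardware; group size, bit widths, and shapes here are illustrative assumptions). Unlike the T-MAC sketch above, it is the activations that are fed bit-serially, while the tables are baked from fixed weights:

```python
import numpy as np

def lut6_dot(act_q, weights, act_bits=3, G=6):
    """Bit-serial LUT-MAC sketch: each group of G weights has a 2^G-entry
    table holding every possible group dot product, so a MAC becomes an
    address plus an accumulate."""
    K = len(weights)
    assert K % G == 0
    tables = []
    for grp in range(K // G):
        w = weights[grp * G:(grp + 1) * G]
        addr_bits = (np.arange(1 << G)[:, None] >> np.arange(G)) & 1
        tables.append(addr_bits @ w)          # all 2^G group dot products
    acc = 0
    for s in range(act_bits):                 # one activation bit-plane per step
        plane = (act_q >> s) & 1
        for grp in range(K // G):
            addr = int((plane[grp * G:(grp + 1) * G] << np.arange(G)).sum())
            acc += tables[grp][addr] << s     # shift-accumulate
    return acc

act_q = np.random.randint(0, 8, size=12)      # 3-bit activations
weights = np.random.randint(-4, 4, size=12)
assert lut6_dot(act_q, weights) == int(act_q @ weights)
```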

4. Lookup-Table Abstraction for Centroid-Based and Compressed Static Access

Beyond direct matrix arithmetic, unified table-lookup abstractions underpin both explicit model approximation and static data retrieval.

4.1. Centroid-Based Neural Approximations (LUT-NN)

LUT-NN (Tang et al., 2023) partitions each input vector into $C$ chunks, learns $K$ centroids per chunk, and precomputes the effect (e.g., matrix products) of each centroid. At inference, each chunk is assigned to its nearest centroid (via softmax straight-through estimation during training), reducing the original multiply-accumulate to indexed table summation with quantization-aware updates. The compression/accuracy trade-off is controlled by chunk size and centroid count, yielding 5×–16× FLOP savings with <2.5% accuracy drop across image models and LLMs.
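A product-quantization-style sketch of the inference path, in the spirit of LUT-NN (training with the softmax straight-through estimator is omitted, and all shapes below are assumed toy values):

```python
import numpy as np

def lutnn_infer(x, centroids, tables, C):
    """Centroid-based inference sketch.

    centroids: (C, K, d) learned per-chunk centroids, d = len(x) // C.
    tables:    (C, K, n_out) precomputed effect of each centroid, e.g.
               tables[c, k] = centroids[c, k] @ W[c*d:(c+1)*d, :].
    The multiply-accumulate collapses to C nearest-centroid searches
    plus C indexed table reads.
    """
    d = len(x) // C
    out = np.zeros(tables.shape[-1])
    for c in range(C):
        chunk = x[c * d:(c + 1) * d]
        k = np.argmin(((centroids[c] - chunk) ** 2).sum(axis=1))  # nearest centroid
        out += tables[c, k]                                       # indexed summation
    return out

# Assumed toy setup: 8-dim input, C=2 chunks, K=4 centroids, 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))
centroids = rng.standard_normal((2, 4, 4))
tables = np.einsum('ckd,cdn->ckn', centroids, W.reshape(2, 4, 3))
x = rng.standard_normal(8)
print(lutnn_infer(x, centroids, tables, C=2))   # approximates x @ W
```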

4.2. Succinct Static Lookup via Compressed Static Functions (CARAMEL)

CARAMEL (Coleman et al., 2023) addresses lookup of static key-to-row-valued data, prevalent in embedding tables and NLP token indices. Each value column is stored by a Bloom filter (when beneficial) and a compressed static function (CSF) using entropy-optimal prefix codes and XORSAT-based hash decoders. The construction achieves space close to $\sum_i N H_0(A_i)$ (the empirical zero-order entropy), $O(1)$ per-value lookup, and practical compression ratios up to 16× in real systems.
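The space side of this claim is easy to make concrete. The sketch below (a hypothetical helper, ignoring the CSF's small construction overheads) computes the zero-order entropy bound $N H_0(A)$ that a single compressed column approaches:

```python
import numpy as np
from collections import Counter

def csf_space_bound_bits(column):
    """Empirical zero-order entropy bound N * H0(A) for one value column,
    i.e. the space (in bits) a compressed static function approaches;
    per-construction overheads are ignored in this sketch."""
    N = len(column)
    counts = np.array(list(Counter(column).values()), dtype=float)
    p = counts / N
    H0 = -(p * np.log2(p)).sum()
    return N * H0

# Hypothetical skewed column: entropy coding pays off when values repeat.
column = ['a'] * 900 + ['b'] * 90 + ['c'] * 10
print(f"{csf_space_bound_bits(column):.0f} bits vs {len(column) * 8} bits raw")
```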

5. Parameterizations, Trade-Offs, and Unified Regimes

A central advance is the explicit parameterization of the table-lookup abstraction via tunable variables ($\lambda$, $\gamma$, centroid count $K$, chunk/group size $g$, bit width $b$, tile size, etc.) that allow continuous navigation of the resource–accuracy–latency envelope. For example, in quantum architectures (Zhu et al., 26 Jun 2024):

Scheme | Qubit width | T-count | Error/Infidelity | Connectivity
QRAM (bucket-brigade) | $O(N)$ | $O(N)$ | $O(\log^2 N)$ | all-to-all
QROM | $O(\log N)$ | $O(N)$ | $O(N)$ | all-to-all
SELECT–SWAP | $O(\sqrt{N})$ | $O(\sqrt{N})$ | $O(N)$ | all-to-all
Unified $(\lambda, \gamma)$ | $O(\log(N/\lambda) + \lambda)$ | $O(N/\gamma + (N/\lambda)d + \lambda)$ | polylog | 2D local

Analogous trade-offs appear in table-based CPU/FPGA/NPU neural inference (memory use vs. FLOPs vs. accuracy).

For many applications, previously unattainable regimes emerge, such as simultaneous sublinear scaling in all crucial metrics. In the quantum setting, choosing $(\lambda, \gamma) = (N^{1/2}, N^{1/4})$ yields $Q = O(N^{1/2})$, $T = O(N^{3/4})$, and $E = O(N^{3/4}\,\mathrm{polylog}\,N)$. Similar crossover regimes (e.g., in LUT-NN or TLMAC) allow one to interpolate between memory-dominant, compute-dominant, or balanced solutions.
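To see how this point arises, substitute $\lambda = N^{1/2}$, $\gamma = N^{1/4}$, and $d = \log_2(N/\lambda) = \tfrac{1}{2}\log_2 N$ into the expressions from Section 2:

$$T = O\bigl(N/\gamma + (N/\lambda)\,d + \lambda\bigr) = O\bigl(N^{3/4} + \tfrac{1}{2}N^{1/2}\log_2 N + N^{1/2}\bigr) = O(N^{3/4}),$$

while $Q = O(\log(N/\lambda) + \lambda) = O(N^{1/2})$, so width, T-count, and infidelity are simultaneously sublinear in $N$.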

6. Practical Considerations, Optimizations, and Deployment

Across domains, practical efficiency relies on advanced table construction, memory and data layout, parallelization, and hardware-specific optimizations:

  • SIMD vectorization and instruction-level parallelism for table-indexed kernel execution in CPU and NPU inference (Wei et al., 25 Jun 2024, Wei et al., 14 Nov 2025).
  • Memory layout and cache-awareness to improve access locality, as in table tiling (Wei et al., 14 Nov 2025), column/bit grouping, or centroid-stationary caches (Tang et al., 2023).
  • Clustering and routing optimizations in FPGA mapping (Gerlinghoff et al., 18 Mar 2024), reducing LUT utilization and logic congestion via clustering/annealing.
  • Row- and column-permutation heuristics to minimize compressed static table entropy in CARAMEL, reducing space by up to 40% on benchmarks (Coleman et al., 2023).
  • Multi-level pipelining to orchestrate concurrent data movement, table unpacking, and kernel execution (Wei et al., 14 Nov 2025).

Suitability depends on workload properties: access pattern predictability, key set mutability (CARAMEL is static only), hardware constraints (register/tile size, vector width), and acceptable lossiness (as in LUT-NN or quantized models).

7. Broader Impact and Future Directions

The unified table-lookup abstraction provides a convergent formulation applicable from quantum algorithms to edge AI, neural accelerators, and large-scale compressed static mapping. It enables deployment scenarios that are otherwise unattainable with standard computational models, such as LLMs on sub-2 W SoCs, full ImageNet CNNs on soft-logic FPGAs, or polylog-depth quantum memory access with local connectivity.

Future directions include:

  • Extending activation quantization to symmetric table-lookup inference (as suggested in T-MAC).
  • Automated backend integration (e.g., TVM/XLA) for model compilation into LUT kernels.
  • Hardware support for table-lookup primitives in NPUs and custom ASICs.
  • Dynamic, updatable static lookup schemes (an open limitation in CARAMEL).
  • Deep theoretical analysis of tradeoff envelopes (e.g., lower bounds across all table-based inference schemes).

The unified table-lookup abstraction thus serves both as a formal framework and a practical toolkit to optimize resource-intensive computation and data retrieval across classical and quantum architectures.
