Vector LUT-Centric Tensor Layout

Updated 13 December 2025
  • Vector LUT-Centric Tensor Layout is a method that fuses multiple tokens into unified table lookups, dramatically increasing cache-line utilization and memory-bandwidth utilization for neural network inference.
  • It employs tensor transformations and strategic table packing to align data with SIMD architectures, minimizing random memory accesses and cache thrashing.
  • The approach yields significant speedups and energy efficiency improvements in LLMs and CNNs through optimized tiling, quantization, and integrated hardware co-design.

Vector LUT-Centric Tensor Layout describes a set of tensor organization, dataflow, and table-packing strategies designed to maximize the throughput and memory efficiency of ultra-low-bit neural network inference via massively parallel vectorized table lookups. The term encompasses architectural and algorithmic design choices that fuse multiple dimensions of input, output, and token batch into the lookup table (LUT) axes, thereby aligning tensor layout with the requirements of modern SIMD CPUs, NPUs, or ASICs. Originating from bottlenecks in traditional scalar LUT inference for LLMs and CNNs, vector LUT-centric layouts are foundational for saturating memory bandwidth, exploiting parallelism, and enabling advanced quantization and sparsity schemes in edge-oriented and accelerator co-design scenarios (Li et al., 6 Dec 2025, Mo et al., 12 Aug 2024, Zhang et al., 12 Oct 2025, Huang et al., 17 Sep 2025).

1. Paradigms: Scalar vs. Vector LUT and Memory-Access Patterns

Scalar LUT layouts conduct per-token (1→1) lookups: for $N$ concurrent tokens, $N$ separate LUTs are instantiated, each of size $S \approx 3^g$ (for group size $g$). Memory accesses are random and small, as every token independently looks up its activation–weight group index, typically requiring $N$ independent, cache-unfriendly loads of 16–32 bytes each. This leads to cache thrashing, low cache-line utilization (<10%), and global memory bandwidth (BW) utilization under 40% (Li et al., 6 Dec 2025).

Vector LUT-centric layouts fuse multiple tokens into a single dimension of a unified table $T \in \mathbb{R}^{K/g \times 3^g \times N}$. Each lookup for a weight-group index is performed once per group, but returns a contiguous vector of $N$ outputs:

  • A stride-1 read fetches all $N$ token results simultaneously (e.g., $32 \times 2$ bytes = 64 B per load), perfectly matching the SIMD width and the L1/L2 cache-line size (a minimal C sketch of both access patterns follows this list).
  • This transformation raises cache-line utilization to $\sim 100\%$ and memory BW utilization to nearly the system maximum (Li et al., 6 Dec 2025).
  • For tile-based implementations (e.g., LUT Tensor Core (Mo et al., 12 Aug 2024), TENET (Huang et al., 17 Sep 2025)), the layout ensures each table tile or activation–weight tile is amortized across a large number of outputs, minimizing random accesses and maximizing element reuse.
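The contrast between the two access patterns can be made concrete with a short, illustrative C sketch. The function names, the INT16 entry type, and the pointer-per-token layout of the scalar tables are assumptions for illustration, not the exact data structures of the cited kernels.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar layout: one private LUT per token. For a given weight-group index,
 * N separate small, random loads are issued (one into each token's table). */
void lookup_scalar(const int16_t *const *lut_per_token, uint32_t idx,
                   int16_t *out, size_t num_tokens)
{
    for (size_t n = 0; n < num_tokens; n++)
        out[n] = lut_per_token[n][idx];      /* N cache-unfriendly accesses */
}

/* Vector layout: one unified table T with the token axis n innermost.
 * The same weight-group index now selects a contiguous run of N results. */
void lookup_vector(const int16_t *T, size_t k, uint32_t idx,
                   int16_t *out, size_t num_tokens, size_t pow3_g /* = 3^g */)
{
    const int16_t *row = T + (k * pow3_g + idx) * num_tokens;
    for (size_t n = 0; n < num_tokens; n++)  /* one stride-1, SIMD-friendly read */
        out[n] = row[n];
}
```

With $N = 32$ and 2-byte entries, the vector path's single 64 B stride-1 read replaces 32 scattered 2-byte loads, which is exactly the cache-line-utilization gain described above.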

2. Tensor Layout Transformation and Table Packing

Tensor layouts under vector LUT-centric regimes are specifically structured for fused, high-throughput access patterns and minimal redundant copying.

  • Activation and Output Layout Transformation: In vector-LUT, activations are transformed from feature-major (K×N) representations to token-major (N×K), enabling the fusion of the N-token axis into the table's stride-1 index. LUT precompute can interleave any required transpose, eliminating the need for explicit out-of-place copies. In llama.cpp, feature-major activations are immediately scattered into the vector-LUT axis (Li et al., 6 Dec 2025).
  • LUT Table Storage: $T$ is physically stored as a flat array laid out in row-major order over $(k, i, n)$; $n$ (token) is the fastest-varying dimension. The offset of element $T[k, i, n]$ is $(k \cdot 3^g + i) \cdot N + n$. Efficient tiling along both $k$ and $n$ constrains the working set to small tiles that fit within the L1 cache.
  • Weight Packing and Tile-Contiguity: Weights are grouped into chunks of $g$, mapped to an index via $idx = \sum_{j=0}^{g-1} (w_j + 1)\, 3^j$, then packed into 2D tiles ($M_\text{tile} \times K_\text{tile}/g$) for contiguous memory access during compute (see the sketch after this list).
  • Lattice Vector Quantization for CNNs: CNN LUT methods extend vectorization into the input patch space. An optimal lattice quantizer $Q_{\Lambda(B)}(x)$ is learned, where $B$ is a diagonal or full-rank basis determining adaptive step sizes (per-coordinate quantization resolution), enabling precise control of the table footprint and runtime memory access patterns (Zhang et al., 12 Oct 2025).
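The two indexing rules above can be sketched directly in C. The function names and the use of a runtime $3^g$ parameter are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Row-major offset of T[k][i][n] with the token axis n fastest-varying:
 * Addr_T(k, i, n) = (k * 3^g + i) * N + n. */
static inline size_t lut_offset(size_t k, size_t i, size_t n,
                                size_t pow3_g /* = 3^g */, size_t N)
{
    return (k * pow3_g + i) * N + n;
}

/* Pack one group of g ternary weights w_j in {-1, 0, +1} into its base-3
 * table index: idx = sum_j (w_j + 1) * 3^j. */
static inline uint32_t pack_ternary_group(const int8_t *w, int g)
{
    uint32_t idx = 0, pow3 = 1;
    for (int j = 0; j < g; j++) {
        idx += (uint32_t)(w[j] + 1) * pow3;  /* map {-1, 0, +1} to {0, 1, 2} */
        pow3 *= 3;
    }
    return idx;
}
```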

3. Mathematical Formulations and Indexing

Input–weight groupings and resultant LUTs are orchestrated by explicit indexing functions:

  • LLM Inference (Vec-LUT): Let $g$ denote the group size, $N$ the number of simultaneous tokens, and $K' = K/g$ the number of LUT tile partitions:

$$O[m, 0\dots N-1] = \sum_{k=0}^{K'-1} T[k,\ idx_{m,k},\ 0\dots N-1]$$

where

$$\mathrm{Addr}_T(k, i, n) = (k \cdot 3^g + i) \cdot N + n$$

  • CNN Patch Vectorization: For a patch vector $x \in \mathbb{R}^d$, quantize via $z_j = \mathrm{round}(x_j / b_j)$, then compute a mixed-radix linear index and look up the output vector (a reference C sketch of both indexing schemes follows this list).
  • Sparsity-Aware Encoding (TENET): Ternary vectors are grouped, and each group's ternary weight pattern $(w_j)$ is mapped to a minimal index tuple $(GIdx, DIdx, SIdx)$, reducing table storage and lookup complexity for each group (Huang et al., 17 Sep 2025).
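A reference (non-SIMD) C sketch of the LLM and CNN indexing schemes above follows. The helper names, accumulator types, and the clamping of the CNN patch index into a non-negative range are assumptions made to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Vec-LUT accumulation for one output row m:
 *   O[m, 0..N-1] = sum_{k=0}^{K'-1} T[k, idx[m][k], 0..N-1]
 * T is the flat (k, i, n) table with n innermost; idx_row holds idx[m][k]. */
void vec_lut_row(const int16_t *T, const uint32_t *idx_row,
                 int32_t *O_row, size_t Kp /* K' = K/g */,
                 size_t pow3_g /* = 3^g */, size_t N)
{
    for (size_t n = 0; n < N; n++)
        O_row[n] = 0;
    for (size_t k = 0; k < Kp; k++) {
        const int16_t *row = T + (k * pow3_g + idx_row[k]) * N;  /* stride-1 run */
        for (size_t n = 0; n < N; n++)                           /* vectorizable loop */
            O_row[n] += row[n];
    }
}

/* CNN patch index: quantize each coordinate with its own step size b[j],
 * then fold the per-coordinate level counts L[j] into a mixed-radix index. */
size_t patch_index(const float *x, const float *b, const int *L, int d)
{
    size_t idx = 0;
    for (int j = 0; j < d; j++) {
        int z = (int)(x[j] / b[j] + (x[j] >= 0.0f ? 0.5f : -0.5f)); /* round(x_j / b_j) */
        if (z < 0) z = 0;                        /* clamp into [0, L[j]-1] (assumed range) */
        if (z >= L[j]) z = L[j] - 1;
        idx = idx * (size_t)L[j] + (size_t)z;    /* mixed-radix accumulation */
    }
    return idx;
}
```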

4. Memory Bandwidth, Cache Behavior, and System Efficiency

Vector LUT-centric layouts fundamentally alter the bottleneck profile of inference kernels:

  • Scalar-LUT kernels spend up to 47% of runtime in latency-constrained, random memory lookups (Table 1, (Li et al., 6 Dec 2025)).
  • Vector-LUT kernels reduce lookup overhead to <1% (Table 8, (Li et al., 6 Dec 2025)) by consolidating $N$ random accesses into one contiguous, large access, saturating available memory bandwidth.
  • Roofline analysis shows that arithmetic intensity increases to $2$ FLOPs per loaded byte, moving the kernel from memory-BW-bound at $\approx 0.4\, B_\text{peak}$ (scalar) to approaching $B_\text{peak}$ (a worked estimate follows this list).
  • Empirical results on ARM Neoverse V1 (L1 = 64 KiB, BW $\approx$ 200 GB/s) show kernel-level speedups of $2$–$4\times$ and end-to-end prefill speedups of up to $4.2\times$ in LLMs (Li et al., 6 Dec 2025).
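A rough roofline check of these figures (assuming, for simplicity, the same $2$ FLOP/byte arithmetic intensity on both sides, so that only the achieved bandwidth differs):

$$\text{Perf}_\text{vector} \approx I \cdot B_\text{peak} = 2\,\tfrac{\text{FLOP}}{\text{B}} \times 200\,\tfrac{\text{GB}}{\text{s}} = 400\ \text{GFLOP/s}, \qquad \text{Perf}_\text{scalar} \approx 2 \times (0.4 \times 200) = 160\ \text{GFLOP/s},$$

a $\approx 2.5\times$ kernel-level gap, consistent with the reported $2$–$4\times$ speedups.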

5. Integration with Advanced Dataflows and Hardware Acceleration

Vector LUT-centric tensor layouts are integral to state-of-the-art accelerator software–hardware codesign:

  • Cache-Aware Tiling and Streamed Lookup: Tiling both LUT tables and weight tiles along the $k$ and $n$ axes ensures each tile fits into the L1 cache. Streamed precompute fuses LUT generation with lookup, keeping the entire tile's compute and memory traffic inside the cache and reducing DRAM traffic by a factor of ≈3 (Li et al., 6 Dec 2025).
  • Hierarchical Accumulation: INT16 tiles are used for intermediate accumulation to leverage higher-throughput SIMD paths—block-wise INT16 accumulations are finalized as INT32, exploiting hardware parallelism.
  • LUT Tensor Core: Tiles are elongated (e.g., $M=2$, $N=64$, $K=4$) to maximize LUT sharing across $N$ outputs. Table symmetrization halves storage by exploiting parity ($\mathrm{LUT}[q] = -\mathrm{LUT}[2^K - 1 - q]$; see the sketch after this list), while bit-serial MUX architectures allow diverse quantization schemes and custom instructions for efficient invocation (Mo et al., 12 Aug 2024).
  • TENET Sparse-Aware Dataflow: Grouped ternary weights, dynamic $N{:}M$ activation sparsity, and on-the-fly tile decompression modules maximize data reuse. Linear-projection-aware sparse attention dataflow pipelines LUT-based and high-precision GEMMs to balance memory and compute (Huang et al., 17 Sep 2025).
  • Vision Inference LVQ: LUTs for CNNs are cascaded in U-shaped multi-branch pools, increasing effective receptive field without combinatorially increasing any single table's size. All LUTs are packed in row-major order for highly parallel, bandwidth-efficient batch lookup (Zhang et al., 12 Oct 2025).
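A minimal C sketch of the table-symmetrization trick from the LUT Tensor Core bullet above, assuming a half-sized INT16 table and a runtime bit-width parameter (both illustrative choices):

```c
#include <stdint.h>

/* Only the lower half of a 2^Kbits-entry LUT is stored; an index in the
 * upper half is mirrored and negated, exploiting LUT[q] = -LUT[2^K - 1 - q]. */
static inline int16_t lut_sym_read(const int16_t *half_table,
                                   unsigned q, unsigned Kbits)
{
    unsigned full = 1u << Kbits;                 /* logical table size 2^K */
    if (q < full / 2)
        return half_table[q];                    /* stored directly */
    return (int16_t)(-half_table[full - 1 - q]); /* mirrored, sign-flipped */
}
```

This halves LUT storage at the cost of one index comparison and a sign flip per lookup.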

6. Implementation and Deployment Considerations

Practical instantiations of vector LUT-centric tensor layout exhibit the following characteristics:

  • Memory Footprint: Unified vector-LUT tables for $N=32$, $g=5$, and $K=14436$ require ≈43 MiB, reduced to <64 KiB per tile with the tiling strategies above (a worked check follows this list) (Li et al., 6 Dec 2025).
  • SIMD Alignment: All tiles and tables are aligned to SIMD-line boundaries (AVX2: 64 B, NEON: 16 B); $N_\text{tile}$ is sized to match or be a multiple of the SIMD width.
  • Offline/Runtime Model Integration: Model weights are transformed offline into packed, tile-contiguous layouts compatible with llama.cpp GGML files. At runtime, buffer allocation and operator calls are adapted to utilize the vector-LUT dataflow. No additional activation transpose is needed; activation packing is fused into LUT precompute.
  • Compact Code Footprint: Implementations require ∼50 lines of C with intrinsics for the vector kernel and ∼200 lines for the packer, facilitating straightforward integration into existing mGEMM and deep learning frameworks (Li et al., 6 Dec 2025).
  • Hardware PPA Implications: On Tensor Core-like hardware, vector-LUT layouts enable up to $6.93\times$ end-to-end inference speedup at iso-accuracy (LLaMA/OPT/BLOOM), $20.9\times$ compute density, and $11.2\times$ energy efficiency, while occupying only 38.3% of the area of a standard MAC-based Tensor Core (Mo et al., 12 Aug 2024).
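As a worked check of the ≈43 MiB footprint figure above, assuming 2-byte (INT16) table entries:

$$\frac{K}{g}\cdot 3^g \cdot N \cdot 2\,\text{B} = \frac{14436}{5}\cdot 243 \cdot 32 \cdot 2\,\text{B} \approx 4.49\times 10^{7}\,\text{B} \approx 43\ \text{MiB}.$$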

7. Extensions in Vision and Sparse Architectures

Vector LUT-centric layouts are not limited to LLMs. In vision inference:

  • Lattice Vector Quantizer (LVQ): Quantization step-sizes are jointly optimized for task performance and table size under constrained memory, yielding Pareto-optimal trade-offs between speed and accuracy. U-shaped cascaded LUTs provide global context while preserving local table efficiency (Zhang et al., 12 Oct 2025).
  • Zero Aware/Sparsity Patterns: In ternary (and N:M sparse) LLM accelerators, LUT indices and dataflow are co-designed to skip zero or "inactive" groups, with only the necessary activation lanes streamed to downstream compute (Huang et al., 17 Sep 2025). Dynamic sparsity-aware routers and partial sum buffers maintain utilization near theoretical peak levels even in real-time, low-workload scenarios.

The vector LUT-centric tensor layout is the pivotal enabler for modern ultra-low-bit, bandwidth-bound neural network inference. Through judicious packing and fusion of token, feature, and quantization axes into a table-centric memory layout, and by aligning both software and hardware design to this structure, it delivers dramatic improvements in throughput, compute density, and energy efficiency across both general-purpose CPUs and custom accelerators (Li et al., 6 Dec 2025, Mo et al., 12 Aug 2024, Zhang et al., 12 Oct 2025, Huang et al., 17 Sep 2025).
