- The paper introduces a unified table lookup for low-bit LLM inference that eliminates CPU fallback and improves NPU efficiency.
- It fuses dequantization with GEMM and GEMV in a pipelined execution scheme, achieving up to 15× speedup and 84% energy savings.
- The approach supports flexible quantization formats while maintaining high accuracy, outperforming previous state-of-the-art methods.
T-MAN: Unified Table-Lookup for Low-Bit LLM Inference on NPUs
Introduction and Motivation
The T-MAN system addresses the inefficiencies encountered when deploying low-bit quantized LLMs on consumer devices equipped with Neural Processing Units (NPUs). Although NPUs promise high throughput for matrix operations, existing LLM inference workflows often resort to hybrid solutions, using the NPU for prompt processing (prefill) and the CPU for token generation (decoding). This split stems from NPUs’ hardware specialization: they optimize dense, high-precision GEMMs but lack efficient support for the memory-bound, element-wise operations (notably dequantization) inherent to low-bit decoding. T-MAN proposes a software solution that leverages unified table-lookup operations to enable efficient, accurate, end-to-end low-bit LLM execution entirely on NPUs, supporting state-of-the-art quantization formats without compromising accuracy or energy efficiency.
Figure 1: T-MAN eliminates redundant weight copies and enables both prefill and decoding on the NPU via table lookup, avoiding CPU fallback and additional memory use.
System Design: Unified Table Lookup and Execution Pipeline
Challenges in NPU-based Low-Bit Inference
Current practice either (a) aligns quantization with NPU-native formats (e.g., per-channel INT4) at the cost of significant accuracy loss, or (b) partitions the workload across CPU and NPU, incurring excess energy use, memory overhead, and complexity from duplicated weights (Figure 1). Existing bit-serial table-lookup solutions succeed in eliminating dequantization, yet suffer from suboptimal utilization of NPU memory hierarchies and vector/matrix units (cf. [t-mac], [luttensorcore]).
T-MAN’s Unified Table-Lookup Abstraction
The central insight of T-MAN is that in low-bit regimes, all necessary dequantization and multiplication results can be pre-computed and stored in compact lookup tables (LUTs), so that both prefill (GEMM) and decoding (GEMV) phases can be expressed as table-lookup-dominated computations.
T-MAN unifies these two paradigms with a single weight layout and precomputed tables, avoiding both data duplication and costly bit-shuffling or repacking.
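The following is a minimal numpy sketch of the bit-serial table-lookup idea (in the spirit of T-MAC-style LUT kernels, not T-MAN's actual NPU code): activation partial sums for every g-bit pattern are precomputed once per group, after which each weight bit-plane contributes one lookup per group instead of a multiply-accumulate chain. Function names, the group size g, and the unsigned-weight layout are illustrative assumptions.

```python
import numpy as np

def build_act_tables(x, g=4):
    """Precompute, per group of g activations, the partial sums for every
    g-bit weight pattern (the table-lookup view of a dot product)."""
    num_groups = len(x) // g
    tables = np.zeros((num_groups, 1 << g), dtype=np.float32)
    for grp in range(num_groups):
        xs = x[grp * g:(grp + 1) * g]
        for pattern in range(1 << g):
            tables[grp, pattern] = sum(xs[i] for i in range(g) if (pattern >> i) & 1)
    return tables

def lut_gemv(idx, scales, tables):
    """GEMV with unsigned b-bit weights expressed purely as table lookups.
    idx:    [N, num_bits, num_groups] g-bit pattern of each weight bit-plane
    scales: [N] per-output-channel scales (real kernels also fold in zero-points)
    """
    N, num_bits, num_groups = idx.shape
    y = np.zeros(N, dtype=np.float32)
    for n in range(N):
        for b in range(num_bits):
            # one lookup per (bit-plane, activation group), weighted by 2**b
            y[n] += (1 << b) * tables[np.arange(num_groups), idx[n, b]].sum()
    return y * scales

def pack(W, num_bits, g=4):
    """Pack integer weights (values in [0, 2**num_bits)) into per-group,
    per-bit-plane lookup indices."""
    N, K = W.shape
    idx = np.zeros((N, num_bits, K // g), dtype=np.int64)
    for b in range(num_bits):
        bits = (W >> b) & 1                       # [N, K] bit-plane
        for i in range(g):
            idx[:, b, :] |= bits[:, i::g] << i    # pack g bits per group
    return idx

# Check the LUT path against a reference float GEMV.
rng = np.random.default_rng(0)
N, K, num_bits = 8, 64, 2
W = rng.integers(0, 1 << num_bits, size=(N, K))
x = rng.standard_normal(K).astype(np.float32)
scales = rng.random(N).astype(np.float32)
assert np.allclose(lut_gemv(pack(W, num_bits), scales, build_act_tables(x)),
                   (W * scales[:, None]) @ x, atol=1e-4)
```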
Data Layout, Tiling, and Pipelining
Two-Level LUT-based Dequantization and Layout
T-MAN supports per-group and per-block quantization, critical for SOTA LLM inference accuracy. Its design fuses bit-repacking, integer-to-float conversion, and scaling/zero-point application into two sequential table lookups. This approach drastically reduces per-token dequantization overhead, as expensive floating-point math is replaced by LUT retrieval.
Figure 4: Fused two-level LUT dequantization enables a single memory-efficient table lookup to substitute for multiple bitwise and floating-point operations.
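A minimal sketch of the two-level idea, assuming per-group INT4 weights with a scale and zero-point (T-MAN's exact table layout is not reproduced here): the first lookup replaces nibble unpacking, while the second folds integer-to-float conversion, scaling, and zero-point subtraction into a 16-entry per-group table.

```python
import numpy as np

# Level-1 LUT: maps one packed byte (two INT4 codes) to its two integer values,
# replacing per-element shift/mask work with a single lookup.
UNPACK_LUT = np.array([[b & 0xF, b >> 4] for b in range(256)], dtype=np.int32)

def dequant_two_level(packed, scales, zeros, group_size=32):
    """Dequantize per-group INT4 weights with two sequential table lookups.
    packed: [N, K//2] uint8, two 4-bit codes per byte
    scales, zeros: [N, K // group_size] per-group quantization parameters
    """
    N, half_K = packed.shape
    K = half_K * 2
    codes = UNPACK_LUT[packed].reshape(N, K)      # lookup 1: bit repacking
    out = np.empty((N, K), dtype=np.float32)
    for grp in range(K // group_size):
        # Lookup 2: a 16-entry per-group table folds int->float conversion,
        # scaling, and zero-point subtraction into one gather.
        table = (np.arange(16, dtype=np.float32)[None, :]
                 - zeros[:, grp:grp + 1]) * scales[:, grp:grp + 1]   # [N, 16]
        cols = slice(grp * group_size, (grp + 1) * group_size)
        out[:, cols] = np.take_along_axis(table, codes[:, cols], axis=1)
    return out

# Quick self-check against direct dequantization.
rng = np.random.default_rng(1)
N, K, gs = 4, 64, 32
packed = rng.integers(0, 256, size=(N, K // 2), dtype=np.uint8)
scales = rng.random((N, K // gs)).astype(np.float32)
zeros = rng.integers(0, 16, size=(N, K // gs)).astype(np.float32)
ref_codes = np.stack([packed & 0xF, packed >> 4], axis=-1).reshape(N, K)
ref = (ref_codes - np.repeat(zeros, gs, axis=1)) * np.repeat(scales, gs, axis=1)
assert np.allclose(dequant_two_level(packed, scales, zeros, gs), ref)
```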
Unified Concurrency-Aware Tiling
T-MAN selects a single memory/compute tiling configuration that is efficient for both prefill (GEMM on the matrix core) and decoding (GEMV on the vector core), so one weight layout serves both phases.
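As an illustration only, the sketch below searches for a tile shape that satisfies both engines' constraints under an on-chip memory budget; all capacities, alignment multiples, and the reuse metric are hypothetical placeholders, not Snapdragon NPU parameters or T-MAN's actual cost model.

```python
import itertools

TCM_BYTES = 512 * 1024       # assumed on-chip TCM budget (placeholder)
MATRIX_TILE_ALIGN = 32       # assumed matrix-core tile multiple (placeholder)
VECTOR_WIDTH = 64            # assumed vector-core lane count (placeholder)

def pick_unified_tile(bits=4, lut_bytes_per_group=64, group_size=32):
    """Pick one (tile_n, tile_k) usable by both the GEMM and GEMV paths."""
    best = None
    for tile_n, tile_k in itertools.product(range(32, 513, 32), range(64, 2049, 64)):
        if tile_n % MATRIX_TILE_ALIGN or tile_k % VECTOR_WIDTH:
            continue  # must map cleanly onto both engines
        weight_bytes = tile_n * tile_k * bits // 8
        lut_bytes = (tile_k // group_size) * lut_bytes_per_group
        act_bytes = tile_k * 2          # fp16 activations
        out_bytes = tile_n * 4          # fp32 accumulators
        if weight_bytes + lut_bytes + act_bytes + out_bytes > TCM_BYTES:
            continue
        reuse = tile_n * tile_k          # crude proxy for reuse per DMA transfer
        if best is None or reuse > best[0]:
            best = (reuse, tile_n, tile_k)
    return best[1], best[2]
```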
Pipelined Execution to Amortize Overhead
T-MAN employs a three-stage pipeline: asynchronous DMA fetch, vector-core dequantization via LUT, and matrix-core multiplication. This design effectively overlaps communication and computation, minimizing NPU idle time.
Figure 8: DMA, vector dequantization, and matrix multiplication pipeline stages overlap to maximize NPU utilization and mask slow memory or dequantization steps.
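A schematic of the pipelining pattern, with Python threads standing in for the DMA engine and vector core (the real implementation issues asynchronous DMA descriptors and dispatches work to the NPU's vector and matrix units; the function names here are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_linear(fetch_tile, dequant_lut, matmul, num_tiles):
    """Illustrative 3-stage software pipeline (not T-MAN's kernel code):
    while the matrix core multiplies tile t-1, the vector core dequantizes
    tile t via LUT and the DMA engine fetches tile t+1."""
    assert num_tiles >= 1
    partials = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        fetch_f = pool.submit(fetch_tile, 0)        # prologue: fetch tile 0
        dequant_f = None
        for t in range(num_tiles + 1):
            next_fetch = pool.submit(fetch_tile, t + 1) if t + 1 < num_tiles else None
            next_dequant = pool.submit(dequant_lut, fetch_f.result()) if t < num_tiles else None
            if dequant_f is not None:
                partials.append(matmul(dequant_f.result()))   # "matrix core" work
            fetch_f, dequant_f = next_fetch, next_dequant
    return sum(partials)
```

On real hardware, double-buffered TCM tiles and hardware queues replace the futures shown here; the structural point is that every iteration keeps all three stages busy on different tiles.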
Optimized Table Lookup Decoding
During decoding, the LUT-based kernel is vectorized along the output channel, batching table lookups to maximize per-token throughput. T-MAN maps its register spill buffer to on-chip TCM, so intermediate results avoid costly L2 cache spills. This strategy is tuned for the wide vectors and deep tiling hierarchies found in modern NPUs.
Figure 10: T-MAN’s mapping of memory hierarchy for LUT decoding leverages on-chip memory for spill buffer to avoid slow cache interactions.
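A sketch of the decoding inner loop, reusing the illustrative per-group activation tables from the earlier GEMV example (a single bit-plane shown for brevity): one gather serves a whole block of output channels, and the small accumulator array stands in for the TCM-resident spill buffer. The block size and layout are assumptions, not T-MAN's tuned values.

```python
import numpy as np

def lut_decode_block(tables, weight_idx, scales, block=64):
    """Decode (GEMV) with table lookups vectorized along the output channel.
    tables:     [num_groups, 16]  per-group activation partial sums
    weight_idx: [N, num_groups]   4-bit group index per output channel
    scales:     [N]               per-output-channel scales
    """
    N, num_groups = weight_idx.shape
    y = np.empty(N, dtype=np.float32)
    for n0 in range(0, N, block):
        n1 = min(n0 + block, N)
        acc = np.zeros(n1 - n0, dtype=np.float32)   # spill-buffer analogue in "TCM"
        for grp in range(num_groups):
            # one gather serves `block` output channels at once
            acc += tables[grp, weight_idx[n0:n1, grp]]
        y[n0:n1] = acc * scales[n0:n1]
    return y
```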
Empirical Evaluation
T-MAN is evaluated on OnePlus smartphones with Snapdragon NPUs using Llama3, Qwen3, and BitNet models at INT2/INT4 precision.
Key results:
- Prefill (GEMM): up to 1.4× speedup vs. prior SOTA (LLM.npu, T-MAC, QNN), matching or exceeding vendor-optimized kernels even for per-block quantization (Figure 12).
- Decoding (GEMV): up to 3.1× speedup (and up to 8× kernel speedup) compared to QNN and 3.8× over LLM.npu, despite supporting richer quantization formats (Figure 13).
- Energy savings: up to 84% energy reduction compared to hybrid NPU-CPU solutions; 25% lower energy vs. QNN for decoding, attributable to faster inference and exclusive use of the efficient NPU cores.
Figure 5: mpGEMM performance: T-MAN achieves comparable or superior throughput to QNN, outperforming NPU-CPU hybrids and CPU-only baselines.
Figure 7: Decoding throughput—T-MAN sustains up to 49.1 tokens/s on BitNet-2B, with substantial speedup over QNN and CPU/NPU baselines.
Figure 9: Prefill throughput—end-to-end, T-MAN achieves 15× speedup vs. CPU-only frameworks by leveraging efficient NPU hardware.
Ablation and Accuracy Study
- LUT dequantization brings a 10× speedup over conventional floating-point dequantization and a 4.9× speedup over loading pre-dequantized weights (Figure 15).
- Pipelined execution yields a 1.5× performance boost over non-overlapped execution (Figure 16).
- Accuracy: T-MAN with per-block INT2 quantization outperforms QNN with per-channel INT4 in perplexity on standard evaluation datasets—48% perplexity reduction on Qwen3-8B and 32% on Llama-3.1-8B.
Practical Implications and Limitations
T-MAN demonstrates that table-lookup-based low-bit inference can bridge the flexibility-performance gap that has stymied deployment of quantized LLMs on highly specialized NPUs. It enables:
- Unified model formats supporting SOTA quantization without data redundancy
- End-to-end NPU execution for latency and energy improvement
- Flexibility for future quantization research beyond vendor-constrained choices
However, T-MAN’s full benefits rely on programmable NPUs with sufficient register file and on-chip memory capacity. Applicability is reduced on platforms with only opaque high-level interfaces (e.g., Apple NPU family), or those lacking support for low-latency table lookup operations.
Theoretical and Future Directions
T-MAN demonstrates the value of software-hardware co-design, even in the absence of direct hardware support for arbitrary quantization formats. If NPU vendors expose matrix/tensor instructions for 2- and 4-bit formats, further kernel optimizations are possible, including batch-sequential and long-context improvements. As on-chip memory grows and register spills become even more expensive, hierarchical LUT and spill management strategies will become critical.
The demonstrated superiority of per-block/group quantization for LLMs—enabled by T-MAN’s design—should make future NPU architectures favor greater programmability and LUT-friendliness, broadening support for advanced quantization and compression strategies.
Conclusion
T-MAN delivers the first lossless, fully NPU-resident, low-bit LLM inference pipeline, supporting flexible quantization strategies without sacrificing performance or accuracy. Its efficient table-lookup mechanism and unified data management yield strong improvements in throughput and energy relative to vendor and academic baselines, positioning it as a key reference point for the next generation of LLM deployment systems on edge and mobile NPUs.