Fast On-device LLM Inference with NPUs (2407.05858v2)
Abstract: On-device inference for LLMs, driven by increasing privacy concerns and advancements in mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present LLM.npu, the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. LLM.npu enhances NPU offloading efficiency by reconstructing the prompt and model at three levels: (1) at the prompt level, it divides variable-length prompts into multiple fixed-size chunks while maintaining data dependencies; (2) at the tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) at the block level, it schedules Transformer blocks out of order across the CPU/GPU and NPU based on their hardware affinity and accuracy sensitivity. Compared to competitive baselines, LLM.npu achieves 22.4$\times$ faster prefill speed and 30.7$\times$ energy savings on average, and up to 32.8$\times$ speedup in an end-to-end real-world application. For the first time, LLM.npu achieves more than 1,000 tokens/sec prefilling for a billion-parameter model.
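To make the prompt-level chunking concrete, below is a minimal Python sketch of the idea described in the abstract: splitting a variable-length prompt into fixed-size chunks (so the NPU can execute static-shaped graphs) while preserving data dependencies by carrying the KV cache across chunks. The names `CHUNK_LEN`, `PAD_ID`, and `model.forward_chunk` are illustrative assumptions, not LLM.npu's actual interface.

```python
from typing import List, Tuple

CHUNK_LEN = 256   # fixed chunk size, chosen here only for illustration
PAD_ID = 0        # padding token id (assumption)

def split_into_chunks(prompt_ids: List[int]) -> List[List[int]]:
    """Split a variable-length prompt into fixed-size, padded chunks."""
    chunks = []
    for start in range(0, len(prompt_ids), CHUNK_LEN):
        chunk = prompt_ids[start:start + CHUNK_LEN]
        chunk += [PAD_ID] * (CHUNK_LEN - len(chunk))  # pad the last chunk
        chunks.append(chunk)
    return chunks

def prefill(prompt_ids: List[int], model) -> Tuple[list, list]:
    """Run prefill chunk by chunk, reusing the KV cache so later chunks
    still attend to earlier tokens (data dependencies preserved)."""
    kv_cache: list = []   # grows as chunks are processed
    logits = None
    for chunk in split_into_chunks(prompt_ids):
        # Hypothetical model call: consumes a fixed-size chunk plus the
        # KV cache of all previous chunks, returns logits and the updated cache.
        logits, kv_cache = model.forward_chunk(chunk, kv_cache)
    return logits, kv_cache
```

Because every chunk has the same static shape, each forward pass can reuse a pre-compiled NPU graph; only the KV cache length varies between chunks.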
- Daliang Xu
- Hao Zhang
- Liming Yang
- Ruiqi Liu
- Gang Huang
- Mengwei Xu
- Xuanzhe Liu