Fast On-device LLM Inference with NPUs (2407.05858v2)
Abstract: On-device inference for LLMs, driven by increasing privacy concerns and advancements in mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present LLM.npu, the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. LLM.npu enhances NPU offloading efficiency by reconstructing the prompt and model at three levels: (1) at the prompt level, it divides variable-length prompts into multiple fixed-size chunks while maintaining data dependencies; (2) at the tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) at the block level, it schedules Transformer blocks out of order across the CPU/GPU and NPU based on their hardware affinity and accuracy sensitivity. Compared to competitive baselines, LLM.npu achieves 22.4$\times$ faster prefill speed and 30.7$\times$ energy savings on average, and up to 32.8$\times$ speedup in an end-to-end real-world application. For the first time, LLM.npu achieves more than 1,000 tokens/sec prefilling for a billion-parameter model.
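To make the prompt-level chunking concrete, below is a minimal Python sketch of the idea described in the abstract: splitting a variable-length prompt into fixed-size chunks (so the NPU can execute static-shaped graphs) while preserving data dependencies by carrying the KV cache across chunks. The names `CHUNK_LEN`, `PAD_ID`, and `model.forward_chunk` are illustrative assumptions, not LLM.npu's actual interface.

```python
from typing import List, Tuple

CHUNK_LEN = 256   # fixed chunk size, chosen here only for illustration
PAD_ID = 0        # padding token id (assumption)

def split_into_chunks(prompt_ids: List[int]) -> List[List[int]]:
    """Split a variable-length prompt into fixed-size, padded chunks."""
    chunks = []
    for start in range(0, len(prompt_ids), CHUNK_LEN):
        chunk = prompt_ids[start:start + CHUNK_LEN]
        chunk += [PAD_ID] * (CHUNK_LEN - len(chunk))  # pad the last chunk
        chunks.append(chunk)
    return chunks

def prefill(prompt_ids: List[int], model) -> Tuple[list, list]:
    """Run prefill chunk by chunk, reusing the KV cache so later chunks
    still attend to earlier tokens (data dependencies preserved)."""
    kv_cache: list = []   # grows as chunks are processed
    logits = None
    for chunk in split_into_chunks(prompt_ids):
        # Hypothetical model call: consumes a fixed-size chunk plus the
        # KV cache of all previous chunks, returns logits and the updated cache.
        logits, kv_cache = model.forward_chunk(chunk, kv_cache)
    return logits, kv_cache
```

Because every chunk has the same static shape, each forward pass can reuse a pre-compiled NPU graph; only the KV cache length varies between chunks.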
- Daliang Xu
- Hao Zhang
- Liming Yang
- Ruiqi Liu
- Gang Huang
- Mengwei Xu
- Xuanzhe Liu