Fast On-device LLM Inference with NPUs (2407.05858v2)

Published 8 Jul 2024 in cs.AI

Abstract: On-device inference for LLMs, driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present LLM.npu, the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. LLM.npu enhances NPU offloading efficiency by re-constructing the prompt and model at three levels: (1) At prompt level, it divides variable-length prompts into multiple fixed-sized chunks while maintaining data dependencies; (2) At tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) At block level, it schedules Transformer blocks in an out-of-order manner to the CPU/GPU and NPU based on their hardware affinity and sensitivity to accuracy. Compared to competitive baselines, LLM.npu achieves 22.4x faster prefill speed and 30.7x energy savings on average, and up to 32.8x speedup in an end-to-end real-world application. For the first time, LLM.npu achieves more than 1,000 tokens/sec prefilling for a billion-sized model.
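To make the prompt-level idea concrete, below is a minimal, hypothetical Python sketch of chunked prefill: a variable-length prompt is split into fixed-size chunks that are processed in order, so each chunk can depend on the KV cache built from earlier ones. The chunk size, padding scheme, and function names are illustrative assumptions, not the paper's actual implementation or API.

```python
# Illustrative sketch of prompt-level chunking (not the authors' code).
# Fixed-shape chunks let an NPU graph be compiled once and reused, while
# processing chunks in order preserves data dependencies via the KV cache.

from typing import List, Tuple

CHUNK_SIZE = 256  # hypothetical fixed chunk length expected by the NPU graph


def pad_to_chunk(tokens: List[int], pad_id: int = 0) -> List[int]:
    """Right-pad a partial chunk so every invocation sees a fixed shape."""
    return tokens + [pad_id] * (CHUNK_SIZE - len(tokens))


def chunked_prefill(prompt_ids: List[int]) -> List[Tuple[List[int], int]]:
    """Split the prompt into fixed-size chunks, keeping their order so the
    KV cache built from earlier chunks is available to later ones."""
    chunks = []
    for start in range(0, len(prompt_ids), CHUNK_SIZE):
        piece = prompt_ids[start:start + CHUNK_SIZE]
        valid = len(piece)  # number of real tokens before padding
        chunks.append((pad_to_chunk(piece), valid))
    return chunks


if __name__ == "__main__":
    prompt = list(range(600))  # a 600-token prompt
    for i, (chunk, valid) in enumerate(chunked_prefill(prompt)):
        # In a real system each fixed-shape chunk would be dispatched to the
        # NPU; here we only report how the prompt was split.
        print(f"chunk {i}: {valid} valid tokens, padded to {len(chunk)}")
```

In this sketch the padded tokens would be masked out during attention; the paper's tensor-level outlier handling and block-level out-of-order scheduling are separate mechanisms not shown here.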

Authors (7)
  1. Daliang Xu (9 papers)
  2. Hao Zhang (947 papers)
  3. Liming Yang (28 papers)
  4. Ruiqi Liu (51 papers)
  5. Gang Huang (86 papers)
  6. Mengwei Xu (62 papers)
  7. Xuanzhe Liu (59 papers)
Citations (5)