
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (2312.12456v2)

Published 16 Dec 2023 in cs.LG and cs.OS

Abstract: This paper introduces PowerInfer, a high-speed LLM inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. The evaluation shows that PowerInfer significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU. For the OPT-30B model, PowerInfer achieves performance comparable to that of a high-end server-grade A100 GPU, reaching 82% of its token generation rate on a single consumer-grade RTX 4090 GPU.

Background and Current Challenges

LLMs have become critical tools in various applications, from creative writing to natural language processing. While LLMs have traditionally been run on powerful server-grade GPUs, the trend is shifting towards running them on personal computers with consumer-grade GPUs. The motivation behind this shift includes enhanced data privacy, the potential for model customization, and reduced costs. However, consumer GPUs face significant memory constraints when it comes to hosting the substantial parameter sets required by LLMs, making efficient local LLM inference an important yet challenging task.

PowerInfer: A Novel Inference Engine

PowerInfer introduces a GPU-CPU hybrid inference engine that exploits the locality of neuron activations in LLM inference. By distinguishing frequently activated 'hot' neurons from input-dependent 'cold' neurons, PowerInfer preloads hot neurons onto the GPU for fast access while leaving cold neurons to the CPU. The design incorporates adaptive predictors that estimate which neurons will activate for a given input, together with neuron-aware sparse operators that work at the granularity of individual neurons rather than entire matrices, skipping computation for neurons predicted to stay inactive. This approach makes better use of the available hardware, minimizes costly data transfers between GPU and CPU, and enables significantly faster inference without sacrificing model accuracy.
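To make the hot/cold split concrete, the following sketch is a minimal, hypothetical illustration (not the authors' kernels) of how a per-layer activation predictor could gate a hybrid feed-forward computation: only neurons predicted active are computed, with "GPU-resident" hot rows and "CPU-resident" cold rows handled separately. The function and argument names (`hybrid_ffn_row`, `hot_idx`, `predictor`) are assumptions made for illustration.

```python
import numpy as np

def hybrid_ffn_row(x, W_hot, W_cold, hot_idx, cold_idx, predictor):
    """Illustrative hot/cold neuron split for one FFN layer (a toy sketch,
    not PowerInfer's actual implementation).

    x         : (d_in,) input activation vector
    W_hot     : weight rows preloaded on the "GPU" (hot neurons)
    W_cold    : weight rows kept on the "CPU" (cold neurons)
    hot_idx   : original neuron indices of the hot rows
    cold_idx  : original neuron indices of the cold rows
    predictor : callable returning a boolean activation mask over all neurons
                (stands in for PowerInfer's adaptive activation predictor)
    """
    d_out = len(hot_idx) + len(cold_idx)
    out = np.zeros(d_out)

    active = predictor(x)  # which neurons are predicted to fire for this input
    hot_active = [i for i, n in enumerate(hot_idx) if active[n]]
    cold_active = [i for i, n in enumerate(cold_idx) if active[n]]

    # "GPU" side: compute only the hot rows predicted active.
    out[np.asarray(hot_idx)[hot_active]] = W_hot[hot_active] @ x

    # "CPU" side: row-wise sparse compute for active cold rows; rows predicted
    # inactive are never touched, which is the neuron-level sparsity exploited.
    out[np.asarray(cold_idx)[cold_active]] = W_cold[cold_active] @ x

    return np.maximum(out, 0.0)  # ReLU keeps inactive outputs at zero
```

The key design point mirrored here is that the split is per neuron rather than per layer, so the GPU holds only the small, consistently reused subset of weights while the CPU serves the long tail on demand.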

Implementation and Compatibility

The online inference engine extends an existing LLM framework with additional C++ and CUDA implementations, while the offline component uses a Python-based profiler and solver to categorize neurons and construct a neuron placement policy. PowerInfer's flexible configuration supports a range of LLM families and GPU types, from the high-end NVIDIA RTX 4090 to the older RTX 2080 Ti. Notably, even on consumer-grade GPUs, PowerInfer achieves performance close to that of server-grade GPUs without sacrificing accuracy.
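As a rough illustration of the offline placement step, the sketch below greedily assigns the most frequently activated neurons to the GPU until a memory budget is exhausted. The paper's solver optimizes placement more carefully; this greedy version, and names such as `place_neurons` and `gpu_budget_bytes`, are simplifying assumptions.

```python
def place_neurons(activation_freq, bytes_per_neuron, gpu_budget_bytes):
    """Toy neuron-placement policy (a stand-in for PowerInfer's offline solver).

    activation_freq  : dict {neuron_id: how often the neuron fired during profiling}
    bytes_per_neuron : memory cost of keeping one neuron's weights on the GPU
    gpu_budget_bytes : GPU memory available for neuron weights

    Returns (gpu_neurons, cpu_neurons): hottest neurons go to the GPU until the
    budget is exhausted; everything else stays on the CPU.
    """
    gpu_neurons, cpu_neurons = [], []
    used = 0
    # Sort by observed activation frequency, hottest first.
    for nid, _freq in sorted(activation_freq.items(), key=lambda kv: kv[1], reverse=True):
        if used + bytes_per_neuron <= gpu_budget_bytes:
            gpu_neurons.append(nid)
            used += bytes_per_neuron
        else:
            cpu_neurons.append(nid)
    return gpu_neurons, cpu_neurons


# Example: 8 neurons, room for 3 on the GPU.
freqs = {i: f for i, f in enumerate([0.91, 0.05, 0.80, 0.02, 0.77, 0.10, 0.01, 0.40])}
gpu, cpu = place_neurons(freqs, bytes_per_neuron=4096, gpu_budget_bytes=3 * 4096)
print(gpu)  # [0, 2, 4] -- the most frequently activated neurons
```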

Evaluations and Insights

Performance evaluations show that PowerInfer outpaces existing alternatives, delivering considerable speedups in token generation for both quantized and non-quantized models. Moreover, PowerInfer maintains near-identical accuracy across a range of LLMs and tasks, confirming that the efficiency gains do not come at the expense of output quality.

Conclusion

The paper presents PowerInfer, an inference system that harnesses the power-law distribution of neuron activations to optimize local LLM deployment. By strategically splitting the workload between GPU and CPU and exploiting computational locality, PowerInfer demonstrates that LLMs can be served effectively on personal computers.

Authors (4)
  1. Yixin Song
  2. Zeyu Mi
  3. Haotong Xie
  4. Haibo Chen