
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (2312.12456v2)

Published 16 Dec 2023 in cs.LG and cs.OS

Abstract: This paper introduces PowerInfer, a high-speed LLM inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. The evaluation shows that PowerInfer significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU. For the OPT-30B model, PowerInfer achieves performance comparable to that of a high-end server-grade A100 GPU, reaching 82% of its token generation rate on a single consumer-grade RTX 4090 GPU.

Background and Current Challenges

LLMs have become critical tools in various applications, from creative writing to natural language processing. While LLMs have traditionally been run on powerful server-grade GPUs, the trend is shifting towards running them on personal computers with consumer-grade GPUs. The motivation behind this shift includes enhanced data privacy, the potential for model customization, and reduced costs. However, consumer GPUs face significant memory constraints when it comes to hosting the substantial parameter sets required by LLMs, making efficient local LLM inference an important yet challenging task.

PowerInfer: A Novel Inference Engine

PowerInfer introduces a GPU-CPU hybrid inference engine that exploits the locality of neuron activations in LLM inference. By distinguishing frequently activated 'hot' neurons from input-dependent 'cold' neurons, PowerInfer preloads hot neurons onto the GPU for fast access while leaving cold neurons to the CPU. The design incorporates adaptive predictors that estimate which neurons will activate for a given input, together with neuron-aware sparse operators that work at the granularity of individual neurons rather than entire matrices, skipping computation for neurons predicted to stay inactive. This approach makes better use of the available hardware, minimizes costly data transfers between GPU and CPU, and enables significantly faster inference without sacrificing model accuracy.
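To make the hot/cold split concrete, the following sketch is a minimal, hypothetical illustration (not the authors' kernels) of how a per-layer activation predictor could gate a hybrid feed-forward computation: only neurons predicted active are computed, with "GPU-resident" hot rows and "CPU-resident" cold rows handled separately. The function and argument names (`hybrid_ffn_row`, `hot_idx`, `predictor`) are assumptions made for illustration.

```python
import numpy as np

def hybrid_ffn_row(x, W_hot, W_cold, hot_idx, cold_idx, predictor):
    """Illustrative hot/cold neuron split for one FFN layer (a toy sketch,
    not PowerInfer's actual implementation).

    x         : (d_in,) input activation vector
    W_hot     : weight rows preloaded on the "GPU" (hot neurons)
    W_cold    : weight rows kept on the "CPU" (cold neurons)
    hot_idx   : original neuron indices of the hot rows
    cold_idx  : original neuron indices of the cold rows
    predictor : callable returning a boolean activation mask over all neurons
                (stands in for PowerInfer's adaptive activation predictor)
    """
    d_out = len(hot_idx) + len(cold_idx)
    out = np.zeros(d_out)

    active = predictor(x)  # which neurons are predicted to fire for this input
    hot_active = [i for i, n in enumerate(hot_idx) if active[n]]
    cold_active = [i for i, n in enumerate(cold_idx) if active[n]]

    # "GPU" side: compute only the hot rows predicted active.
    out[np.asarray(hot_idx)[hot_active]] = W_hot[hot_active] @ x

    # "CPU" side: row-wise sparse compute for active cold rows; rows predicted
    # inactive are never touched, which is the neuron-level sparsity exploited.
    out[np.asarray(cold_idx)[cold_active]] = W_cold[cold_active] @ x

    return np.maximum(out, 0.0)  # ReLU keeps inactive outputs at zero
```

The key design point mirrored here is that the split is per neuron rather than per layer, so the GPU holds only the small, consistently reused subset of weights while the CPU serves the long tail on demand.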

Implementation and Compatibility

The online inference engine extends an existing LLM framework with additional C++ and CUDA implementations, while the offline component uses a Python-based profiler and solver to categorize neurons and construct a neuron placement policy. PowerInfer's flexible configuration supports a range of LLM families and GPU types, from the high-end NVIDIA RTX 4090 to the older RTX 2080 Ti. Notably, even on consumer-grade GPUs, PowerInfer achieves performance close to that of server-grade GPUs without sacrificing accuracy.
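As a rough illustration of the offline placement step, the sketch below greedily assigns the most frequently activated neurons to the GPU until a memory budget is exhausted. The paper's solver optimizes placement more carefully; this greedy version, and names such as `place_neurons` and `gpu_budget_bytes`, are simplifying assumptions.

```python
def place_neurons(activation_freq, bytes_per_neuron, gpu_budget_bytes):
    """Toy neuron-placement policy (a stand-in for PowerInfer's offline solver).

    activation_freq  : dict {neuron_id: how often the neuron fired during profiling}
    bytes_per_neuron : memory cost of keeping one neuron's weights on the GPU
    gpu_budget_bytes : GPU memory available for neuron weights

    Returns (gpu_neurons, cpu_neurons): hottest neurons go to the GPU until the
    budget is exhausted; everything else stays on the CPU.
    """
    gpu_neurons, cpu_neurons = [], []
    used = 0
    # Sort by observed activation frequency, hottest first.
    for nid, _freq in sorted(activation_freq.items(), key=lambda kv: kv[1], reverse=True):
        if used + bytes_per_neuron <= gpu_budget_bytes:
            gpu_neurons.append(nid)
            used += bytes_per_neuron
        else:
            cpu_neurons.append(nid)
    return gpu_neurons, cpu_neurons


# Example: 8 neurons, room for 3 on the GPU.
freqs = {i: f for i, f in enumerate([0.91, 0.05, 0.80, 0.02, 0.77, 0.10, 0.01, 0.40])}
gpu, cpu = place_neurons(freqs, bytes_per_neuron=4096, gpu_budget_bytes=3 * 4096)
print(gpu)  # [0, 2, 4] -- the most frequently activated neurons
```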

Evaluations and Insights

Performance evaluations show that PowerInfer outpaces existing alternatives, delivering considerable speedups in token generation for both quantized and non-quantized models. Moreover, PowerInfer maintains near-identical accuracy across a range of LLMs and tasks, confirming that the efficiency gains do not come at the expense of output quality.

Conclusion

The paper presents PowerInfer, an inference system that harnesses the power-law distribution of neuron activations to optimize local LLM deployment. By strategically splitting the workload between GPU and CPU and exploiting computational locality, PowerInfer demonstrates that LLMs can be served effectively on personal computers.

Authors (4)
  1. Yixin Song
  2. Zeyu Mi
  3. Haotong Xie
  4. Haibo Chen