HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices (2403.01164v1)

Published 2 Mar 2024 in cs.PF and cs.DC

Abstract: The emergence of LLMs has resulted in increasingly large model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory inference, but they often suffer from inefficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
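
The core idea is easiest to see in code. Below is a minimal sketch, assuming PyTorch with a CUDA device, of the kind of CPU-GPU work splitting and asynchronous weight transfer the abstract describes: part of an offloaded linear layer's output is computed on the CPU while the GPU's share of the weights is staged host-to-device on a side stream. The function name `heterogeneous_linear`, the `cpu_fraction` knob, and the column-split strategy are illustrative assumptions, not HeteGen's actual API.

```python
# Minimal sketch (not the authors' code): overlap the host-to-device weight
# transfer, the GPU matmul, and a CPU matmul for one offloaded linear layer.
import torch

def heterogeneous_linear(x_gpu: torch.Tensor,
                         w_cpu: torch.Tensor,
                         cpu_fraction: float = 0.25) -> torch.Tensor:
    """Compute x @ w.T with the output features split between CPU and GPU.

    x_gpu        : (batch, d_in) activations already resident on the GPU.
    w_cpu        : (d_out, d_in) weight matrix in pinned host memory
                   (nn.Linear layout).
    cpu_fraction : illustrative knob for the share of output features
                   computed on the CPU.
    """
    d_out = w_cpu.shape[0]
    split = int(d_out * (1.0 - cpu_fraction))

    # Stage the GPU's share of the weights on a side stream so the PCIe
    # copy overlaps with the CPU matmul below. Row slices of a pinned
    # tensor stay contiguous and pinned, so the copy can be asynchronous.
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        w_gpu = w_cpu[:split].to("cuda", non_blocking=True)

    # While the copy is in flight, the CPU computes its slice of the output.
    y_cpu = x_gpu.cpu() @ w_cpu[split:].T

    # Wait for the weights, then run the GPU's slice of the matmul.
    torch.cuda.current_stream().wait_stream(copy_stream)
    y_gpu = x_gpu @ w_gpu.T

    # Reassemble the full output on the GPU.
    return torch.cat([y_gpu, y_cpu.to("cuda")], dim=1)

# Toy usage: an 8x4096 batch through a 4096->4096 layer.
x = torch.randn(8, 4096, device="cuda")
w = torch.randn(4096, 4096).pin_memory()   # pinned memory enables async copies
y = heterogeneous_linear(x, w)             # -> (8, 4096)
```

In the paper's framework, the CPU share and the transfer schedule are chosen so that host compute, device compute, and I/O finish at roughly the same time; the fixed cpu_fraction=0.25 above is only a placeholder for that tuning.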

