HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices (2403.01164v1)
Abstract: The recent emergence of LLMs has led to increasingly large model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to enable low-memory inference, but they often suffer from poor efficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a principled framework for heterogeneous parallel computing across CPUs and GPUs. Building on this framework, HeteGen employs heterogeneous parallelism and asynchronous overlap to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
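The abstract only sketches the approach at a high level, but the core idea of overlapping CPU computation with asynchronous CPU-GPU weight transfers can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration and not HeteGen's actual implementation: the `HeteroLinear` wrapper and its `split_ratio` parameter are hypothetical names introduced here. It splits a linear layer's weight between host and device, launches the host-to-device copy on a side CUDA stream, and runs the CPU share of the matrix multiply while that copy is in flight.

```python
# Minimal sketch (not HeteGen's actual code) of heterogeneous parallel
# inference with asynchronous overlap: part of a linear layer is computed
# on the CPU while the remaining weights are streamed to the GPU.

import torch


class HeteroLinear:
    def __init__(self, weight_cpu: torch.Tensor, split_ratio: float = 0.5):
        # Partition the output dimension: one slice stays on the host for
        # CPU compute, the other is pinned and streamed to the GPU on demand.
        # split_ratio is an illustrative knob, not a HeteGen parameter name.
        n_out = weight_cpu.shape[0]
        cut = int(n_out * split_ratio)
        self.w_cpu = weight_cpu[:cut].contiguous()                      # computed on CPU
        self.w_gpu_host = weight_cpu[cut:].contiguous().pin_memory()    # staged for async H2D copy
        self.copy_stream = torch.cuda.Stream()

    def forward(self, x_gpu: torch.Tensor) -> torch.Tensor:
        # 1) Kick off the weight transfer on a side stream so it overlaps
        #    with the CPU work below.
        with torch.cuda.stream(self.copy_stream):
            w_gpu = self.w_gpu_host.to("cuda", non_blocking=True)

        # 2) Meanwhile, compute the CPU share of the output on the host.
        x_cpu = x_gpu.to("cpu")
        y_cpu = x_cpu @ self.w_cpu.t()

        # 3) Compute the GPU share once the copy stream has delivered the weights.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        y_gpu = x_gpu @ w_gpu.t()

        # 4) Concatenate the two partial outputs on the GPU.
        return torch.cat([y_cpu.to("cuda"), y_gpu], dim=-1)
```

In practice, the split between CPU and GPU work would need to be tuned so that the CPU compute time roughly matches the transfer time it hides; balancing these two costs is the kind of scheduling decision a heterogeneous framework like HeteGen must make.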