
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (2303.06865v2)

Published 13 Mar 2023 in cs.LG, cs.AI, and cs.PF

Abstract: The high computational and memory requirements of LLM inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen

FlexGen: High-Throughput Generative Inference of LLMs with a Single GPU

The paper presents FlexGen, a framework for high-throughput generative inference of LLMs under tight resource constraints, such as a single commodity GPU. Because LLM inference normally demands substantial compute and memory, FlexGen is designed to deliver efficient inference without requiring multiple high-end accelerators.

Key Contributions

  1. Offloading Framework: FlexGen aggregates memory and computation across the GPU, CPU, and disk. It formulates offloading as a graph-traversal problem and, guided by a linear-programming-based cost model, searches for tensor placement and scheduling policies that minimize execution time. For throughput-oriented workloads, FlexGen adopts a zig-zag, block-wise computation schedule that reuses each layer's weights across a block of micro-batches, improving I/O efficiency and supporting larger batch sizes (a minimal sketch of this schedule follows the list).
  2. Quantization and Compression: A key innovation is compressing both the model weights and the KV cache to 4 bits using group-wise quantization. This sharply reduces memory usage and I/O traffic while keeping accuracy nearly equivalent to FP16, which in turn lets FlexGen run larger batches and reach higher throughput on commodity hardware (a quantization sketch also follows the list).
  3. Performance Benchmarking: FlexGen is evaluated against existing offloading systems such as DeepSpeed ZeRO-Inference and Hugging Face Accelerate. Running OPT-175B on a single 16 GB T4 GPU with limited CPU RAM and SSD capacity, FlexGen substantially outperforms these baselines, reaching a generation throughput of 1 token per second with an effective batch size of 144 and, with compression enabled, up to roughly 100x higher maximum throughput than the baseline systems.
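
As referenced in item 1, the following is a minimal Python sketch, not FlexGen's actual code, of why a zig-zag, block-wise schedule cuts weight I/O: each layer's weights are fetched once per block of micro-batches rather than once per micro-batch. The function names and the numbers in the example are illustrative assumptions.

```python
# Illustrative comparison of a row-by-row schedule with a zig-zag,
# block-wise schedule. All names and numbers are hypothetical; FlexGen's
# real scheduler also overlaps these loads with computation.

def row_by_row(num_layers, num_batches):
    """Baseline: run each micro-batch through all layers, reloading weights."""
    for b in range(num_batches):
        for l in range(num_layers):
            yield ("load_weights", l)
            yield ("compute", l, b)

def zig_zag_block(num_layers, num_batches, block_size):
    """Block-wise order: hold one layer's weights while sweeping a block of
    micro-batches, so one load is amortized over the whole block."""
    for start in range(0, num_batches, block_size):
        for l in range(num_layers):
            yield ("load_weights", l)
            for b in range(start, min(start + block_size, num_batches)):
                yield ("compute", l, b)

def weight_loads(schedule):
    return sum(1 for op in schedule if op[0] == "load_weights")

# With 96 layers and 32 micro-batches, the baseline issues 96 * 32 = 3072
# weight loads, while blocks of 8 micro-batches need only 96 * 4 = 384.
print(weight_loads(row_by_row(96, 32)))       # 3072
print(weight_loads(zig_zag_block(96, 32, 8))) # 384
```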
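
For item 2, here is a hedged sketch of group-wise, asymmetric 4-bit quantization of the kind applied to the weights and KV cache. The group size of 64 and the dense NumPy layout are illustrative assumptions, not the paper's actual kernels.

```python
import numpy as np

# Hedged sketch of group-wise, asymmetric 4-bit quantization; the group
# size and NumPy layout are illustrative, not FlexGen's kernels.

def quantize_4bit(x: np.ndarray, group_size: int = 64):
    """Quantize a tensor (size divisible by group_size) to 4-bit codes per group."""
    groups = x.reshape(-1, group_size).astype(np.float32)
    mn = groups.min(axis=1, keepdims=True)
    mx = groups.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0                    # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0.0, 1.0, scale)  # guard constant groups
    codes = np.clip(np.round((groups - mn) / scale), 0, 15).astype(np.uint8)
    return codes, mn, scale

def dequantize_4bit(codes, mn, scale):
    return codes.astype(np.float32) * scale + mn

# Round-trip a fake weight matrix and check the reconstruction error.
w = np.random.randn(4096, 64).astype(np.float16)
codes, mn, scale = quantize_4bit(w)
w_hat = dequantize_4bit(codes, mn, scale).reshape(w.shape)
print(np.abs(w_hat - w.astype(np.float32)).mean())  # small vs. unit-scale weights
```

In a real implementation two 4-bit codes would be packed per byte; the sketch keeps one code per uint8 for readability.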

Theoretical and Practical Implications

FlexGen points to a practical path for exploiting multi-tiered memory (GPU, CPU, and disk) in LLM inference. By pooling memory across devices and combining careful scheduling with compression, it addresses both compute and memory bottlenecks. The practical upshot is a scalable inference framework that reduces reliance on premium hardware without giving up throughput.
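
To make the idea of a cost-model-driven policy search concrete, the toy sketch below brute-forces weight-placement splits and a block size against a crude I/O cost estimate. It is only an illustration under made-up capacity and bandwidth numbers; the paper instead solves a linear program calibrated on hardware measurements.

```python
from itertools import product

# Toy stand-in for a placement search: choose what fraction of the weights
# to keep on GPU, CPU, and disk, plus a block size, to minimize an amortized
# per-block I/O cost. All constants below are made up.
GPU_MEM, CPU_MEM = 16e9, 200e9   # device capacities in bytes (illustrative)
WEIGHTS = 350e9                  # ~OPT-175B weights in FP16 (illustrative)
CPU_BW, DISK_BW = 12e9, 2e9      # CPU->GPU and disk->GPU bandwidth, bytes/s

def block_cost(w_gpu, w_cpu, block_size):
    """Amortized time to stream the non-resident weights for one layer pass."""
    w_disk = 1.0 - w_gpu - w_cpu
    if w_disk < 0 or w_gpu * WEIGHTS > GPU_MEM or w_cpu * WEIGHTS > CPU_MEM:
        return float("inf")      # infeasible placement
    io_time = (w_cpu * WEIGHTS) / CPU_BW + (w_disk * WEIGHTS) / DISK_BW
    return io_time / block_size  # one load serves the whole block

candidates = product((g / 10 for g in range(11)),
                     (c / 10 for c in range(11)),
                     [1, 2, 4, 8, 16])
best = min(candidates, key=lambda p: block_cost(*p))
print(best)  # e.g. (0.0, 0.5, 16) under these made-up numbers
```

The paper's actual search space is larger, covering the GPU batch size, the number of GPU batches per block, and the placement of activations and the KV cache in addition to the weights.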

Moreover, the approach has broader implications for how future AI systems trade off hardware utilization against application-level throughput. By targeting latency-insensitive, batched workloads, FlexGen matches real-world deployments that prioritize throughput over per-request latency, such as large-scale document processing and model benchmarking.

Future Directions

Building on these foundations, future work could integrate FlexGen with decentralized inference approaches, striking a balance between collaborative computing and offloading. Scheduling policies could also be refined further as hardware architectures evolve.

In conclusion, FlexGen marks a substantial advance in LLM inference under constrained hardware. By addressing both the algorithmic trade-offs and the practical deployment challenges, it sets a new reference point for high-throughput inference systems, broadens the range of hardware on which LLMs can be run, and invites further work on efficient inference techniques.

Authors (14)
  1. Ying Sheng
  2. Lianmin Zheng
  3. Binhang Yuan
  4. Zhuohan Li
  5. Max Ryabinin
  6. Daniel Y. Fu
  7. Zhiqiang Xie
  8. Beidi Chen
  9. Clark Barrett
  10. Joseph E. Gonzalez
  11. Percy Liang
  12. Christopher Ré
  13. Ion Stoica
  14. Ce Zhang
Citations (270)