FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning (2402.18789v1)

Published 29 Feb 2024 in cs.DC, cs.CL, and cs.LG

Abstract: Parameter-efficient finetuning (PEFT) is a widely used technique to adapt LLMs for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.
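The co-serving idea lends itself to a short illustration. The sketch below is a minimal, hypothetical rendition of token-level co-serving under an assumed fixed per-iteration token budget: latency-sensitive inference tokens are admitted first, and finetuning sequences are split at token granularity to backfill leftover capacity. `Request`, `plan_iteration`, and the budget value are illustrative names, not FlexLLM's actual API.

```python
# Hypothetical sketch of token-level co-serving: each serving iteration has a
# fixed token budget; inference decode tokens are admitted first, and
# finetuning tokens backfill the remaining slots. All names here are
# illustrative, not FlexLLM's real interface.

from dataclasses import dataclass


@dataclass
class Request:
    kind: str          # "inference" or "finetune"
    tokens: list[int]  # token ids still to be processed
    offset: int = 0    # progress within the sequence


def plan_iteration(requests, token_budget=2048):
    """Return (request, start, end) slices whose total size <= token_budget."""
    batch, used = [], 0
    # Latency-sensitive inference decoding gets priority:
    # one token per active request.
    for r in (r for r in requests if r.kind == "inference"):
        if used < token_budget and r.offset < len(r.tokens):
            batch.append((r, r.offset, r.offset + 1))
            used += 1
    # Finetuning work is split at token granularity to fill spare capacity,
    # mirroring the token-level decomposition described in the abstract.
    for r in (r for r in requests if r.kind == "finetune"):
        if used >= token_budget:
            break
        take = min(token_budget - used, len(r.tokens) - r.offset)
        if take > 0:
            batch.append((r, r.offset, r.offset + take))
            used += take
    return batch


reqs = [Request("inference", list(range(8))),
        Request("finetune", list(range(4096)))]
for req, start, end in plan_iteration(reqs, token_budget=64):
    req.offset = end  # advance progress; a real system would also run the model
    print(req.kind, start, end)
```

Under this kind of policy, finetuning progress degrades gracefully as inference load grows rather than stalling outright, which is consistent with the paper's claim that FlexLLM preserves over 80% of peak finetuning throughput under heavy inference workloads.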

Authors (6)
  1. Xupeng Miao
  2. Gabriele Oliaro
  3. Xinhao Cheng
  4. Mengdi Wu
  5. Colin Unger
  6. Zhihao Jia