
Distributed Inference and Fine-tuning of Large Language Models Over The Internet (2312.08361v1)

Published 13 Dec 2023 in cs.LG and cs.DC

Abstract: LLMs are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. In order to do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals - a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.

Distributed Inference and Fine-tuning of LLMs

Introduction

Deploying LLMs with over 50 billion parameters for NLP tasks has been constrained by the need for high-end hardware. Traditional workarounds such as offloading parameters to RAM do not suffice: they are too slow for latency-sensitive applications like chatbots and search engines. The alternative explored in this paper is distributed computing over the Internet, running these LLMs on a swarm of unreliable, geographically distributed consumer devices.
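
To see why offloading falls short, consider a rough back-of-envelope estimate (an illustration with assumed numbers, not figures taken from the paper): generating each new token with offloading requires streaming essentially all model weights from host RAM to the GPU, so the PCIe link becomes the bottleneck.

```python
# Back-of-envelope estimate (assumed numbers, not the paper's measurements):
# offloaded generation must stream all model weights over PCIe for every token.

PARAMS = 70e9              # assumed model size, e.g. a Llama 2 (70B)-class model
BYTES_PER_PARAM = 2        # 16-bit weights
PCIE_BANDWIDTH = 32e9      # PCIe 4.0 x16 theoretical peak, bytes per second

weight_bytes = PARAMS * BYTES_PER_PARAM
seconds_per_token = weight_bytes / PCIE_BANDWIDTH

print(f"weights streamed per token: {weight_bytes / 1e9:.0f} GB")
print(f"latency lower bound: {seconds_per_token:.1f} s/token")
# ~4.4 s per token even at peak bandwidth, which is why the paper instead
# distributes the model's layers across many devices and streams only small
# activations between them.
```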

Fault-Tolerance in Model Inference

The paper introduces algorithms tailored to distributed environments in which devices are unreliable and network latencies vary. A novel fault-tolerant autoregressive inference algorithm, combined with a decentralized load-balancing mechanism, lets the system recover quickly from server failures. Fault tolerance comes from maintaining dual attention caches, so that when a server fails, a standby server can rapidly restore its state; only the data strictly necessary for recovery is re-transmitted.
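
As a concrete illustration, here is a minimal client-side sketch of the recovery idea, using hypothetical names rather than the actual Petals API: the client keeps its own copy of everything it has sent to each pipeline stage, so if a server drops out, a replacement serving the same blocks can rebuild the lost attention cache by replaying those cached inputs before generation continues.

```python
# Minimal sketch (hypothetical names, not the Petals API) of client-side
# fault-tolerant autoregressive inference: the client caches the inputs it has
# sent to each pipeline stage so a replacement server can rebuild the failed
# server's attention cache by replaying them.

class StageSession:
    def __init__(self, block_range, find_server):
        self.block_range = block_range          # transformer blocks served by this stage
        self.find_server = find_server          # callback returning a live server for the range
        self.server = find_server(block_range)
        self.sent_inputs = []                   # client-side cache of past inputs

    def forward(self, hidden_states):
        while True:
            try:
                out = self.server.inference_step(hidden_states)
                self.sent_inputs.append(hidden_states)
                return out
            except ConnectionError:
                # Server failed: pick a replacement and replay past inputs so it
                # can re-materialize the attention cache before continuing.
                self.server = self.find_server(self.block_range)
                for past in self.sent_inputs:
                    self.server.inference_step(past)

def generate(prompt_ids, stages, embed, lm_head, sample, max_new_tokens=32):
    tokens = list(prompt_ids)
    inputs = embed(tokens)                      # prefill: the whole prompt at once
    for _ in range(max_new_tokens):
        h = inputs
        for stage in stages:                    # chain of StageSession objects covering all blocks
            h = stage.forward(h)
        next_token = sample(lm_head(h))
        tokens.append(next_token)
        inputs = embed([next_token])            # later steps send only the newest token
    return tokens
```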

Load Balancing and Fine-Tuning

The paper also tackles the dynamic, uneven nature of consumer-grade hardware and network resources with a load-balancing protocol. This adaptive mechanism assigns transformer blocks across the distributed system to maximize overall throughput, even as servers join and leave at will (a simplified sketch of the block-assignment idea follows below). The system further supports parameter-efficient fine-tuning, in which clients, not servers, store and update the trainable parameters, so it adapts to new tasks without heavily taxing the network.
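
The block-assignment idea can be illustrated with a toy greedy rule (a simplified sketch with hypothetical names, not the paper's exact protocol): a joining server that can hold a fixed number of consecutive transformer blocks claims the contiguous span that is currently the most under-served, so new capacity keeps flowing toward the pipeline's bottleneck.

```python
# Simplified sketch of greedy block assignment (hypothetical names, not the
# paper's exact protocol): a new server claims the contiguous span of blocks
# with the lowest aggregate throughput currently provided by existing servers.

def choose_blocks(block_throughput, num_blocks_to_serve):
    """block_throughput[i] = summed throughput of servers already holding block i."""
    n = len(block_throughput)
    best_start, best_load = 0, float("inf")
    for start in range(n - num_blocks_to_serve + 1):
        load = sum(block_throughput[start:start + num_blocks_to_serve])
        if load < best_load:                    # pick the most under-served contiguous span
            best_start, best_load = start, load
    return range(best_start, best_start + num_blocks_to_serve)

# Example: a model with 8 blocks; a joining server that can hold 3 blocks.
current = [12.0, 12.0, 4.0, 4.0, 4.0, 9.0, 9.0, 9.0]   # tokens/s already provided per block
print(list(choose_blocks(current, 3)))                  # -> [2, 3, 4], the under-served middle
```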

Performance Evaluation

Extensive simulations and real-world experiments confirm that the system can run LLMs efficiently over the Internet. Compared to local offloading, the approach delivers up to a tenfold speedup for interactive generation. Tests spanning two continents demonstrate the system's robustness and efficiency despite the challenges of geodistribution.

Conclusion

The paper concludes by validating the proposed decentralized system as a cost-effective way to use LLMs on distributed, unreliable devices. It pools idle compute resources while guaranteeing correct model outputs and delivering significant speedups over traditional offloading. The authors also call attention to privacy considerations and to potential future improvements, such as integrating secure multi-party computation to safeguard sensitive data processed by the system.

Authors (8)
  1. Alexander Borzunov
  2. Max Ryabinin
  3. Artem Chumachenko
  4. Dmitry Baranchuk
  5. Tim Dettmers
  6. Younes Belkada
  7. Pavel Samygin
  8. Colin Raffel
Citations (27)