Unicron: Economizing Self-Healing LLM Training at Scale (2401.00134v1)

Published 30 Dec 2023 in cs.DC and cs.LG

Abstract: Training large-scale LLMs is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale LLM training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale LLM training.

Summary

  • The paper introduces Unicron, a self-healing workload manager that minimizes failure-recovery costs and improves efficiency in large-scale LLM training.
  • It leverages in-band error detection and dynamic cost-aware reconfiguration to rapidly identify and address failures in cloud environments.
  • Experiments on a 128-GPU cluster show that Unicron improves overall training efficiency by up to 1.9x over state-of-the-art methods.

Introduction

The development of LLMs is pivotal for advancing natural language processing. LLMs such as GPT-3 and BERT have become foundational in AI research and applications, powered by extensive parallelization and optimization frameworks such as Megatron-LM and DeepSpeed. LLM training has also benefited from cloud platforms, which facilitate deployment on GPU-rich clusters. Despite these advantages, cloud-based training environments suffer from high failure rates. Failures not only interrupt training but also incur substantial downtime and economic losses, because recovery itself is time-consuming.

Self-Healing in Large-Scale LLM Training

The Unicron system addresses challenges associated with failure recovery during LLM training on cloud platforms. The aim of Unicron is to minimize the total cost of failures by integrating efficient error detection, seamless system state transitions, and optimal reconfiguration in the face of diverse error scenarios. This comprehensive management of failures stands to enhance the reliability and economic efficiency of training LLMs at scale. Unicron is designed to operate alongside existing distributed frameworks like Megatron, preserving all existing optimizations and functionalities. Its novel components, the Unicron agent and coordinator, underpin a strategic approach to self-healing during training interruptions.
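The paper describes the agent/coordinator split at the prose level only; the sketch below is a hypothetical illustration of that division of labor, assuming a per-node agent that emits in-band progress reports and a cluster-level coordinator that marks silent nodes as failed. All names here (UnicronAgent, UnicronCoordinator, NodeState) are invented for this example and are not the paper's actual API.

```python
import time
from dataclasses import dataclass, field
from enum import Enum, auto


class NodeState(Enum):
    HEALTHY = auto()
    FAILED = auto()


@dataclass
class UnicronAgent:
    """Hypothetical per-node agent: watches the local training process
    and reports progress to the coordinator (names are illustrative)."""
    node_id: int
    last_heartbeat: float = field(default_factory=time.time)

    def heartbeat(self) -> dict:
        # In a design like the paper's, this signal would piggyback on
        # training-loop progress (in-band) instead of a separate prober.
        self.last_heartbeat = time.time()
        return {"node": self.node_id, "ts": self.last_heartbeat}


class UnicronCoordinator:
    """Hypothetical cluster-level coordinator: aggregates agent reports
    and decides when a reconfiguration is needed."""

    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self.states: dict[int, NodeState] = {}
        self.last_seen: dict[int, float] = {}

    def observe(self, report: dict) -> None:
        self.last_seen[report["node"]] = report["ts"]
        self.states[report["node"]] = NodeState.HEALTHY

    def failed_nodes(self, now: float) -> list[int]:
        # A node whose reports have stopped arriving is marked FAILED;
        # the coordinator would then trigger plan generation (not shown).
        failed = []
        for node, ts in self.last_seen.items():
            if now - ts > self.timeout_s:
                self.states[node] = NodeState.FAILED
                failed.append(node)
        return failed
```

The design point this sketch tries to capture is that the coordinator reacts to the absence of in-band progress signals, so no dedicated health-check traffic is required, which is consistent with the paper's claim of error detection without extra overhead.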

Techniques and Architectural Design

Unicron's distributed workload manager features in-band error detection, identifying issues in real time with negligible overhead because detection rides on the training process itself. The Unicron coordinator then assesses and handles detected errors through the per-node agents. To curb failure-related downtime, Unicron employs a transition strategy that shortens state changes by reusing partial results from the interrupted training iteration. Finally, its dynamic cost-aware plan generation, informed by a cost model spanning all concurrent tasks in the cluster, selectively reconfigures tasks for optimal utilization and efficiency.
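The paper's plan generator jointly considers all running tasks; the toy allocator below conveys the flavor of such cost-aware re-planning under an assumed sublinear throughput-scaling model. The scaling exponent, the greedy strategy, and every identifier are illustrative placeholders rather than the paper's formulation, which optimizes total failure-related cost, not raw throughput alone.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    gpus: int            # GPUs currently assigned
    throughput: float    # useful work per second at the current size


def scaled_throughput(task: Task, gpus: int) -> float:
    # Illustrative sublinear scaling assumption; the paper derives its
    # cost model from the tasks' measured behavior instead.
    return task.throughput * (gpus / task.gpus) ** 0.9


def best_plan(tasks: list[Task], total_gpus: int) -> dict[str, int]:
    """Greedy toy allocator: after losing nodes, reassign the surviving
    GPUs so that cluster-wide throughput is maximized."""
    plan = {t.name: 1 for t in tasks}          # every task keeps >= 1 GPU
    for _ in range(total_gpus - len(tasks)):
        # Give the next GPU to the task with the largest marginal gain.
        best = max(
            tasks,
            key=lambda t: scaled_throughput(t, plan[t.name] + 1)
            - scaled_throughput(t, plan[t.name]),
        )
        plan[best.name] += 1
    return plan


if __name__ == "__main__":
    tasks = [Task("llm-7b", 64, 100.0), Task("llm-70b", 64, 40.0)]
    # Suppose 16 of 128 GPUs failed: re-plan over the remaining 112.
    print(best_plan(tasks, 112))
```

A real plan generator would also weigh transition overheads such as checkpoint loading and process-group rebuilds against the throughput regained, which is the kind of trade-off Unicron's cost model is described as capturing.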

Results and Impact

Extensive experiments with various training tasks on a 128-GPU cluster show that Unicron substantially reduces the cost of recovering from failures. Benchmarked against other methods, its integrated design, spanning checkpointing, error detection, and plan generation, yields significant efficiency gains. Achieving up to 1.9x the overall training efficiency of state-of-the-art systems, Unicron offers a promising route to economical and resilient LLM training.

In summary, Unicron presents a transformative approach to managing distributed LLM training systems. By judiciously navigating the intricacies of failure recovery and resource utilization, it sets a new standard for training large-scale models in cloud environments. The implications include improved reliability, a reduced economic impact of downtime, and continued progress in AI capabilities powered by ever-larger LLMs.