
Characterization of Large Language Model Development in the Datacenter (2403.07648v2)

Published 12 Mar 2024 in cs.DC and cs.LG

Abstract: LLMs have demonstrated impressive performance across several transformative tasks. However, efficiently utilizing large-scale cluster resources to develop LLMs is non-trivial, as the process is riddled with challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes the hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce two of our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery, and (2) decoupled scheduling for evaluation, which provides timely performance feedback via trial decomposition and scheduling optimization.
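
To make the first system effort concrete, the sketch below shows the general shape of a fault-tolerant pretraining loop: checkpoint periodically, diagnose a failure from its error text, and resume from the last good checkpoint. This is a minimal, hypothetical illustration only; the function names (training_step, diagnose), the JSON checkpoint file, and the simulated failure are assumptions made for this sketch, and the diagnose() placeholder merely stands in for the paper's LLM-involved failure diagnosis over production logs.

```python
# Hypothetical sketch of fault-tolerant pretraining with automatic recovery.
# Not the paper's implementation; all names and values are illustrative.
import json
import os
import random

CKPT_PATH = "ckpt.json"  # illustrative checkpoint location


def save_checkpoint(step, state):
    with open(CKPT_PATH, "w") as f:
        json.dump({"step": step, "state": state}, f)


def load_checkpoint():
    # Return the last saved (step, state), or a fresh starting point.
    if not os.path.exists(CKPT_PATH):
        return 0, {"loss": float("inf")}
    with open(CKPT_PATH) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]


def training_step(step, state):
    # Stand-in for a real forward/backward pass; occasionally "fails"
    # to emulate hardware or communication faults.
    if random.random() < 0.05:
        raise RuntimeError("NCCL timeout on rank 3")
    state["loss"] = 1.0 / (step + 1)
    return state


def diagnose(error_message):
    # Placeholder for failure diagnosis; a real system would classify
    # errors (e.g., NCCL, ECC, NVLink) before deciding to restart.
    return "restartable" if "NCCL" in error_message else "fatal"


def train(total_steps=200, ckpt_every=20):
    step, state = load_checkpoint()
    while step < total_steps:
        try:
            state = training_step(step, state)
            step += 1
            if step % ckpt_every == 0:
                save_checkpoint(step, state)
        except RuntimeError as err:
            if diagnose(str(err)) != "restartable":
                raise  # escalate non-recoverable failures to operators
            step, state = load_checkpoint()  # resume from last good checkpoint


if __name__ == "__main__":
    train()
```

In a production setting, the checkpointing and restart logic would of course operate across many GPU nodes rather than a single process, and the diagnosis step would analyze runtime and hardware logs rather than a single exception message.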

Authors (12)
  1. Qinghao Hu (31 papers)
  2. Zhisheng Ye (8 papers)
  3. Zerui Wang (12 papers)
  4. Guoteng Wang (6 papers)
  5. Meng Zhang (184 papers)
  6. Qiaoling Chen (14 papers)
  7. Peng Sun (210 papers)
  8. Dahua Lin (336 papers)
  9. Xiaolin Wang (93 papers)
  10. Yingwei Luo (12 papers)
  11. Yonggang Wen (84 papers)
  12. Tianwei Zhang (199 papers)
Citations (22)