Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning (2306.14086v1)
Abstract: Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers such as Slurm. Under fixed wall-clock time limits, DL researchers typically must run a sequence of batch jobs and endure long interruptions on overloaded machines. These interruptions significantly lower research productivity and degrade the QoS of services deployed in production. To mitigate interruption-related issues, we investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forests, XGBoost, Deep Q-Networks, and policy gradient methods, to design a proactive provisioner using production job traces from three GPU clusters. Following standard machine learning practice, we partition each job trace into training and validation subsets, train each model on the training subset, and evaluate its generalization on the validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate RL methods. Our experiments show that Mirage reduces interruptions by 17%-100% and safeguards 23%-76% of jobs from any interruption across varying load levels on the three clusters.
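The abstract describes the statistical-learning side of the approach: predictors such as random forests and XGBoost are trained on production job traces, with each trace partitioned into training and validation subsets to check generalization. The sketch below illustrates that general setup in Python; the trace file, feature columns, and the queue-wait prediction target are hypothetical placeholders, not the paper's actual features or pipeline.

```python
# Minimal sketch (not the authors' code) of the statistical-learning baseline:
# split a job trace into training and validation subsets, fit a random forest
# on per-job features, and measure error on the held-out subset.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical job-trace schema: one row per submitted batch job.
trace = pd.read_csv("job_trace.csv")
features = trace[["num_gpus", "requested_walltime", "queue_depth_at_submit"]]
target = trace["queue_wait_seconds"]  # quantity a proactive provisioner wants to anticipate

# Standard practice noted in the abstract: train on one subset, validate on the other.
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.2, shuffle=False  # time-ordered split for trace data
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("validation MAE (s):", mean_absolute_error(y_val, model.predict(X_val)))
```

The unshuffled, time-ordered split keeps the validation subset to jobs submitted after the training period, a common convention for trace data; the paper's exact partitioning and feature engineering may differ.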