Blox: A Modular Toolkit for Deep Learning Schedulers (2312.12621v1)
Abstract: Deep Learning (DL) workloads have rapidly increased in popularity in enterprise clusters, and several new cluster schedulers have been proposed in recent years to support these workloads. With rapidly evolving DL workloads, it is challenging to quickly prototype and compare scheduling policies across workloads. Further, as prior systems target different aspects of scheduling (resource allocation, placement, elasticity, etc.), it is also challenging to combine these techniques and understand the overall benefits. To address these challenges, we propose Blox, a modular toolkit that allows developers to compose individual components and realize diverse scheduling frameworks. We identify a set of core abstractions for DL scheduling, implement several existing schedulers using these abstractions, and verify the fidelity of these implementations by reproducing results from prior research. We also highlight how we can evaluate and compare existing schedulers in new settings: different workload traces, higher cluster loads, and changes in DNN workloads and deployment characteristics. Finally, we showcase Blox's extensibility by composing policies from different schedulers and by implementing novel policies with minimal code changes. Blox is available at https://github.com/msr-fiddle/blox.
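To make the composition idea concrete, below is a minimal Python sketch of how a modular scheduler toolkit in the spirit of Blox can let a developer swap one policy component while reusing the rest of the framework. The class names (`Scheduler`, `FIFOPolicy`, `LeastAttainedServicePolicy`) and their interfaces are illustrative assumptions for this sketch, not Blox's actual API; see the repository at https://github.com/msr-fiddle/blox for the real abstractions.

```python
# Illustrative sketch only: class/method names are assumptions, not Blox's
# actual API. It shows the kind of composition a modular scheduler toolkit
# enables: swap the job-ordering policy while reusing placement/allocation.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    job_id: int
    gpus_requested: int
    attained_service: float = 0.0  # GPU-hours received so far


class FIFOPolicy:
    """Order jobs by arrival (job_id used as an arrival proxy here)."""
    def order(self, jobs: List[Job]) -> List[Job]:
        return sorted(jobs, key=lambda j: j.job_id)


class LeastAttainedServicePolicy:
    """Tiresias-style ordering: prioritize jobs with the least service."""
    def order(self, jobs: List[Job]) -> List[Job]:
        return sorted(jobs, key=lambda j: j.attained_service)


class Scheduler:
    """Composes a job-ordering policy with a simple greedy placement step."""
    def __init__(self, policy, total_gpus: int):
        self.policy = policy
        self.total_gpus = total_gpus

    def schedule_round(self, jobs: List[Job]) -> Dict[int, int]:
        """Allocate GPUs to jobs in policy order until the cluster is full."""
        free = self.total_gpus
        allocation: Dict[int, int] = {}
        for job in self.policy.order(jobs):
            if job.gpus_requested <= free:
                allocation[job.job_id] = job.gpus_requested
                free -= job.gpus_requested
        return allocation


if __name__ == "__main__":
    jobs = [Job(0, 4, 10.0), Job(1, 2, 1.5), Job(2, 4, 0.0)]
    # Swapping the policy changes scheduling behavior without touching
    # the allocation loop.
    for policy in (FIFOPolicy(), LeastAttainedServicePolicy()):
        sched = Scheduler(policy, total_gpus=8)
        print(type(policy).__name__, sched.schedule_round(jobs))
```

Swapping `FIFOPolicy` for `LeastAttainedServicePolicy` changes which jobs receive GPUs first without modifying the allocation logic; this separation of policy from mechanism is the kind of reuse a modular toolkit like Blox targets.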