Saturn: An Optimized Data System for Large Model Deep Learning Workloads (2309.01226v2)

Published 3 Sep 2023 in cs.LG, cs.AI, and cs.DC

Abstract: LLMs such as GPT-3 and ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques and tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists and domain scientists, who may lack the necessary systems know-how. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as a mixed-integer linear program (MILP). We find that direct use of an MILP solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.
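
To make the SPASE formulation above concrete, the sketch below sets up a heavily simplified, single-round variant as an MILP: each job picks exactly one (parallelism, GPU-count) configuration, all chosen jobs must fit within a fixed GPU budget, and the objective is to minimize the wave's makespan. This is an illustrative assumption, not the paper's actual formulation, which also schedules jobs over time and draws runtimes from its empirical profiler; the job names, runtime numbers, and the use of the open-source PuLP/CBC solver in place of a commercial MILP solver are all placeholders.

```python
# A single-round SPASE-style MILP sketch (illustrative only): choose one
# (parallelism, gpu_count) configuration per job, fit all jobs within a fixed
# GPU budget, and minimize the wave's makespan. The runtime table stands in
# for a profiler's estimates; PuLP/CBC stands in for the MILP solver.
import pulp

TOTAL_GPUS = 8

# profiled_runtime[job][(parallelism, gpu_count)] = estimated runtime in seconds
profiled_runtime = {
    "bert-finetune": {("ddp", 2): 900, ("ddp", 4): 520, ("fsdp", 4): 560},
    "gpt2-finetune": {("pipeline", 4): 1400, ("fsdp", 4): 1250, ("fsdp", 8): 700},
    "vit-finetune":  {("ddp", 2): 650, ("ddp", 4): 380},
}

prob = pulp.LpProblem("spase_single_round", pulp.LpMinimize)

# x[(job, config)] = 1 if the job runs with that (parallelism, gpu_count) choice
choices = [(j, c) for j, cfgs in profiled_runtime.items() for c in cfgs]
x = pulp.LpVariable.dicts("x", choices, cat="Binary")
makespan = pulp.LpVariable("makespan", lowBound=0)

prob += makespan  # objective: finish the whole wave as early as possible

for j, cfgs in profiled_runtime.items():
    # each job must pick exactly one configuration
    prob += pulp.lpSum(x[(j, c)] for c in cfgs) == 1
    # the makespan must cover whichever configuration this job picked
    prob += makespan >= pulp.lpSum(rt * x[(j, c)] for c, rt in cfgs.items())

# all jobs run concurrently in this simplification, so GPU demands must fit
prob += pulp.lpSum(c[1] * x[(j, c)] for (j, c) in choices) <= TOTAL_GPUS

prob.solve(pulp.PULP_CBC_CMD(msg=False))

for (j, c) in choices:
    if pulp.value(x[(j, c)]) > 0.5:
        print(f"{j}: parallelism={c[0]}, gpus={c[1]}")
print("estimated makespan (s):", pulp.value(makespan))
```

Even at this toy scale, the solver must trade off GPU apportioning against per-job parallelism choices jointly, which is the coupling that, per the abstract, makes a direct MILP solve noticeably more effective than baseline heuristics.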

Authors (2)
  1. Kabir Nagrecha (6 papers)
  2. Arun Kumar (78 papers)
Citations (5)