PartIR: Composing SPMD Partitioning Strategies for Machine Learning (2401.11202v4)
Abstract: Training modern large neural networks (NNs) requires a combination of parallelization strategies encompassing data, model, and optimizer sharding. As strategies grow in complexity, it becomes necessary for partitioning tools to be 1) expressive, allowing the composition of simpler strategies, and 2) predictable, so that performance can be estimated analytically. We present PartIR, our design for an NN partitioning system. PartIR is focused on an incremental approach to rewriting and is hardware- and runtime-agnostic. We present a simple but powerful API for composing sharding strategies and a simulator to validate them. The process is driven by high-level, programmer-issued partitioning tactics, which can be both manual and automatic. Importantly, the tactics are specified separately from the model code, making them easy to change. We evaluate PartIR on several different models to demonstrate its predictability, expressibility, and ability to reach peak performance.
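To make the idea of keeping partitioning decisions outside the model code concrete, here is a minimal sketch using JAX's public sharding API (Mesh, NamedSharding, PartitionSpec) rather than PartIR's own tactic API, which is not shown in this excerpt. The function `model_fn`, the mesh axis names, and the tensor shapes are illustrative assumptions, not part of PartIR.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# ---- Model code: written once, with no mention of devices or sharding. ----
def model_fn(params, x):
    # Hypothetical toy model; stands in for an arbitrary NN forward pass.
    return jnp.tanh(x @ params["w"] + params["b"])

# ---- Partitioning decisions: specified separately from the model code. ----
# Arrange the available devices into a 2D mesh with a "batch" axis
# (data parallelism) and a "model" axis (parameter sharding).
devices = np.array(jax.devices()).reshape(len(jax.devices()), 1)
mesh = Mesh(devices, axis_names=("batch", "model"))

# Data-parallel layout for activations, model-parallel layout for weights.
x_sharding = NamedSharding(mesh, P("batch", None))
w_sharding = NamedSharding(mesh, P(None, "model"))
b_sharding = NamedSharding(mesh, P("model"))

params = {
    "w": jax.device_put(jnp.ones((128, 256)), w_sharding),
    "b": jax.device_put(jnp.zeros((256,)), b_sharding),
}
x = jax.device_put(jnp.ones((32, 128)), x_sharding)

# jit propagates the input shardings through the unchanged model code;
# switching strategies only requires editing the sharding specs above.
y = jax.jit(model_fn)(params, x)
print(y.sharding)
```

The point of the sketch is the separation of concerns the abstract describes: the model function never changes, while the mesh and the per-tensor layouts can be swapped to compose data, model, or optimizer sharding strategies.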