SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient (2301.11913v2)

Published 27 Jan 2023 in cs.DC and cs.LG

Abstract: Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer LLM with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.

Authors (4)
  1. Max Ryabinin (29 papers)
  2. Tim Dettmers (22 papers)
  3. Michael Diskin (6 papers)
  4. Alexander Borzunov (7 papers)

Summary

  • The paper introduces the square-cube law of distributed training, showing that as model size increases, per-device computation grows faster than the communication required between pipeline stages.
  • The paper proposes a fault-tolerant, adaptive training mechanism that dynamically reallocates resources to maintain performance across unreliable networks.
  • The paper integrates advanced compression techniques to reduce bandwidth requirements while achieving competitive throughput compared to traditional methods.

SWARM Parallelism: A Communication-Efficient Method for Large-Scale Model Training on Unreliable and Heterogeneous Networks

The paper "SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient" presents a novel approach for training large neural networks using distributed systems composed of heterogeneous, unreliable, and poorly connected devices. This work primarily addresses the growing need to efficiently train billion-parameter models across more cost-effective infrastructures, such as preemptible cloud instances or pooled resources from disparate regions.

Key Contributions and Findings

The central contribution of this paper is the introduction and empirical validation of SWARM parallelism, an algorithm that enables decentralized model-parallel training on systems with slow, constrained connections and diverse hardware. Several findings and contributions stand out:

  1. Communication Scalability: The work presents the "square-cube law" of distributed training, which posits that as model size increases, the communication cost between pipeline stages grows more slowly than the computational cost per device (made concrete in the short derivation after this list). This counterintuitive observation implies that larger models can be trained more efficiently in distributed setups, despite the communication overhead usually associated with scale.
  2. Adaptive and Fault-Tolerant Training: SWARM parallelism introduces a fault-tolerant, adaptive mechanism that makes the system resilient to node failures. By constructing temporary randomized pipelines and dynamically rerouting work based on measured peer performance, training continues despite fluctuating compute and network conditions (a toy sketch of this throughput-weighted routing appears just after the derivation below).
  3. Integration with Compression Techniques: The authors combine SWARM parallelism with existing compression strategies, such as 8-bit quantization of activations and weight sharing across layers, further reducing communication costs and enabling efficient training of large models on bandwidth-limited infrastructure.
  4. Performance Validation: The authors analyze SWARM's throughput in detail, showing that it maintains high hardware utilization for large models even with bandwidth capped below 200 Mb/s and with substantial network latency, and that it remains competitive with established approaches such as GPipe and ZeRO-Offload.
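
To make the square-cube intuition concrete, here is a back-of-the-envelope version of the argument for a generic Transformer block; this is an illustrative sketch, not the paper's exact accounting (which also considers attention and gradient traffic). For a block with hidden size $d$ processing a microbatch of $b$ sequences of length $s$, the dense matrix multiplications dominate the cost:

$$\text{compute per stage} = \Theta(b\,s\,d^2)\ \text{FLOPs}, \qquad \text{activations sent to the next stage} = \Theta(b\,s\,d)\ \text{values},$$

$$\frac{\text{compute}}{\text{communication}} = \Theta(d).$$

The time spent computing per byte transferred therefore grows roughly linearly with model width, which is why sufficiently large models can be pipelined over links far slower than datacenter interconnects.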

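The adaptive routing described in item 2 can be illustrated with a minimal Python sketch. This is hypothetical code, not the paper's implementation (which coordinates peers through a DHT): each sender picks a next-stage peer with probability proportional to its recently measured throughput and stops routing to peers that fail.

```python
import random
from dataclasses import dataclass


@dataclass
class Peer:
    """A hypothetical next-stage worker with a running throughput estimate."""
    name: str
    throughput: float = 1.0  # microbatches per second (exponential moving average)
    alive: bool = True


def pick_next_stage_peer(peers: list[Peer]) -> Peer:
    """Sample a live peer with probability proportional to its throughput,
    so faster devices receive proportionally more microbatches."""
    live = [p for p in peers if p.alive]
    if not live:
        raise RuntimeError("no live peers left in the next pipeline stage")
    return random.choices(live, weights=[p.throughput for p in live], k=1)[0]


def report_result(peer: Peer, seconds: float, ok: bool, ema: float = 0.9) -> None:
    """Update the peer's throughput estimate after a microbatch, or drop it on failure."""
    if not ok:
        peer.alive = False  # preempted or disconnected: future work is rerouted
        return
    peer.throughput = ema * peer.throughput + (1 - ema) * (1.0 / max(seconds, 1e-6))


# Toy usage: one pipeline stage served by three unequal workers.
stage = [Peer("fast-gpu", 4.0), Peer("t4", 1.0), Peer("t4-slow", 0.5)]
for _ in range(5):
    print("routing microbatch to", pick_next_stage_peer(stage).name)
```

On top of this per-microbatch routing, the paper also rebalances peers across pipeline stages when one stage becomes a bottleneck or loses nodes, which is what keeps temporary randomized pipelines productive on preemptible hardware.
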
Implications and Future Directions

The implications of the paper are twofold. Practically, SWARM parallelism democratizes the training of large neural networks by making it feasible on cheaper, widely accessible cloud resources, allowing researchers and smaller organizations to run experiments previously restricted to well-funded institutions with dedicated HPC clusters. Theoretically, the square-cube law opens new avenues for optimizing distributed training frameworks and motivates further research into scalable, efficient partitioning strategies for large neural architectures.

Future research could build on this foundation by exploring more sophisticated model partitioning techniques that relax the fixed, uniform pipeline-stage assumption currently employed. Additionally, integrating compression methods beyond 8-bit quantization could further adapt SWARM to an even broader range of deployment scenarios.
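
For context on the activation compression mentioned above, a simple dynamic (absmax) 8-bit quantizer can be sketched as follows; this toy NumPy version stands in for the blockwise quantization schemes cited by the paper and is purely illustrative.

```python
import numpy as np


def quantize_8bit(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Dynamic absmax quantization: map a float tensor to int8 plus one scale."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_8bit(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor on the receiving pipeline stage."""
    return q.astype(np.float32) * scale


# A (b, s, d) activation tensor shrinks 4x on the wire relative to float32.
acts = np.random.randn(4, 128, 1024).astype(np.float32)
q, s = quantize_8bit(acts)
recovered = dequantize_8bit(q, s)
print("max abs error:", float(np.max(np.abs(acts - recovered))))
```

Shrinking the activations sent between stages is part of what keeps the pipeline busy on sub-200 Mb/s links.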

Overall, the paper marks a significant step toward keeping the training of increasingly large models feasible across a diverse range of computational environments, paving the way for broader participation in cutting-edge AI research and development.
