
Swing: Short-cutting Rings for Higher Bandwidth Allreduce (2401.09356v2)

Published 17 Jan 2024 in cs.DC, cs.LG, cs.NI, and cs.PF

Abstract: The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the distance between communicating nodes, especially on torus networks, where a higher distance implies that multiple messages are forwarded over the same link, reducing the allreduce bandwidth. Torus networks are widely used in systems optimized for machine learning workloads (e.g., Google TPUs and Amazon Trainium devices), as well as in some of the Top500 supercomputers. To improve allreduce performance on torus networks, we introduce Swing, a new algorithm that keeps the distance between communicating nodes low by swinging between torus directions. Our analysis and experimental evaluation show that Swing outperforms existing allreduce algorithms by up to 3x for vectors ranging from 32B to 128MiB, on different types of torus and torus-like topologies, regardless of their shape and size.
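
As a rough illustration of why peer distance matters, consider a 1-D ring (the simplest torus): when every node exchanges data with a peer d hops away, each link carries roughly d concurrent messages, so the bandwidth available to that step is divided by d. The Python sketch below is a minimal model under that assumption, not the paper's algorithm: it compares the per-step peer distances of a classic recursive-halving/doubling schedule with an alternating-direction schedule used here purely to evoke the "swinging" idea. The alternating rule and the link-load model are illustrative assumptions; the exact Swing peer selection, and its extension to multi-dimensional tori, is defined in the paper itself.

# Minimal sketch (assumptions noted above, not the paper's exact algorithm):
# compare per-step peer distances on a ring of p nodes for (a) recursive
# halving/doubling, whose distance at step s is 2**s, and (b) an illustrative
# alternating-direction schedule whose distances follow |1 - 2 + 4 - 8 + ...|.

import math

def recursive_halving_distances(p):
    """Peer distance at each of the log2(p) steps of recursive halving/doubling."""
    return [2 ** s for s in range(int(math.log2(p)))]

def alternating_distances(p):
    """Distances when each step flips direction: absolute alternating partial sums."""
    dists, acc = [], 0
    for s in range(int(math.log2(p))):
        acc += (-2) ** s          # partial sums: 1, -1, 3, -5, 11, -21, ...
        dists.append(abs(acc))
    return dists

def total_link_load(dists):
    """Sum of per-step peer distances ~ total messages crossing each link (lower is better)."""
    return sum(dists)

if __name__ == "__main__":
    p = 64  # ring of 64 nodes (power of two, so log2(p) steps)
    rh = recursive_halving_distances(p)
    alt = alternating_distances(p)
    print("recursive halving/doubling:", rh, "-> per-link load", total_link_load(rh))
    print("alternating ('swing-like'):", alt, "-> per-link load", total_link_load(alt))

On a 64-node ring the alternating schedule accumulates roughly one third less per-link traffic (42 vs. 63 distance units), which hints at the kind of saving the abstract's distance argument targets; the paper's actual algorithm additionally swings between the dimensions of a multi-dimensional torus.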
