Optimal, Non-pipelined Reduce-scatter and Allreduce Algorithms
Abstract: The reduce-scatter collective operation, in which $p$ processors in a network collectively reduce $p$ input vectors into a result vector that is partitioned over the processors, is important both in its own right and as a building block for other collective operations. We present a surprisingly simple but non-trivial algorithm that solves this problem optimally in $\lceil\log_2 p\rceil$ communication rounds, with each processor sending, receiving, and reducing exactly $p-1$ blocks of vector elements. We combine this with a similarly simple, well-known allgather algorithm to obtain a volume-optimal algorithm for the allreduce collective operation, in which the result vector is replicated on all processors. The communication pattern is a simple, $\lceil\log_2 p\rceil$-regular circulant graph that has also been used elsewhere. The algorithms assume that the binary reduction operator is commutative, and we discuss this assumption. The algorithms can readily be implemented and used for the collective operations MPI_Reduce_scatter_block, MPI_Reduce_scatter, and MPI_Allreduce as specified in the MPI standard. We also observe that the reduce-scatter algorithm can serve as a template for round-optimal all-to-all communication and the collective MPI_Alltoall operation.
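To make the round structure concrete, the following is a minimal sketch of the $\lceil\log_2 p\rceil$-regular circulant communication pattern the abstract refers to. It assumes the common convention that in round $k$ each processor $i$ sends to $(i + 2^k) \bmod p$ and receives from $(i - 2^k) \bmod p$; the direction convention, the function name `circulant_schedule`, and the pairing of partners are illustrative assumptions, and the paper's actual contribution, the schedule of which of the $p-1$ blocks each processor sends and reduces in which round, is not reproduced here.

```python
import math

def circulant_schedule(p):
    """For each of the ceil(log2 p) rounds, return each processor's
    (send_to, recv_from) partners on the circulant pattern with
    skips 1, 2, 4, ... (an assumed, conventional orientation)."""
    rounds = math.ceil(math.log2(p))
    schedule = []
    for k in range(rounds):
        skip = 2 ** k
        # Round k: processor i sends to i + 2^k and receives from i - 2^k (mod p).
        schedule.append([((i + skip) % p, (i - skip) % p) for i in range(p)])
    return schedule

# Example: p = 7 processors need ceil(log2 7) = 3 rounds with skips 1, 2, 4.
sched = circulant_schedule(7)
print(len(sched))   # 3 rounds
print(sched[2][0])  # processor 0 in the last round: sends to 4, receives from 3
```

Note that, unlike recursive-halving schemes, this pattern is well defined for any $p$, not just powers of two, which is consistent with the $\lceil\log_2 p\rceil$ round count claimed in the abstract.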