Optimal, Non-pipelined Reduce-scatter and Allreduce Algorithms
Abstract: The reduce-scatter collective operation, in which $p$ processors in a network collectively reduce $p$ input vectors into a result vector that is partitioned over the processors, is important both in its own right and as a building block for other collective operations. We present a surprisingly simple but non-trivial algorithm that solves this problem optimally in $\lceil\log_2 p\rceil$ communication rounds, with each processor sending, receiving, and reducing exactly $p-1$ blocks of vector elements. We combine this with a similarly simple, well-known allgather algorithm to obtain a volume-optimal algorithm for the allreduce collective operation, in which the result vector is replicated on all processors. The communication pattern is a simple, $\lceil\log_2 p\rceil$-regular circulant graph that has also been used elsewhere. The algorithms assume that the binary reduction operator is commutative, and we discuss this assumption. The algorithms can readily be implemented and used for the collective operations MPI_Reduce_scatter_block, MPI_Reduce_scatter, and MPI_Allreduce as specified in the MPI standard. We also observe that the reduce-scatter algorithm can serve as a template for round-optimal all-to-all communication and the collective MPI_Alltoall operation.
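To make the round structure concrete, the following is a minimal sketch of the $\lceil\log_2 p\rceil$-regular circulant communication pattern the abstract refers to. It assumes the common convention that in round $k$ each processor $i$ sends to $(i + 2^k) \bmod p$ and receives from $(i - 2^k) \bmod p$; the direction convention, the function name `circulant_schedule`, and the pairing of partners are illustrative assumptions, and the paper's actual contribution, the schedule of which of the $p-1$ blocks each processor sends and reduces in which round, is not reproduced here.

```python
import math

def circulant_schedule(p):
    """For each of the ceil(log2 p) rounds, return each processor's
    (send_to, recv_from) partners on the circulant pattern with
    skips 1, 2, 4, ... (an assumed, conventional orientation)."""
    rounds = math.ceil(math.log2(p))
    schedule = []
    for k in range(rounds):
        skip = 2 ** k
        # Round k: processor i sends to i + 2^k and receives from i - 2^k (mod p).
        schedule.append([((i + skip) % p, (i - skip) % p) for i in range(p)])
    return schedule

# Example: p = 7 processors need ceil(log2 7) = 3 rounds with skips 1, 2, 4.
sched = circulant_schedule(7)
print(len(sched))   # 3 rounds
print(sched[2][0])  # processor 0 in the last round: sends to 4, receives from 3
```

Note that, unlike recursive-halving schemes, this pattern is well defined for any $p$, not just powers of two, which is consistent with the $\lceil\log_2 p\rceil$ round count claimed in the abstract.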