(Poly)Logarithmic Time Construction of Round-optimal $n$-Block Broadcast Schedules for Broadcast and irregular Allgather in MPI (2205.10072v2)
Abstract: We give a fast(er), communication-free, parallel construction of optimal communication schedules that allow broadcasting of $n$ distinct blocks of data from a root processor to all other processors in $1$-ported, $p$-processor networks with fully bidirectional communication. For any $p$ and $n$, broadcasting in this model requires $n-1+\lceil\log_2 p\rceil$ communication rounds. In contrast to other constructions, all processors follow the same, circulant graph communication pattern, which makes it possible to use the schedules for the allgather (all-to-all-broadcast) operation as well. The new construction takes $O(\log^3 p)$ time steps per processor, each of which can compute its part of the schedule independently of the other processors in $O(\log p)$ space. The result is a significant improvement over the sequential $O(p \log^2 p)$ time and $O(p\log p)$ space construction of Träff and Ripke (2009) with considerable practical import. The round-optimal schedule construction is then used to implement communication optimal algorithms for the broadcast and (irregular) allgather collective operations as found in MPI (the Message-Passing Interface), and significantly and practically improves over the implementations in standard MPI libraries (mpich, OpenMPI, Intel MPI) for certain problem ranges. The application to the irregular allgather operation is entirely new.
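The round bound stated in the abstract can be evaluated directly. The following is a minimal sketch that only computes the count $n-1+\lceil\log_2 p\rceil$; it does not construct the schedule itself. The function name `optimal_broadcast_rounds` and the restriction to $p \ge 2$ are assumptions made for illustration and do not come from the paper.

```python
def optimal_broadcast_rounds(p: int, n: int) -> int:
    """Return n - 1 + ceil(log2 p), the round count quoted in the abstract.

    p: number of processors (assumed here to be at least 2)
    n: number of distinct blocks to broadcast (at least 1)
    """
    if p < 2 or n < 1:
        raise ValueError("expected p >= 2 and n >= 1")
    # For p >= 1, (p - 1).bit_length() equals ceil(log2 p);
    # using integers avoids floating-point rounding for large p.
    return n - 1 + (p - 1).bit_length()


if __name__ == "__main__":
    for p, n in [(2, 1), (16, 1), (17, 1), (1000, 10)]:
        print(f"p={p}, n={n}: {optimal_broadcast_rounds(p, n)} rounds")
```

For a single block ($n = 1$) the expression reduces to $\lceil\log_2 p\rceil$; each additional block adds exactly one round.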
- Broadcasting multiple messages in the multiport model. IEEE Transactions on Parallel and Distributed Systems, 10(5):500–508, 1999.
- An optimal algorithm for computing census functions in message-passing systems. Parallel Processing Letters, 3(1):19–23, 1993.
- Optimal multiple message broadcasting in telephone-like communication systems. Discrete Applied Mathematics, 100(1–2):1–15, 2000.
- Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143–1156, 1997.
- Arthur M. Farley. Broadcast time in communication networks. SIAM Journal on Applied Mathematics, 39(2):385–390, 1980.
- Broadcasting multiple messages in the 1-in port model in optimal time. Journal of Combinatorial Optimization, 36(4):1333–1355, 2018.
- A new construction of broadcast graphs. Discrete Applied Mathematics, 280:144–155, 2020.
- An efficient heuristic for broadcasting in networks. Journal of Parallel and Distributed Computing, 66(1):68–76, 2006.
- Reproducible MPI benchmarking is still not as easy as you think. IEEE Transactions on Parallel and Distributed Systems, 27(12):3617–3630, 2016.
- Bin Jia. Process cooperation in multiple message broadcast. Parallel Computing, 35(12):572–580, 2009.
- Optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, 38(9):1249–1268, 1989.
- Optimal broadcast and summation in the LogP model. In 5th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 142–153, 1993.
- Multiple message broadcasting in communication networks. Networks, 26:253–261, 1995.
- MPI Forum. MPI: A Message-Passing Interface Standard. Version 3.1, June 4th 2015. www.mpi-forum.org.
- Collective operations in NEC’s high-performance MPI libraries. In 20th International Parallel and Distributed Processing Symposium (IPDPS), page 100, 2006.
- Eunice E. Santos. Optimal and near-optimal algorithms for $k$-item broadcast. Journal of Parallel and Distributed Computing, 57(2):121–139, 1999.
- Jesper Larsson Träff. Brief announcement: Fast(er) construction of round-optimal $n$-block broadcast schedules. In 34th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 143–146. ACM, 2022.
- Jesper Larsson Träff. Fast(er) construction of round-optimal $n$-block broadcast schedules. In IEEE International Conference on Cluster Computing (CLUSTER), pages 142–151. IEEE Computer Society, 2022.
- Decomposing MPI collectives for exploiting multi-lane communication. In IEEE International Conference on Cluster Computing (CLUSTER), pages 270–280. IEEE Computer Society, 2020.
- Optimal broadcast for fully connected processor-node networks. Journal of Parallel and Distributed Computing, 68(7):887–901, 2008.
- A pipelined algorithm for large, irregular all-gather problems. International Journal of High Performance Computing Applications, 24(1):58–68, 2010.