$k$-ported vs. $k$-lane Broadcast, Scatter, and Alltoall Algorithms (2008.12144v1)
Abstract: In $k$-ported message-passing systems, a processor can simultaneously receive $k$ different messages from $k$ other processors, and send $k$ different messages to $k$ other processors that may or may not be different from the processors from which messages are received. Modern clustered systems may not have such capabilities. Instead, compute nodes consisting of $n$ processors can simultaneously send and receive $k$ messages from other nodes, by letting $k$ processors on the nodes concurrently send and receive at most one message. We pose the question of how to design good algorithms for this $k$-lane model, possibly by adapting algorithms devised for the traditional $k$-ported model. We discuss and compare a number of (non-optimal) $k$-lane algorithms for the broadcast, scatter and alltoall collective operations (as found in, e.g., MPI), and experimentally evaluate these on a small $36\times 32$-node cluster with a dual OmniPath network (corresponding to $k=2$). Results are preliminary.