
Communication Round and Computation Efficient Exclusive Prefix-Sums Algorithms (for MPI_Exscan) (2507.04785v1)

Published 7 Jul 2025 in cs.DC

Abstract: Parallel scan primitives compute element-wise inclusive or exclusive prefix sums of input vectors contributed by $p$ consecutively ranked processors under an associative, binary operator $\oplus$. In message-passing systems with bounded, one-ported communication capabilities, at least $\lceil\log_2 p\rceil$ or $\lceil\log_2 (p-1)\rceil$ communication rounds are required to perform the scans. While there are well-known, simple algorithms for the inclusive scan that solve the problem in $\lceil\log_2 p\rceil$ communication rounds with $\lceil\log_2 p\rceil$ applications of $\oplus$ (which could be expensive), the exclusive scan appears more difficult. Conventionally, the problem is solved with either $\lceil\log_2 (p-1)\rceil+1$ communication rounds (e.g., by shifting the input vectors), or in $\lceil\log_2 p\rceil$ communication rounds with $2\lceil\log_2 p\rceil-1$ applications of $\oplus$ (by a modified inclusive scan algorithm). We give a new, simple algorithm that computes the exclusive prefix sums in $q=\lceil\log_2 (p-1)+\log_2\frac{4}{3}\rceil$ simultaneous send-receive communication rounds with $q-1$ applications of $\oplus$. We compare the three algorithms implemented in MPI against the MPI library native MPI_Exscan primitive on a small, $36$-node cluster with a state-of-the-art MPI library, indicating possible and worthwhile improvements to standard implementations. The algorithms assume input vectors to be small so that performance is dominated by the number of communication rounds. For large input vectors, other (pipelined, fixed-degree tree) algorithms must be used.
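
The round counts quoted in the abstract can be checked with a small sequential simulation. The sketch below is not MPI code and does not implement the paper's new algorithm. It simulates the well-known recursive-doubling inclusive scan, derives an exclusive scan from it by shifting the inputs one rank to the right, and evaluates the new algorithm's round-count formula from the abstract using integer arithmetic. All function names are hypothetical, and each pass of the loop stands in for one simultaneous send-receive round.

```python
def ceil_log2(n):
    """ceil(log2(n)) for an integer n >= 1, computed without floating point."""
    return (n - 1).bit_length()


def inclusive_scan(x, op):
    """Simulate a recursive-doubling inclusive scan over len(x) ranks.

    In the round with skip s, rank r receives the partial result of rank
    r - s and combines it on the left, so a non-commutative op stays in
    rank order. Returns (results, number_of_rounds).
    """
    p = len(x)
    w = list(x)
    rounds = 0
    s = 1
    while s < p:
        prev = list(w)  # snapshot: all ranks send before anyone updates
        for r in range(s, p):
            w[r] = op(prev[r - s], prev[r])
        s *= 2
        rounds += 1
    return w, rounds


def exclusive_scan_by_shift(x, op):
    """Exclusive scan via one shift round plus an inclusive scan on p-1 ranks.

    Rank 0 has no defined result and is represented by None here
    (an assumption of this sketch).
    """
    p = len(x)
    if p < 2:
        return [None] * p, 0
    shifted = x[:-1]  # one round: rank r sends its input to rank r + 1
    scanned, rounds = inclusive_scan(shifted, op)
    return [None] + scanned, rounds + 1


def rounds_new_formula(p):
    """Smallest q with 2**q >= 4 * (p - 1) / 3, for p >= 2.

    This equals ceil(log2(p-1) + log2(4/3)) from the abstract; integer
    arithmetic avoids rounding trouble when the sum is an exact integer
    (for example p = 4).
    """
    q = 0
    while 3 * (1 << q) < 4 * (p - 1):
        q += 1
    return q


def concat(a, b):
    """String concatenation: associative but not commutative."""
    return a + b


prefixes, used_rounds = exclusive_scan_by_shift(list("abcde"), concat)
```

String concatenation is used as the operator because it is associative but not commutative, so any ordering mistake in the simulation shows up directly in the result. Evaluating the formulas at `p = 36` (the cluster size mentioned in the abstract) gives `ceil_log2(35) + 1` rounds for the shift approach, `ceil_log2(36)` rounds for the modified inclusive scan, and `rounds_new_formula(36)` rounds for the new algorithm.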
