Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
37 tokens/sec
GPT-4o
11 tokens/sec
Gemini 2.5 Pro Pro
37 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
10 tokens/sec
DeepSeek R1 via Azure Pro
33 tokens/sec
2000 character limit reached

Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning (2110.10548v2)

Published 20 Oct 2021 in cs.PL, cs.DC, and cs.LG

Abstract: We present a novel characterization of the mapping of multiple parallelism forms (e.g. data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mapping. We experimentally verify the substantial effect of these mappings on all-reduce performance (up to 448x). We offer a novel syntax-guided program synthesis framework that is able to decompose reductions over one or more parallelism axes to sequences of collectives in a hierarchy- and mapping-aware way. For 69% of parallelism placements and user requested reductions, our framework synthesizes programs that outperform the default all-reduce implementation when evaluated on different GPU hierarchies (max 2.04x, average 1.27x). We complement our synthesis tool with a simulator exceeding 90% top-10 accuracy, which therefore reduces the need for massive evaluations of synthesis results to determine a small set of optimal programs and mappings.

Citations (16)

Summary

We haven't generated a summary for this paper yet.