- The paper introduces TACCL, a novel system that uses communication sketches to guide the synthesis of collective communication algorithms for distributed machine learning.
- It employs a three-step approach—routing via MILP, heuristic ordering, and refined scheduling—to optimize data transfers and minimize GPU idle time.
- Experimental results show TACCL outperforms NCCL, achieving up to 6.7x faster all-gather and significant speedups in large-scale model training.
An Overview of TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
The paper "TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches" presents an innovative approach to synthesizing efficient collective communication algorithms for machine learning models distributed across multiple GPUs and servers. In distributed machine learning, inter-GPU communication can significantly hinder performance, especially as model sizes continue to expand. This research addresses the critical challenge of optimizing collective communication, which is fundamental for reducing GPU idle time and enhancing the overall performance of distributed training and inference.
The authors introduce TACCL, a system that utilizes communication sketches to guide the automatic generation of collective communication algorithms tailored to specific hardware configurations and communication collectives. The core innovation lies in the use of communication sketches, which allow algorithm designers to provide high-level structural insights while avoiding the exhaustive search of the entire algorithmic space. This human-in-the-loop approach effectively narrows the search space, enabling scalable synthesis that was previously infeasible for multi-node topologies.
Key Contributions
- Communication Sketches: Inspired by program sketching, this abstraction lets designers specify logical topologies, switch-hyperedge policies, expected input sizes, and algorithm symmetries. These hints constrain the search space, making the synthesis process tractable (a hypothetical encoding of such a sketch appears after this list).
- Three-Step Synthesis Approach: TACCL employs a novel three-step method (toy sketches of each step follow this list):
  - Routing: determining the path each data chunk takes, using a mixed integer linear program (MILP) that optimizes estimated execution time.
  - Heuristic Ordering: applying a greedy heuristic to order the data transfers that share each link.
  - Contiguity and Exact Scheduling: running a refined MILP that balances the trade-off between bandwidth and latency and finalizes the schedule of data chunks.
- Evaluation and Results: TACCL's synthesized algorithms significantly outperform the Nvidia Collective Communication Library (NCCL) in several scenarios. For instance, TACCL achieves up to 6.7 times faster all-gather operations on DGX-2 nodes and up to 66% faster all-to-all operations on NDv2 nodes. The paper further demonstrates end-to-end training speedups of 11% to 2.4 times for language models such as Transformer-XL and BERT.
- Scalability and Generality: The system successfully synthesizes algorithms for varied topologies, including hierarchical systems like DGX-2 and NDv2 and non-hierarchical structures such as a 2D Torus. The algorithms also scale effectively to larger configurations, such as 10 or more NDv2 nodes.
- Open-source Collaboration: TACCL's implementation is shared as open-source, facilitating ongoing research and practical application by both academic and industry practitioners, notably within Microsoft’s Azure platform.
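To make the sketch abstraction concrete, here is a hypothetical Python encoding of the kinds of hints a communication sketch carries. The field names and values below are invented for illustration and do not match TACCL's actual input format.

```python
# Hypothetical encoding (not TACCL's real input format) of the hints a
# communication sketch supplies to the synthesizer. All field names and
# values here are illustrative assumptions.
sketch = {
    "collective": "allgather",
    "num_nodes": 2,
    "gpus_per_node": 8,
    # Logical topology: restrict the search to intra-node NVLink paths plus
    # a designated inter-node sender/receiver GPU pair per node.
    "intra_node_links": "nvlink_all_pairs",
    "inter_node_gpus": [(0, 8), (1, 9)],
    # Switch-hyperedge policy: how to replace each switch with a set of
    # direct point-to-point links the MILP can reason about.
    "switch_policy": "min_links_per_switch",
    # Expected input size, which shifts the bandwidth/latency trade-off.
    "input_size_bytes": 1 << 20,
    # Symmetry hint: require the two nodes to run mirror-image schedules.
    "symmetry": "rotational",
}
```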
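The routing step can be pictured with a much smaller integer program than the one in the paper. The toy below, written with the PuLP solver library, routes two chunks over a 4-GPU ring and minimizes a crude alpha-beta estimate of the bottleneck link time; the topology, constants, and objective are simplifications assumed for illustration, not the paper's formulation.

```python
# Toy reduction of TACCL's routing step: pick a path for every chunk over a
# directed link graph, minimizing an alpha-beta estimate of the bottleneck
# link time. Requires: pip install pulp
import pulp

ranks = range(4)
links = [(u, v) for u in ranks for v in ranks
         if u != v and (abs(u - v) == 1 or abs(u - v) == 3)]  # 4-node ring
alpha, beta, size = 1.0, 0.5, 1.0      # per-transfer latency, per-byte cost, chunk size
chunks = [(0, 2), (1, 3)]              # (source, destination) per chunk

prob = pulp.LpProblem("toy_routing", pulp.LpMinimize)
x = {(c, u, v): pulp.LpVariable(f"x_{c}_{u}_{v}", cat="Binary")
     for c in range(len(chunks)) for (u, v) in links}
T = pulp.LpVariable("T", lowBound=0)
# Minimize the bottleneck link time; the tiny second term discourages cycles.
prob += T + 0.001 * pulp.lpSum(x.values())

for c, (src, dst) in enumerate(chunks):
    for n in ranks:
        out_flow = pulp.lpSum(x[c, u, v] for (u, v) in links if u == n)
        in_flow = pulp.lpSum(x[c, u, v] for (u, v) in links if v == n)
        balance = 1 if n == src else (-1 if n == dst else 0)
        prob += out_flow - in_flow == balance  # chunk c flows from src to dst

for (u, v) in links:
    # Transfers sharing a link serialize, so link time grows with its load.
    prob += T >= pulp.lpSum(x[c, u, v] for c in range(len(chunks))) * (alpha + beta * size)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for c in range(len(chunks)):
    path = [(u, v) for (u, v) in links if x[c, u, v].value() == 1]
    print(f"chunk {c}: {path}")
```

With two equal-length candidate paths per chunk, the bottleneck objective steers the solver toward link-disjoint routes, which is exactly the congestion awareness the routing MILP is after.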
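Once routes are fixed, transfers that share a link must be put in some order. The greedy rule below (send the chunk with the most hops still ahead of it first) is one plausible heuristic, used here purely for illustration; the paper's ordering heuristic differs in its details.

```python
# Toy version of the ordering step: given the paths chosen by routing,
# decide the order of transfers on each link. The longest-remaining-path-
# first rule is an illustrative assumption, not the paper's exact heuristic.
from collections import defaultdict

def order_links(paths):
    """paths[c] is chunk c's route as a list of links [(u, v), ...]."""
    queue = defaultdict(list)  # link -> [(priority, chunk)]
    for c, path in paths.items():
        for i, link in enumerate(path):
            remaining = len(path) - i            # hops left, counting this one
            queue[link].append((-remaining, c))  # more hops left => send earlier
    return {link: [c for _, c in sorted(entries)]
            for link, entries in queue.items()}

paths = {0: [(0, 1), (1, 2), (2, 3)], 1: [(1, 2)]}
print(order_links(paths))  # chunk 0 goes first on the contended link (1, 2)
```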
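The bandwidth-latency trade-off the final MILP navigates reduces to a line of alpha-beta arithmetic: fusing k chunks into one contiguous transfer pays the fixed per-transfer latency alpha once, while k separate sends pay it k times. The constants below are arbitrary examples.

```python
# Why contiguity matters under the alpha-beta cost model: one fused
# transfer of k chunks costs alpha + k*size*beta, while k single-chunk
# transfers cost k*(alpha + size*beta). Constants are arbitrary examples.
alpha, beta, size, k = 1.0, 0.5, 1.0, 4
fused = alpha + k * size * beta        # 3.0: pay the latency alpha once
separate = k * (alpha + size * beta)   # 6.0: pay it k times
print(f"fused={fused}, separate={separate}")
```

The refined MILP weighs this saving against the cost of waiting for chunks to co-locate in a buffer before they can be sent as one transfer.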
Practical and Theoretical Implications
Practically, TACCL advances the efficient use of GPU resources by attacking the inter-GPU communication bottleneck: the reduction in GPU idle time translates directly into higher computational throughput for large-scale, distributed machine learning models. Theoretically, communication sketches and the staged synthesis approach illustrate how high-level, human-supplied abstractions can make algorithm synthesis tractable in computing systems.
Future Directions in AI and ML Systems
The breakthroughs presented in this work suggest several promising avenues for future research and development. TACCL's framework can serve as a foundation for exploring more complex topologies and for developing hierarchical composition techniques that extend scalability. Further research could focus on automated exploration of the space of communication sketches, perhaps leveraging machine learning models to predict good configurations from observed patterns and workloads. Additionally, integrating fused communication operations into the TACCL infrastructure could further improve performance in certain scenarios.
The paper on TACCL represents a significant step forward in addressing the communication challenges of distributed machine learning. By combining targeted human insight with rigorous computational synthesis, it navigates the complexities of modern GPU infrastructures and offers robust, practical solutions with far-reaching implications.