
Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

Published 23 Sep 2024 in cs.DC, cs.AI, and cs.LG | arXiv:2409.15241v1

Abstract: Given the popularity of generative AI, LLMs often consume hundreds or thousands of GPUs to parallelize and accelerate the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependencies of a single batch's training into smaller independent pieces, Domino pipelines the training of these independent pieces and provides a generic strategy for fine-grained overlapping of communication and computation. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.


Summary

  • The paper introduces Domino, a method that eliminates communication bottlenecks in LLM training via fine-grained tensor slicing and overlapping.
  • It uses row and column splits on inputs and weights to overlap computation with communication, resulting in significant performance gains.
  • Evaluations on Nvidia DGX-H100 clusters with GPT-3 and Llama-2 models demonstrate enhanced scalability and up to 1.3x faster training iterations.

Overview of Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

The paper "Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping" by Guanhua Wang et al. addresses significant communication overhead issues in the training of LLMs over distributed systems. The authors propose Domino, a generic approach designed to mitigate communication bottlenecks by effectively hiding communication behind computation using fine-grained tensor slicing techniques.

Introduction and Problem Statement

In the context of increasing demand for Generative AI applications such as chatbots and content generation, LLMs have grown exponentially in terms of model sizes, often spanning tens to hundreds of billions of parameters. Given their computational and memory requirements, these models necessitate the use of hundreds or thousands of GPUs across multiple nodes. This distributed setup, while essential, introduces substantial communication overheads, particularly in tensor parallelism (TP), a popular model parallelism technique.

Existing Distributed Training Paradigms

The paper distinguishes three primary paradigms for distributed transformer model training: data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). While DP maintains copies of the entire model on every GPU, TP and PP distribute different parts of the model across GPUs. Specifically, TP, which divides each layer across multiple GPUs, has garnered attention due to the relatively high system efficiency on Nvidia GPUs connected via high-bandwidth interconnects like NVLink and NVSwitch.

However, even with advancements in interconnect technology, communication remains a critical bottleneck. The paper observes that standard TP incurs global collective communication (all-reduce) in every transformer layer, in both the forward and backward passes, and that these collectives sit on the critical path of training execution, so the GPUs stall while they complete.
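
To make the critical-path all-reduce concrete, here is a minimal NumPy sketch (not the paper's code) of Megatron-style tensor parallelism for a transformer MLP block: the first weight is split by columns, the second by rows, and the per-GPU partial outputs must be summed, which is exactly the all-reduce that each layer waits on.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_model, d_ff, world = 4, 8, 16, 2  # toy sizes; world = TP degree

X = rng.standard_normal((batch, d_model))
W1 = rng.standard_normal((d_model, d_ff))   # first MLP weight
W2 = rng.standard_normal((d_ff, d_model))   # second MLP weight

# Single-device reference: Y = relu(X @ W1) @ W2
ref = np.maximum(X @ W1, 0) @ W2

# Megatron-style TP: W1 split by columns, W2 split by rows.
W1_shards = np.split(W1, world, axis=1)
W2_shards = np.split(W2, world, axis=0)

# Each "GPU" computes a partial output; summing the partials stands in
# for the all-reduce that sits on the critical path of every layer.
partials = [np.maximum(X @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
Y = np.sum(partials, axis=0)   # simulated all-reduce

assert np.allclose(Y, ref)
```

Because `Y` is needed by the next layer, in a naive schedule no useful computation can proceed until this reduction finishes.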

Domino Approach

Domino distinguishes itself by providing a generic methodology to overlap communication with computation. Unlike previous approaches that focused on limited kernel fusion strategies, Domino employs a broader scope of tensor slicing on both input data and model weights:

  1. Row Split on Inputs: The input tensors (e.g., the activation matrices) are partitioned along the batch dimension. Because each smaller tensor's computation is independent of the others, the communication for one piece can be overlapped with the computation of the next, achieving better communication hiding both within and across transformer layers.
  2. Column Split on Weights: Similarly, model weights are partitioned along the last dimension. Each partition results in partial computations that can be asynchronously reduced and concatenated, further hiding communication overheads and ensuring efficient computation.
  3. Hybrid Split: By combining both row and column splits, Domino can finely adjust the granularity of overlapping computations and communications, providing flexibility to optimize performance across various model sizes and system configurations.
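
The two slicing schemes above can be illustrated with a small NumPy sketch (an illustration, not the paper's implementation): splitting the input along the batch dimension yields independent chunks whose results concatenate back to the full output, and splitting the weight along its last dimension yields output slices that do the same.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 8))    # (batch, d_model)
W = rng.standard_normal((8, 12))   # (d_model, d_out)
ref = X @ W

# 1. Row split on inputs: partition X along the batch dimension.
#    Each chunk's result is independent, so the communication for
#    chunk i can run while chunk i+1 is still computing.
chunks = np.split(X, 3, axis=0)
row_out = np.concatenate([c @ W for c in chunks], axis=0)
assert np.allclose(row_out, ref)

# 2. Column split on weights: partition W along its last dimension.
#    Each shard yields a slice of the output; slices are concatenated.
shards = np.split(W, 4, axis=1)
col_out = np.concatenate([X @ s for s in shards], axis=1)
assert np.allclose(col_out, ref)
```

It is this independence between pieces, absent in the unsliced computation, that creates the slack Domino exploits for overlap.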

Implementation and Results

The authors implemented Domino within the Nvidia DGX-H100 cluster environment, leveraging Nvidia’s NCCL for collective communication. The optimization techniques included kernel-level enhancements such as CUDA MultiStream, torch.compile() functionality for JIT compilation of PyTorch operations, and CUDA Graph for reducing kernel launching overhead.
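
The scheduling pattern behind these kernel-level techniques can be sketched in plain Python (a toy model only: a background thread stands in for a separate CUDA stream, and a copy stands in for an async NCCL all-reduce). While the "communication" for micro-batch i is in flight, the main loop immediately starts computing micro-batch i+1.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
W = rng.standard_normal((8, 8))
chunks = [rng.standard_normal((4, 8)) for _ in range(4)]

def compute(x):            # stands in for a GEMM kernel
    return x @ W

def communicate(y):        # stands in for an async all-reduce
    return y.copy()

# Overlapped schedule: hand chunk i's "communication" to the side
# stream, then start computing chunk i+1 without waiting for it.
results = []
with ThreadPoolExecutor(max_workers=1) as comm_stream:
    pending = None
    for x in chunks:
        y = compute(x)                          # overlaps pending comm
        if pending is not None:
            results.append(pending.result())    # drain previous comm
        pending = comm_stream.submit(communicate, y)
    results.append(pending.result())

# The overlapped schedule must produce the same outputs as the
# sequential compute-then-communicate order.
ref = [communicate(compute(x)) for x in chunks]
assert all(np.allclose(r, e) for r, e in zip(results, ref))
```

In the real system the same structure is expressed with CUDA streams and NCCL collectives, so the overlap happens on-device rather than in host threads.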

Extensive benchmarks using GPT-3 and Llama-2 models on single-node and multi-node setups demonstrated Domino's effectiveness. The results highlight up to 1.3x speedup in training iteration times compared to the standard Megatron-LM implementation, with some cases outperforming even the theoretically optimal scenario that eliminates all communication overhead.

Implications and Future Directions

The practical implications of Domino are substantial. By achieving significant speedups without requiring extensive changes to existing training pipelines, Domino can play a critical role in enhancing the scalability and efficiency of LLM training. On the theoretical side, the work lays a foundation for new research avenues in fine-grained tensor operations and more sophisticated computation-communication overlapping strategies.

Future developments could include further refinement of the hybrid splitting strategy, adaptive partitioning schemes tailored to various hardware and network configurations, and extending the framework to other emerging accelerator architectures. Additionally, as the bandwidth capabilities of interconnects continue to evolve, Domino's principles could underpin more advanced optimizations, bridging the gap between current and next-generation distributed LLM training environments.

In conclusion, Domino represents a significant advancement in addressing the communication overhead challenges in distributed LLM training. Its approach of generic tensor slicing and overlapping not only enhances training performance but also ensures scalability and adaptability across diverse hardware configurations.
