- The paper introduces a non-intrusive API for distributed training that guarantees mathematical equivalence with local model updates.
- It employs gradient bucketing and overlaps computation with communication to reduce latency and enhance throughput.
- Empirical evaluations on ResNet50 and BERT models demonstrate near-linear scalability using up to 256 GPUs, validating significant performance gains.
Overview of PyTorch Distributed: Experiences on Accelerating Data Parallel Training
The paper discusses the design, implementation, and evaluation of the PyTorch distributed data parallel module. Given the increasing importance of large datasets and models in deep learning, there is a critical need to leverage more computational resources efficiently. Data parallelism addresses this need by replicating the model across devices, letting each replica compute gradients independently on its own shard of data, and then synchronizing those gradients so every replica applies the same update.
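For context, the module is exposed as a thin wrapper around an ordinary nn.Module. The sketch below is a minimal illustration rather than the paper's benchmark code; the linear model, batch shapes, and torchrun-style launch (the LOCAL_RANK environment variable) are assumptions made for the example.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU, launched e.g. via `torchrun --nproc_per_node=<gpus> script.py`.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    model = torch.nn.Linear(1024, 10).to(device)      # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])   # the only DDP-specific line
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(10):                               # placeholder training loop
        inputs = torch.randn(32, 1024, device=device)
        labels = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()   # gradients are averaged across replicas during this call
        optimizer.step()  # every replica applies the same averaged gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```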
Key Contributions and Techniques
The PyTorch module aims to achieve three main objectives: mathematical equivalence with local training, a non-intrusive API, and high performance. It relies on several techniques to accelerate distributed data parallel training:
- Bucketing Gradients: Rather than launching a separate synchronization for each gradient, the module packs multiple gradients into a bucket and synchronizes each bucket with a single collective operation, amortizing communication overhead.
- Overlapping Computation with Communication: The AllReduce for a bucket is launched as soon as all of its gradients are ready, so communication proceeds concurrently with the remaining backward computation and shortens per-iteration time.
- Skipping Synchronizations: The module optionally allows skipping gradient synchronization for some iterations (for example, when accumulating gradients) to reduce communication overhead. Evaluations indicate significant performance gains without a substantial impact on convergence under appropriate conditions; a hedged sketch of the bucketing and no-sync knobs follows this list.
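The bucketing and synchronization-skipping behaviors above surface as two user-visible knobs in DDP: the bucket_cap_mb constructor argument and the no_sync() context manager (the overlap itself happens automatically during the backward pass). The sketch below assumes model, loader, loss_fn, optimizer, and local_rank are defined as in the earlier example; the 25 MB bucket cap and four-step accumulation schedule are illustrative values, not recommendations from the paper.

```python
import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP

# bucket_cap_mb controls how many megabytes of gradients are packed into one bucket;
# each full bucket launches one AllReduce that overlaps with the ongoing backward pass.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

accumulation_steps = 4  # illustrative: synchronize only every 4th micro-batch
for step, (inputs, labels) in enumerate(loader):
    last_micro_batch = (step + 1) % accumulation_steps == 0
    # no_sync() disables the AllReduce for this backward pass, so gradients simply
    # accumulate locally; the final micro-batch runs a normal synchronized backward.
    context = contextlib.nullcontext() if last_micro_batch else ddp_model.no_sync()
    with context:
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()
    if last_micro_batch:
        optimizer.step()
        optimizer.zero_grad()
```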
Evaluation and Findings
The PyTorch distributed data parallel module was tested using ResNet50 and BERT models across various GPU configurations. Experiments demonstrated near-linear scalability up to 256 GPUs, affirming the utility of the optimizations. The NCCL backend delivered significantly better performance than Gloo, which points to communication being the dominant bottleneck for the latter.
Additionally, the studies highlighted the following findings:
- The backward pass, during which gradient synchronization overlaps with gradient computation, accounts for the bulk of per-iteration latency.
- Optimal bucket sizes vary with the model and hardware. Neither very small nor very large buckets perform best; a moderate bucket size yields the best results for ResNet50 and BERT under the reported conditions (see the timing sketch after this list).
- Skipping gradient synchronization judiciously yields substantial latency reductions without significantly affecting training accuracy.
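As an illustration of how one might probe the bucket-size trade-off in practice, the sketch below times iterations across a few candidate bucket_cap_mb values. The model, tensor shapes, candidate sizes, and iteration count are placeholders rather than the paper's experimental setup, and local_rank is assumed to be initialized as in the first example.

```python
import time
import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def mean_iteration_time(bucket_cap_mb, local_rank, n_iters=50):
    """Average per-iteration time for a placeholder model under a given bucket size."""
    device = torch.device("cuda", local_rank)
    model = torch.nn.Sequential(                      # placeholder model
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).to(device)
    ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=bucket_cap_mb)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    inputs = torch.randn(64, 1024, device=device)
    targets = torch.randn(64, 1024, device=device)

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        optimizer.zero_grad()
        F.mse_loss(ddp_model(inputs), targets).backward()
        optimizer.step()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

for cap_mb in (1, 5, 25, 100):                        # candidate bucket sizes, in MB
    print(f"bucket_cap_mb={cap_mb}: {mean_iteration_time(cap_mb, local_rank):.4f} s/iter")
```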
Implications and Future Directions
The advancements presented by PyTorch’s distributed data parallel module are pivotal for deep learning frameworks. They set a benchmark for efficiently harnessing multiple GPUs to train large models on large datasets. However, potential future improvements include:
- Enhancing dynamic bucket management with predictive methods for gradient order.
- Greater synergy between layer-dropping techniques and distributed communication efficiencies.
- Exploring gradient compression to further reduce communication volume (a hedged sketch using DDP’s communication-hook API follows this list).
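One way to prototype the last item with current PyTorch is DDP’s communication-hook API, which lets a user replace the default bucket AllReduce. The sketch below attaches the built-in fp16 compression hook, which simply halves the communicated payload; it is shown only as a stand-in for the more aggressive compression schemes the paper alludes to, and ddp_model is assumed to be the wrapped model from the earlier sketches.

```python
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Replace the default bucket AllReduce: cast each gradient bucket to fp16 before
# communication and back to the original dtype afterwards, halving network traffic.
ddp_model.register_comm_hook(
    state=dist.group.WORLD,                 # process group used for the collective
    hook=default_hooks.fp16_compress_hook,
)
```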
These enhancements would not only strengthen the existing framework but also address latent inefficiencies that could hinder wide-scale adoption across diverse applications and architectures. In closing, this work offers a thorough treatment of distributed training optimization, providing both empirical insights and actionable guidance for practitioners in the field.