Overview of "DAPPLE: A Pipelined Data Parallel Approach for Training Large Models"
The paper "DAPPLE: A Pipelined Data Parallel Approach for Training Large Models" introduces a novel framework, DAPPLE, designed to enhance the efficiency of deep neural network (DNN) training on sophisticated GPU platforms, particularly when dealing with large-scale models. The framework addresses several key challenges in DNN training, including improving computational efficiency, ensuring convergence, and managing memory usage without compromising performance. DAPPLE combines data parallelism and pipeline parallelism, introducing innovative methods for partitioning and placing model layers on interconnected devices.
Key Contributions
- Hybrid Parallelism: DAPPLE combines data and pipeline parallelism for distributed training of large DNN models. It partitions model layers into stages, places the stages across a set of interconnected devices, and can replicate individual stages over several devices for data parallelism. The approach accounts for the hierarchical interconnects of modern GPU platforms and preserves synchronous training semantics by synchronizing gradients within each replicated stage (a minimal partition sketch appears after this list).
- Optimal Parallelization Strategy: The framework features a parallelization strategy planner that automatically generates a hybrid data/pipeline strategy for each training configuration. The planner optimizes the end-to-end execution time of a training iteration while respecting device memory limits, so the resulting plan remains efficient and scalable across different hardware configurations (a simplified cost-model sketch follows the list).
- Pipeline Stage Scheduling: A novel runtime scheduling algorithm mitigates the memory consumption typically associated with pipeline parallelism. By scheduling backward passes early and interleaving them with forward passes, it bounds the number of micro-batch activations held in memory and, because training remains synchronous, avoids storing multiple versions of the parameters (illustrated by the scheduling sketch after this list).
- Performance Evaluations: Comprehensive experiments demonstrate significant speedups over existing approaches. The DAPPLE planner outperforms the plans generated by PipeDream's planner by up to 3.23x under synchronous training, and the DAPPLE runtime achieves up to 1.6x higher training throughput than GPipe while using 12% less memory.
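To make the hybrid scheme concrete, here is a minimal sketch, not DAPPLE's actual API, of how a model's layers might be grouped into pipeline stages and how each stage can be replicated across devices for data parallelism. The layer count, stage boundary, and device names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    layers: list    # indices of the model layers assigned to this stage
    replicas: list  # devices holding a full copy of this stage (data parallelism)

def build_hybrid_plan(num_layers, stage_boundaries, devices_per_stage):
    """Split layers at `stage_boundaries` and assign a device group to each stage.

    Gradients within a stage's replica group are all-reduced (data parallelism);
    activations flow between consecutive stages (pipeline parallelism).
    """
    plan, start = [], 0
    for stage_id, end in enumerate(stage_boundaries + [num_layers]):
        plan.append(Stage(layers=list(range(start, end)),
                          replicas=devices_per_stage[stage_id]))
        start = end
    return plan

# Example: an 8-layer model split into 2 stages; stage 0 is replicated on
# gpu:0/gpu:1 and stage 1 on gpu:2/gpu:3.
plan = build_hybrid_plan(num_layers=8,
                         stage_boundaries=[4],
                         devices_per_stage=[["gpu:0", "gpu:1"], ["gpu:2", "gpu:3"]])
for stage in plan:
    print(stage)
```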
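The planner must trade off per-iteration execution time against device memory. The sketch below is a deliberate simplification, not the DAPPLE planner itself: it enumerates stage boundaries, discards partitions whose per-stage memory exceeds an assumed device budget, and keeps the partition with the smallest bottleneck stage. The per-layer time and memory numbers are made up for illustration, and the real planner additionally models communication and device placement.

```python
from itertools import combinations

def plan(layer_time, layer_mem, num_stages, mem_capacity):
    """Pick stage boundaries that minimize the slowest stage under a memory cap."""
    n = len(layer_time)
    best, best_cost = None, float("inf")
    # Choose num_stages - 1 cut points between consecutive layers.
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = [0, *cuts, n]
        stages = [range(bounds[i], bounds[i + 1]) for i in range(num_stages)]
        if any(sum(layer_mem[l] for l in s) > mem_capacity for s in stages):
            continue  # violates the per-device memory constraint
        bottleneck = max(sum(layer_time[l] for l in s) for s in stages)
        if bottleneck < best_cost:
            best, best_cost = stages, bottleneck
    return best, best_cost

layer_time = [2, 4, 3, 5, 1, 2]   # ms per layer (illustrative)
layer_mem  = [1, 2, 2, 3, 1, 1]   # GB per layer (illustrative)
stages, cost = plan(layer_time, layer_mem, num_stages=2, mem_capacity=6)
print([list(s) for s in stages], cost)
```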
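The scheduling idea can also be illustrated in a few lines. The sketch below is a simplification with an assumed warm-up depth, not DAPPLE's exact schedule: a stage first admits a small number of forward micro-batches and then alternates backward and forward passes, so the number of micro-batches with live activations stays at the warm-up count rather than growing with the total number of micro-batches, as it does in an all-forward-then-all-backward schedule like GPipe's.

```python
def early_backward_schedule(num_microbatches, warmup):
    """Return a per-stage schedule as a list of ('F', i) / ('B', i) events."""
    schedule, fwd, bwd = [], 0, 0
    # Warm-up: inject a limited number of forward micro-batches to fill the pipeline.
    for _ in range(min(warmup, num_microbatches)):
        schedule.append(("F", fwd)); fwd += 1
    # Steady state: one backward for every forward (strict alternation),
    # which frees one micro-batch's activations before admitting the next.
    while bwd < num_microbatches:
        schedule.append(("B", bwd)); bwd += 1
        if fwd < num_microbatches:
            schedule.append(("F", fwd)); fwd += 1
    return schedule

sched = early_backward_schedule(num_microbatches=6, warmup=2)
print(sched)
# Peak number of micro-batches with live activations is `warmup` (2 here), not 6.
```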
Implications and Future Directions
From a practical perspective, DAPPLE's approach enables more efficient utilization of GPU resources for large-scale model training, potentially reducing the cost and energy consumption associated with such tasks. On the theoretical side, it advances the understanding of hybrid parallelism in DNN training and opens avenues for research into optimization techniques that further reduce overhead and improve scalability.
Looking ahead, the AI community could focus on extending DAPPLE-like frameworks to accommodate new hardware developments and emerging interconnect technologies. Additionally, exploring the application of DAPPLE to different model architectures and domains, such as natural language processing and recommendation systems, could yield further insights into its adaptability and performance benefits across various use cases.
In conclusion, DAPPLE represents a significant step towards optimizing the training process for large-scale DNN models, balancing the demands for computational efficiency, model convergence, and memory usage through its innovative combination of data and pipeline parallelism.