Priority-based Parameter Propagation for Distributed DNN Training: A Detailed Examination
The paper, "Priority-based Parameter Propagation for Distributed DNN Training" by Jayarajan et al., addresses a significant bottleneck in the scalability of distributed Deep Neural Network (DNN) training: the communication overhead associated with parameter synchronization. By leveraging domain-specific insights into DNN training dynamics, the authors propose a novel synchronization mechanism, Priority-based Parameter Propagation (P3), designed to mitigate communication delays and enhance training throughput on constrained bandwidth networks.
Core Observations and Propositions
This paper hinges on two pivotal observations:
- Optimal Granularity Variability: The granularity at which gradients are best communicated may differ from the per-layer granularity at which the underlying DNN implementation produces them.
- Differential Synchronization Delays: Different model parameters can tolerate varying amounts of synchronization delay without adversely impacting training performance.
Based on these insights, P3 introduces two main components: Parameter Slicing and Priority-based Update. Together they let P3 synchronize parameters at a finer granularity than whole layers and prioritize transmission according to how soon each parameter is needed in the next iteration's forward pass.
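To make the two components concrete, the sketch below shows one way they could fit together. It is not the authors' MXNet implementation: the `SLICE_SIZE` constant, the `send_slice` transport callback, and the min-heap scheduler are illustrative assumptions. Layers are indexed from the input, so a lower index means the parameter is needed earlier in the next forward pass and is therefore more urgent.

```python
# Illustrative sketch of P3's two components (not the authors' implementation).
import heapq
import itertools

SLICE_SIZE = 50_000          # elements per slice; a hypothetical tuning knob
_tie = itertools.count()     # tie-breaker so the heap never compares payloads

def slice_gradient(layer_idx, grad):
    """Parameter slicing: split one layer's (flat) gradient into fixed-size pieces."""
    return [(layer_idx, off, grad[off:off + SLICE_SIZE])
            for off in range(0, len(grad), SLICE_SIZE)]

class PriorityScheduler:
    """Priority-based update: layers closer to the input get higher priority,
    since the next iteration's forward pass needs their parameters first."""
    def __init__(self):
        self._queue = []     # min-heap keyed on (layer index, tie-breaker)

    def enqueue(self, layer_idx, grad):
        for layer, off, chunk in slice_gradient(layer_idx, grad):
            heapq.heappush(self._queue, (layer, next(_tie), off, chunk))

    def drain(self, send_slice):
        """Transmit the most urgently needed slices first."""
        while self._queue:
            layer, _, off, chunk = heapq.heappop(self._queue)
            send_slice(layer, off, chunk)
```

In a real system the scheduler runs concurrently with back-propagation, so slices of early layers can overtake already-queued slices of later layers; the sketch only captures the ordering decision.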
Performance and Implementation
The authors' MXNet implementation of P3 delivers consistent improvements in training throughput across several DNN architectures, with maximum gains of 25% for ResNet-50, 38% for Sockeye, and 66% for VGG-19. These results were obtained on a range of cluster configurations with realistic network bandwidth constraints, which makes them particularly encouraging. The implementation also shows that P3 is model-agnostic and requires minimal effort to integrate into existing systems.
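A rough, self-contained simulation helps explain where the gains come from under limited bandwidth: gradients are produced back-to-front during back-propagation, so a FIFO sender finishes synchronizing the input-side layers, which the next forward pass needs first, last. The link bandwidth, gradient sizes, and timings below are made-up illustrative numbers, not measurements from the paper.

```python
# Toy event simulation contrasting FIFO transmission with priority re-ordering.
# All numbers (bandwidth, gradient sizes, production times) are illustrative.
import heapq

BW = 1.25e9  # assumed link bandwidth in bytes/s (~10 Gbit/s)

def simulate(chunks, priority):
    """chunks: list of (ready_time_s, layer_idx, size_bytes) in production order.
    Returns the time each layer finishes synchronizing."""
    heap, finish, t, i = [], {}, 0.0, 0
    while i < len(chunks) or heap:
        while i < len(chunks) and chunks[i][0] <= t:
            ready, layer, size = chunks[i]
            key = layer if priority else i     # layer urgency vs. arrival order
            heapq.heappush(heap, (key, ready, layer, size))
            i += 1
        if not heap:                           # link idle: wait for next gradient
            t = chunks[i][0]
            continue
        _, _, layer, size = heapq.heappop(heap)
        t += size / BW                         # transmit one chunk
        finish[layer] = max(finish.get(layer, 0.0), t)
    return finish

# Back-propagation produces gradients from the output layer (3) to the input (0).
grads = [(0.000, 3, 400e6), (0.010, 2, 300e6), (0.020, 1, 200e6), (0.030, 0, 100e6)]
print(simulate(grads, priority=False)[0])  # FIFO: layer 0 ready at ~0.80 s
print(simulate(grads, priority=True)[0])   # priority: layer 0 ready at ~0.40 s
```

Because the next forward pass is blocked until the input-side layers are synchronized, halving that completion time translates directly into better computation-communication overlap; finer slicing would let urgent layers cut in even earlier by effectively preempting large in-flight transfers.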
Theoretical and Practical Implications
Theoretically, P3 offers a new perspective on communication-computation overlap in distributed ML systems, challenging the assumption in synchronous SGD frameworks that gradients should be communicated at the granularity and in the order in which they are produced. Practically, P3's ability to reduce communication bottlenecks without compromising model convergence makes it valuable for cloud and academic clusters operating under tight network bandwidth budgets.
The insights provided by this research extend beyond the specific implementation. They suggest a scalable synchronization approach that could potentially be adapted to other distributed training designs, including alternatives to the parameter-server architecture such as MPI-based collective communication. Furthermore, P3's scheduling could be combined with gradient compression, either improving throughput further or allowing less aggressive compression for the same communication budget, thereby limiting the accuracy losses inherent to lossy compression.
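As a purely hypothetical illustration of that combination, the snippet below pairs a simple top-k gradient sparsifier with a priority-aware enqueue step. The `top_k_compress` helper, the 1% default ratio, and the `enqueue` callback (for example, the `PriorityScheduler.enqueue` sketched earlier) are my assumptions, not an API from P3 or any compression library.

```python
# Hypothetical pairing of top-k gradient sparsification with priority scheduling.
import numpy as np

def top_k_compress(grad, ratio=0.01):
    """Keep only the largest-magnitude `ratio` fraction of entries;
    return (indices, values)."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def enqueue_compressed(enqueue, layer_idx, grad):
    """Compress a layer's gradient, then hand the much smaller payload to a
    priority-aware scheduler so urgent layers still go out first."""
    idx, vals = top_k_compress(grad)
    payload = np.concatenate([idx.astype(np.float64), vals])
    enqueue(layer_idx, payload)
```

Because prioritization reduces how long urgent parameters wait behind other traffic, a less aggressive compression ratio may suffice to hit a target iteration time, which is one way the combination could limit the accuracy cost of lossy compression.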
Future Research Directions
Moving forward, further work could automate slice-size selection within P3 so that it adapts dynamically to varying network conditions. Extending the evaluation to more diverse DNN architectures, training sets, and hyperparameter configurations would also give a broader picture of P3's applicability, and combining P3 with emerging hardware-centric acceleration techniques could further speed up distributed DNN training.
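As a starting point for such automation, a slice size could be derived from measured bandwidth and a target per-slice latency. The heuristic below is speculative, and its constants (a 2 ms latency target, the element bounds, 4-byte elements) are made-up values rather than anything from the paper.

```python
# Speculative heuristic for adapting the slice size to measured bandwidth.
def pick_slice_size(bandwidth_bytes_per_s, target_latency_s=0.002,
                    min_elems=1_024, max_elems=1_000_000, bytes_per_elem=4):
    """Choose a slice size so one slice transmits in roughly target_latency_s:
    small enough that urgent layers can preempt quickly, large enough to
    amortize per-message overhead."""
    elems = int(bandwidth_bytes_per_s * target_latency_s / bytes_per_elem)
    return max(min_elems, min(max_elems, elems))

# Example: on a ~10 Gbit/s link (about 1.25 GB/s), aim for ~2 ms slices.
print(pick_slice_size(1.25e9))  # -> 625000 elements
```

A dynamic version could re-run this calculation as bandwidth estimates change, which is roughly the kind of automation this future direction calls for.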
In conclusion, this research makes a substantial contribution to distributed machine learning, offering insights that are readily adaptable to current frameworks. P3 demonstrates the value of informed synchronization scheduling and opens new avenues for further innovation in ML training systems.