Performance Analysis and Comparison of Distributed Machine Learning Systems

Published 4 Sep 2019 in cs.DC | (1909.02061v1)

Abstract: Deep learning has permeated through many aspects of computing/processing systems in recent years. While distributed training architectures/frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on the computation cycle time and scalability. In order to analyze this problem for synchronous Stochastic Gradient Descent (SGD) training of deep learning models, we developed a performance model of computation time and communication latency under three different system architectures: Parameter Server (PS), peer-to-peer (P2P), and Ring allreduce (RA). To complement and corroborate our analytical models with quantitative results, we evaluated the computation and communication performance of these system architectures via experiments performed with the TensorFlow and Horovod frameworks. We found that the system architecture has a very significant effect on the performance of training. RA-based systems achieve scalable performance as they successfully decouple network usage from the number of workers in the system. In contrast, 1PS systems suffer from low performance due to network congestion at the parameter server side. While P2P systems fare better than 1PS systems, they still suffer from a significant network bottleneck. Finally, RA systems also excel by virtue of overlapping computation time and communication time, which PS and P2P architectures fail to achieve.

Summary

The paper demonstrates that the ring allreduce (RA) architecture significantly outperforms PS and P2P, both because its network usage does not grow with the number of workers and because it overlaps computation with communication.
The study employs analytical models and experiments using TensorFlow and Horovod with the MNIST dataset to benchmark latency and throughput.
The findings highlight RA's scalability and optimal bandwidth usage, suggesting promising directions for mitigating network congestion in distributed training.

This essay summarizes a comparative performance analysis of distributed machine learning architectures used for training deep learning models. The study evaluates the communication bottlenecks of three parallel training architectures, Parameter Server (PS), Peer-to-Peer (P2P), and Ring Allreduce (RA), and analyzes their impact on computation cycle time and scalability under synchronous Stochastic Gradient Descent (SGD).

Introduction to Distributed Training Architectures

Deep Neural Networks (DNNs) have made significant advances in a variety of tasks but require substantial computational resources and large datasets. Distributed training frameworks have been developed to spread training across multiple nodes. These frameworks must coordinate state and parameter sharing efficiently, which poses challenges in consistency, fault tolerance, and communication overhead.

The PS architecture involves workers that pull model parameters from a server, perform computations, and send updates back to the server. In contrast, the P2P model combines worker and server processes, facilitating local model updates and inter-peer exchanges. In the RA architecture, a ring-based communication pattern is employed, ensuring that gradients are efficiently shared among workers without centralized servers.
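To make the ring-based pattern concrete, the following is a minimal single-process sketch of ring allreduce, simulated with plain Python lists instead of real network transfers. It is not taken from the paper or from Horovod: the function name `ring_allreduce` is hypothetical, and the sketch assumes the gradient length divides evenly by the number of workers.

```python
from typing import List


def ring_allreduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-allreduce over a ring of workers.

    worker_grads[i] is the local gradient of worker i. The result holds one
    buffer per worker, each equal to the element-wise sum of all gradients.
    Assumption of this sketch: gradient length is divisible by worker count.
    """
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "sketch assumes length divisible by worker count"
    chunk = size // n
    bufs = [list(g) for g in worker_grads]  # copy so inputs stay untouched

    def seg(c: int) -> slice:
        # Index range covered by chunk c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all outgoing messages first so sends are simultaneous.
        msgs = [(i, (i - s) % n, bufs[i][seg((i - s) % n)]) for i in range(n)]
        for src, c, data in msgs:
            dst = (src + 1) % n
            bufs[dst][seg(c)] = [a + b for a, b in zip(bufs[dst][seg(c)], data)]

    # Worker i now holds the fully reduced chunk (i + 1) mod n.
    # Phase 2, allgather: in step s, worker i forwards chunk (i + 1 - s) mod n
    # to its right neighbour, which overwrites its stale copy.
    for s in range(n - 1):
        msgs = [(i, (i + 1 - s) % n, bufs[i][seg((i + 1 - s) % n)])
                for i in range(n)]
        for src, c, data in msgs:
            bufs[(src + 1) % n][seg(c)] = data

    return bufs
```

In this scheme each worker sends 2(N-1) chunks, each 1/N of the gradient, so per-worker traffic stays below twice the gradient size regardless of the number of workers N. For synchronous SGD, each worker would then divide the summed gradient by N before applying the update.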

Figure 1: Deep Neural Networks.

Performance Modeling of Architectures

The study addresses the performance of PS, P2P, and RA architectures through both analytical and empirical approaches. Analytical models estimate computation and communication times involved in training:

Parameter Server Architecture: Because all updates pass through a centralized server, PS is prone to communication bottlenecks. Its performance degrades when many workers update parameters concurrently, which congests the server's network link.
Figure 2: PS Architecture.
Peer-to-Peer Architecture: P2P systems benefit from localized model updates but still incur network latency whenever gradients must be synchronized across nodes.
Figure 3: P2P based Architecture.
Ring Allreduce Architecture: RA exhibits superior performance because it decouples network usage from the number of workers, achieving high throughput and scalable communication.
Figure 4: RA Architecture.
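The contrast between the three architectures can be illustrated with a rough bandwidth model. The sketch below is not the paper's analytical model. It is a simplified estimate under stated assumptions: PS parameters are split evenly across servers, and each server sends its shard to every worker. P2P is modeled as a naive all-to-all exchange in which every peer sends its full gradient to every other peer, and the paper's P2P design may differ. RA uses the standard reduce-scatter plus allgather schedule. The function name `per_step_traffic` and the dictionary keys are hypothetical.

```python
from typing import Dict


def per_step_traffic(model_bytes: float, n_workers: int,
                     n_servers: int = 1) -> Dict[str, float]:
    """Estimate bytes sent per training step by the busiest node.

    Simplified illustrative model only; see the assumptions in the text.
    """
    m, n = model_bytes, n_workers
    return {
        # Each server holds m / n_servers bytes of parameters and sends
        # that shard to all n workers, so its outgoing traffic grows with n.
        "ps_per_server": n * m / n_servers,
        # Naive all-to-all (assumption): each peer sends its full gradient
        # to the other n - 1 peers.
        "p2p_per_worker": (n - 1) * m,
        # Ring allreduce: 2 * (n - 1) messages of m / n bytes each, which
        # stays below 2 * m no matter how many workers join the ring.
        "ra_per_worker": 2 * (n - 1) / n * m,
    }
```

For example, calling `per_step_traffic(100.0, 4)` and then `per_step_traffic(100.0, 64)` shows the PS and all-to-all terms growing roughly linearly with the worker count, while the ring term approaches but never reaches 200 bytes.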

Experimental Evaluation

The paper conducts experimental evaluations using TensorFlow and Horovod frameworks, with the MNIST dataset as a benchmark. The study measures throughput (training samples per second) and latency for various configurations:

Parameter Server (PS): 1PS systems suffer from high latency because all workers share the single server's bandwidth. Using two or four parameter servers (2PS, 4PS) slightly alleviates congestion but does not substantially increase throughput, because communication remains centralized.
Figure 5: Estimated epoch time for 1PS.

Figure 6: Measured training throughput of 1PS.
Peer-to-Peer (P2P): P2P improves on PS because communication is distributed across peers, but it is limited by network saturation when scaling beyond a certain number of nodes.
Figure 7: Estimated epoch time for P2P system.

Figure 8: Measured training throughput of P2P system.
Ring Allreduce (RA): RA outperforms both PS and P2P in latency and throughput. Its decentralized design lets it overlap computation with communication effectively.
Figure 9: Estimated epoch time for RA.

Figure 10: Measured training throughput of RA.
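The effect of overlapping computation with communication on the measured quantities can be sketched with a small timing model. This is an illustrative assumption, not the paper's formulation: `overlap_fraction` is a hypothetical parameter giving the share of communication time that can be hidden behind computation, and both function names are made up for this example.

```python
def step_time(compute_s: float, comm_s: float,
              overlap_fraction: float = 0.0) -> float:
    """Time for one training step, in seconds.

    With no overlap the phases run back to back (compute + comm). With
    full overlap the step takes as long as the slower of the two phases.
    """
    # Communication can only be hidden while computation is still running.
    hidden = min(comm_s * overlap_fraction, compute_s)
    return compute_s + comm_s - hidden


def throughput(batch_per_worker: int, n_workers: int, step_s: float) -> float:
    """Training samples processed per second across all workers."""
    return batch_per_worker * n_workers / step_s
```

With 0.5 s of computation and 0.25 s of communication per step, `step_time(0.5, 0.25)` gives 0.75 s, while full overlap (`overlap_fraction=1.0`) brings it down to 0.5 s. Passing either value to `throughput` shows how hiding communication translates directly into more samples per second.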

Implications and Future Work

The RA architecture demonstrates a performance advantage in distributed deep learning tasks due to its efficient use of bandwidth and overlap of compute and communication phases. While PS suffers from network congestion, P2P is constrained by synchronization complexities. Future research could focus on reducing network congestion and optimizing computation-communication trade-offs, facilitating improved scalability and performance in distributed DNN training systems.

Conclusion

In conclusion, the comparative analysis reveals substantial performance disparities among different distributed machine learning architectures. RA's optimal bandwidth usage and inherent scalability make it a preferable choice for large-scale DNN training, while PS and P2P systems require careful configuration to mitigate communication bottlenecks. This work informs better choices in architectural design and tuning for practitioners deploying distributed machine learning systems.
