- The paper finds that distributed training reaches scaling factors of only 60-76% with 64 GPUs, even though the network remains underutilized.
- It employs detailed profiling on AWS with PyTorch and Horovod, revealing low network and CPU utilization even at 100 Gbps bandwidth.
- The study advocates shifting optimization effort to the network transport layer, which its what-if analysis suggests could raise scaling to nearly 99%.
Summary of "Is Network the Bottleneck of Distributed Training?"
The study in "Is Network the Bottleneck of Distributed Training?" investigates whether network capacity limits the linear scalability of distributed deep learning. It argues that, contrary to widespread assumption, the network is not the primary bottleneck for scaling distributed training, and that effort is better spent improving network transport efficiency than on application-layer strategies such as gradient compression.
Key Findings
The paper measures that a current distributed training stack, PyTorch with Horovod, reaches a scaling factor of only about 60-76% with 64 GPUs, even under the best available network configuration. Crucially, this sub-linear scaling does not stem from a shortage of bandwidth: the network itself remains underutilized throughout training.
Figure 1: Scaling factor vs. number of servers involved.
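For context, the scaling factor used throughout the paper is the measured aggregate throughput with n workers divided by n times a single worker's throughput. A minimal sketch of the calculation (the function and variable names are ours, not the paper's):

```python
def scaling_factor(throughput_n_workers: float,
                   throughput_one_worker: float,
                   n_workers: int) -> float:
    """Aggregate throughput (e.g. images/sec) with n workers relative to
    perfect linear scaling of one worker; 1.0 means perfectly linear."""
    return throughput_n_workers / (n_workers * throughput_one_worker)
```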
Profiling and Analysis
Detailed profiling of distributed training on AWS infrastructure with PyTorch and Horovod shows that simply increasing network speed does not yield proportional performance gains. Although the communication phase is what slows scaling, the network itself runs at low utilization even when 100 Gbps of bandwidth is available (a rough timing sketch follows Figure 2 below).
Figure 2: Computation time vs. number of servers.
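As an illustration, the sketch below separates per-iteration compute time from synchronization time in a PyTorch + Horovod training loop, using the public hvd.DistributedOptimizer API and synthetic data. It is a simplified stand-in for the paper's profiling setup, and because Horovod overlaps all-reduce with the backward pass, the split it reports is only approximate.

```python
import time
import torch
import torchvision.models as models
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Horovod launches gradient all-reduce from backward hooks; step() waits for it.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = torch.nn.CrossEntropyLoss()
data = torch.randn(32, 3, 224, 224).cuda()      # synthetic batch
target = torch.randint(0, 1000, (32,)).cuda()

for step in range(10):
    torch.cuda.synchronize()
    t0 = time.time()
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()              # compute, with all-reduce overlapped
    torch.cuda.synchronize()
    t1 = time.time()
    optimizer.step()             # blocks on any outstanding all-reduces
    torch.cuda.synchronize()
    t2 = time.time()
    if hvd.rank() == 0:
        print(f"step {step}: fwd/bwd {t1 - t0:.3f}s, sync+update {t2 - t1:.3f}s")
```

Launched with horovodrun across the GPUs of interest, a loop like this makes it visible that adding servers stretches the synchronization portion while the compute portion stays roughly flat.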
Alternative Bottlenecks
Contrary to common belief, the paper finds that neither network bandwidth nor the CPU is the prevailing bottleneck in its experimental setting. Measurements of network and CPU utilization during training show bandwidth usage well below link capacity and only moderate CPU load, even during communication-heavy phases (a simple monitoring sketch follows Figure 3 below).
Figure 3: Scaling factor change with bandwidth (ResNet50).
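To illustrate the kind of measurement involved, the sketch below samples host-level NIC throughput and CPU utilization once per second with psutil while training runs in other processes. The interface name and sampling interval are assumptions of ours; the paper does not prescribe this particular tool.

```python
import time
import psutil

IFACE = "eth0"                       # assumed NIC name; adjust per host
psutil.cpu_percent(interval=None)    # prime the CPU counter
prev = psutil.net_io_counters(pernic=True)[IFACE]

while True:
    time.sleep(1.0)
    cur = psutil.net_io_counters(pernic=True)[IFACE]
    tx_gbps = (cur.bytes_sent - prev.bytes_sent) * 8 / 1e9
    rx_gbps = (cur.bytes_recv - prev.bytes_recv) * 8 / 1e9
    cpu_pct = psutil.cpu_percent(interval=None)   # average since last call
    print(f"tx {tx_gbps:5.2f} Gb/s  rx {rx_gbps:5.2f} Gb/s  cpu {cpu_pct:3.0f}%")
    prev = cur
```

On a 100 Gbps instance, readings far below line rate during the communication phase are what the paper interprets as network underutilization.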
What-If Analysis
The paper further performs a "what-if" analysis that simulates perfect network utilization. It shows that if communication fully used the available bandwidth, distributed training could reach scaling factors near 99%, which weakens the case for application-level gradient compression in high-bandwidth settings (a back-of-the-envelope sketch follows the figure below).
Figure 4: Simulated vs. measured scaling factor at different bandwidths.
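The flavor of this estimate can be reproduced with a back-of-the-envelope calculation that replaces measured communication time with the ideal ring all-reduce time at full link bandwidth. The sketch below is a simplification of ours: it ignores the overlap of communication with the backward pass and uses the textbook ring all-reduce traffic volume, so it is not the paper's simulation and understates what that simulation achieves.

```python
def ideal_scaling_factor(t_compute_s: float, model_bytes: float,
                         n_workers: int, link_gbps: float) -> float:
    """Scaling factor if gradient all-reduce ran at full link bandwidth.
    Ring all-reduce sends roughly 2 * (n - 1) / n * model_bytes per worker."""
    bytes_on_wire = 2 * (n_workers - 1) / n_workers * model_bytes
    t_comm_ideal_s = bytes_on_wire * 8 / (link_gbps * 1e9)
    return t_compute_s / (t_compute_s + t_comm_ideal_s)

# Hypothetical numbers: ~100 MB of fp32 ResNet50 gradients, 0.2 s of compute
# per iteration, 8 workers on 100 Gbps links.
print(f"{ideal_scaling_factor(0.2, 100e6, 8, 100):.3f}")
```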
Implications and Future Directions
Shift in Optimization Focus
The study underscores that better network-layer transport could substantially improve distributed training throughput without sacrificing model accuracy or adding algorithmic complexity. It argues that, except in low-bandwidth environments, effort is better spent on designing efficient network transport than on application-layer optimizations such as gradient compression.
Broader Impact
This research emphasizes the crucial role of efficient network software in distributed AI systems, pointing toward kernel-bypass techniques and high-performance network transport implementations that can fully exploit the bandwidth of modern high-speed networks.
Conclusion
"Is Network the Bottleneck of Distributed Training?" establishes a clear narrative that the inefficiencies in distributed training are primarily due to suboptimal network transport implementations rather than inherent bandwidth limitations. It stresses the potential gains from harnessing high-performance network technologies that better utilize existing network infrastructure, pushing towards nearly linear scalability in distributed deep learning frameworks.