Horovod: fast and easy distributed deep learning in TensorFlow (1802.05799v3)

Published 15 Feb 2018 in cs.LG and stat.ML

Abstract: Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod

Citations (1,150)

Summary

  • The paper's main contribution is the development of Horovod, which simplifies distributed training by requiring minimal code modifications in TensorFlow.
  • It employs the ring-allreduce algorithm and Tensor Fusion to significantly reduce communication overhead and improve performance by up to 65%.
  • Integration with NVIDIA NCCL improves collective-communication performance, while the Horovod Timeline tool aids debugging; benchmarks show roughly 88% scaling efficiency in multi-GPU training.

Horovod: Fast and Easy Distributed Deep Learning in TensorFlow

The paper "Horovod: fast and easy distributed deep learning in TensorFlow" by Alexander Sergeev and Mike Del Balso details the development and implementation of Horovod, an open-source library that enhances distributed deep learning capabilities in TensorFlow. The authors address significant issues related to multi-GPU training, namely inter-GPU communication overhead and the extensive code modifications typically required.

Motivation and Background

The motivation for Horovod stems from the inefficiencies observed with the standard distributed TensorFlow package. For many researchers at Uber, the steep learning curve and communication overhead inherent in TensorFlow's distributed training API created barriers to efficient multi-GPU training. The authors note that their benchmarks revealed substantial GPU resource underutilization, particularly when scaling up to 128 GPUs, where nearly half of the resources were wasted.

In response, Sergeev and Del Balso drew on Facebook's demonstration that a ResNet-50 network could be trained in one hour on 256 GPUs using a data-parallel approach combined with careful learning-rate adjustments. This result guided Uber's team toward more efficient algorithms for distributed training and ultimately to the ring-allreduce algorithm as an effective solution.

Key Contributions

Horovod's primary contributions are encapsulated in several technical implementations and improvements, which the authors outline thoroughly:

  1. Simplified API: Horovod significantly reduces the complexity of converting a single-GPU training script to a distributed one by requiring only minimal code changes. The critical steps are initializing Horovod (hvd.init()), pinning each process to a GPU (config.gpu_options.visible_device_list = str(hvd.local_rank())), wrapping the TensorFlow optimizer (opt = hvd.DistributedOptimizer(opt)), and broadcasting initial variable values (hvd.BroadcastGlobalVariablesHook(0)); a sketch of these steps appears after this list.
  2. Ring-Allreduce Algorithm: The paper explains the benefits of employing the ring-allreduce algorithm for gradient averaging. This method avoids the network saturation that plagues the parameter-server approach and yields bandwidth-optimal communication patterns; an illustrative simulation also follows this list.
  3. NCCL Integration: Horovod integrates with NVIDIA's NCCL, which offers optimized performance for collective communication operations. The transition to NCCL 2 allows Horovod to operate efficiently across multiple machines.
  4. Tensor Fusion: To address the inefficiency of reducing many small tensors, the authors introduced Tensor Fusion, which batches small tensors into a larger buffer before ring-allreduce is performed. This optimization yielded performance improvements of up to 65% on certain models over TCP networks.
  5. Horovod Timeline: For debugging and performance profiling, Horovod provides a tool called Horovod Timeline, which visualizes the state of each worker node during training. This tool is compatible with Chrome's trace event profiling viewer and helps users diagnose and improve their distributed training jobs.
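
The following is a minimal sketch of the four API steps from item 1, using the TF1-era Horovod calls the paper describes; the toy model, learning-rate value, and step count are placeholders for illustration, not taken from the paper.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize Horovod (one process per GPU, coordinated via MPI).
hvd.init()

# 2. Pin this process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model (placeholder): fit y = 2x with a single scalar weight.
x = tf.random_normal([1000, 1])
y_true = 2.0 * x
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y_true))

# 3. Wrap the optimizer so gradients are averaged across workers with
#    ring-allreduce; scaling the learning rate by the worker count follows
#    the data-parallel recipe referenced earlier.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# 4. Broadcast initial variable states from rank 0 so all workers start
#    from identical parameters.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(100):
        sess.run(train_op)
```

Such a script is launched with one process per GPU, e.g. mpirun -np 4 python train.py; setting the HOROVOD_TIMELINE environment variable to a file path makes Horovod record the timeline discussed in item 5.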

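As an illustration of the ring-allreduce pattern from item 2, here is a small single-process NumPy simulation (not Horovod's actual MPI/NCCL implementation): each simulated worker splits its tensor into N chunks, a reduce-scatter pass sums each chunk around the ring, and an allgather pass circulates the reduced chunks so every worker ends with the full sum.

```python
import numpy as np

def ring_allreduce(worker_tensors):
    """Single-process simulation: every 'worker' ends with the element-wise sum."""
    n = len(worker_tensors)
    # Each worker splits its tensor into n chunks.
    chunks = [np.array_split(t.astype(float), n) for t in worker_tensors]

    # Reduce-scatter: after n-1 steps, worker i holds the fully summed
    # chunk with index (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for src, idx, data in sends:
            chunks[(src + 1) % n][idx] += data

    # Allgather: circulate each fully summed chunk around the ring so that
    # after another n-1 steps every worker holds every summed chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for src, idx, data in sends:
            chunks[(src + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]

# Demo: 4 workers, each holding a different 8-element gradient.
grads = [np.arange(8.0) * (rank + 1) for rank in range(4)]
for result in ring_allreduce(grads):
    print(result)  # each worker prints [ 0. 10. 20. 30. 40. 50. 60. 70.]
```

Each worker transfers only about 2(N-1)/N times the tensor size in total, independent of the number of workers, which is the bandwidth-optimal property the paper highlights.
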
Benchmarks and Performance

The authors conducted extensive benchmarking to validate Horovod's efficacy. Compared with standard distributed TensorFlow, Horovod demonstrated marked improvements in scalability and resource utilization: it achieved roughly 88% scaling efficiency, effectively doubling training speed over standard distributed TensorFlow.
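
For context on how a scaling-efficiency figure such as 88% is typically computed, the following shows the standard definition; the throughput numbers are purely hypothetical and are not results from the paper.

```python
def scaling_efficiency(multi_gpu_throughput, single_gpu_throughput, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return multi_gpu_throughput / (single_gpu_throughput * num_gpus)

# Hypothetical illustration: if a single GPU sustains 200 images/sec and
# 128 GPUs together sustain 22,500 images/sec, the scaling efficiency is
# 22,500 / (200 * 128) ~= 0.88, i.e. about 88%.
print(scaling_efficiency(22_500, 200, 128))
```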

Furthermore, testing with RDMA-capable networking showed additional gains, particularly for models such as VGG-16, which saw a 30% speedup once RDMA alleviated its communication bottleneck.

Implications and Future Work

The practical implications of Horovod are extensive. By reducing the barrier to adopting distributed training, Horovod enables more efficient use of GPU resources, thereby shortening model training times and accelerating research and deployment cycles. This is particularly beneficial in industrial settings like Uber, where large-scale data and model training play a crucial role in various applications from autonomous driving to fraud prevention.

The authors also outline ongoing and future work that includes simplifying MPI installation, sharing strategies for model hyperparameter adjustment in distributed environments, and expanding support for very large models that span multiple GPUs across servers.

Conclusion

Horovod represents a significant advancement in the field of distributed deep learning, addressing both the performance and usability challenges associated with multi-GPU training in TensorFlow. By integrating efficient communication algorithms and providing a user-friendly interface, Horovod facilitates faster and more scalable training processes. This work will likely serve as a foundational tool for researchers and practitioners aiming to leverage distributed computing for deep learning tasks. The continued development and community engagement around Horovod promise further optimizations and broader adoption in the future.
