TicTac: Accelerating Distributed Deep Learning with Communication Scheduling (1803.03288v2)

Published 8 Mar 2018 in cs.DC, cs.LG, and cs.PF

Abstract: State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. The iteration time in these communication-heavy systems depends on the computation time, communication time and the extent of overlap of computation and communication. In this work, we identify a shortcoming in systems with graph representation for computation, such as TensorFlow and PyTorch, that result in high variance in iteration time --- random order of received parameters across workers. We develop a system, TicTac, to improve the iteration time by fixing this issue in distributed deep learning with Parameter Servers while guaranteeing near-optimal overlap of communication and computation. TicTac identifies and enforces an order of network transfers which improves the iteration time using prioritization. Our system is implemented over TensorFlow and requires no changes to the model or developer inputs. TicTac improves the throughput by up to $37.7\%$ in inference and $19.2\%$ in training, while also reducing straggler effect by up to $2.3\times$. Our code is publicly available.

Citations (184)

Summary

  • The paper introduces TicTac, a communication scheduling framework addressing performance challenges in distributed deep learning systems by optimizing computation-communication overlap.
  • TicTac proposes resource-aware scheduling via the TIC and TAC heuristics, demonstrating up to a 37.7% throughput improvement and up to a 2.3× reduction in the straggler effect.
  • This approach improves efficiency and reduces training times for large-scale AI workloads, offering significant practical benefits in cloud computing environments.

An Insightful Overview of "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling"

The paper "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling" by Hashemi et al. addresses the performance challenges of distributed deep learning systems by introducing a communication scheduling framework called TicTac. As the size and complexity of deep learning models continue to grow, efficient distributed training becomes crucial. The work's contribution lies in improving the iteration times of such systems by optimizing the overlap of computation and communication processes, which are typically represented using computational graphs in platforms like TensorFlow and PyTorch.

Core Contributions

The paper identifies an inherent inefficiency in how current deep learning systems schedule parameter transfers, which can lead to high variability in iteration times. This variability stems primarily from the arbitrary order in which parameters are received across workers, a challenge not adequately addressed by existing solutions, especially in modern systems built on directed acyclic graph (DAG) representations of the computation.
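To make the ordering problem concrete, here is a minimal toy simulation (an illustration under simplified assumptions, not the paper's model or code): two equally sized parameters share one link, and the forward pass needs `w1` before `w2`. If the parameter server happens to send `w2` first, the first layer stalls behind a transfer it does not yet need, and the iteration takes longer.

```python
# Toy illustration of why transfer order affects iteration time
# (illustrative constants and names; not the paper's model).
TRANSFER = 10.0   # ms to send one parameter over the shared link
COMPUTE = 10.0    # ms to run each layer once its parameter has arrived

def iteration_time(order):
    t_link, finish = 0.0, {}
    for p in order:                        # transfers are serialized on the link
        t_link += TRANSFER
        finish[p] = t_link
    t = finish["w1"] + COMPUTE             # layer1 waits for w1
    t = max(t, finish["w2"]) + COMPUTE     # layer2 waits for layer1 and w2
    return t

print("w1 first:", iteration_time(["w1", "w2"]))   # overlapped: 30.0 ms
print("w2 first:", iteration_time(["w2", "w1"]))   # stalled:   40.0 ms
```

Across many iterations, a worker that randomly alternates between these two orders exhibits exactly the kind of iteration-time variance that the paper attributes to unscheduled parameter transfers.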

TicTac makes the following key contributions:

  1. Performance Optimization in DAG-Based Systems: The paper establishes the inefficiency of random parameter transfer ordering in DAG-based systems and proposes a resource-aware scheduling solution.
  2. Scheduling Efficiency Metric: It introduces a novel metric to evaluate the scheduling efficiency, allowing for quantitative comparison of different schedules.
  3. Heuristic Algorithms for Scheduling: The development of two heuristics—TIC (Timing-Independent Communication Scheduling) and TAC (Timing-Aware Communication Scheduling)—provides near-optimal schedules for parameter transfers that enhance computation-communication overlap; a minimal sketch of a TIC-style ordering follows this list.
  4. Implementation and Evaluation: The authors implement TicTac over TensorFlow, demonstrating its capability to improve throughput by up to 37.7% in inference and 19.2% in training in the environments tested.
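The paper's exact algorithms are not reproduced here, but the following Python sketch conveys the flavor of a timing-independent (TIC-style) ordering under simplified assumptions: each parameter transfer is prioritized by how early its first consumer appears in a topological order of the computation DAG, so parameters needed at the start of the forward pass are requested first. The `Op` class, `toy_graph`, and function names are illustrative, not the authors' API.

```python
# Illustrative TIC-style priority assignment (a sketch, not the TicTac code).
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    name: str
    deps: List[str] = field(default_factory=list)  # upstream ops this op waits on
    param: Optional[str] = None                    # parameter this op reads, if any


def topological_order(ops):
    """Kahn's algorithm: return ops in a valid execution order."""
    by_name = {op.name: op for op in ops}
    indeg = {op.name: len(op.deps) for op in ops}
    children = {op.name: [] for op in ops}
    for op in ops:
        for d in op.deps:
            children[d].append(op.name)
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(by_name[n])
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return order


def tic_like_priorities(ops):
    """Priority of a parameter = position of its first reader in topological order.

    A lower value means the transfer should be issued earlier, so parameters
    consumed at the start of the forward pass arrive first.
    """
    priorities = {}
    for pos, op in enumerate(topological_order(ops)):
        if op.param is not None and op.param not in priorities:
            priorities[op.param] = pos
    return priorities


# Toy two-layer graph: layer1 reads w1; layer2 runs after layer1 and reads w2.
toy_graph = [
    Op("read_w1", param="w1"),
    Op("layer1", deps=["read_w1"]),
    Op("read_w2", param="w2"),
    Op("layer2", deps=["layer1", "read_w2"]),
]

for param, prio in sorted(tic_like_priorities(toy_graph).items(), key=lambda kv: kv[1]):
    print(f"issue transfer of {param} with priority {prio}")
```

In the paper's terms, TAC refines this kind of ordering by additionally accounting for estimated computation and communication times; TicTac then enforces the chosen order inside TensorFlow, a mechanism not reproduced in this sketch.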

Numerical Results and Implications

The quantitative improvements presented in the paper are significant. TicTac's ability to increase throughput and to mitigate the straggler effect (a reduction of up to 2.3 times) illustrates its practical efficacy. The evaluation covers a range of models and configurations. The authors note that even a small reduction in communication overhead can substantially shorten training time in long-running jobs, an assertion backed by their empirical findings.
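As an illustrative back-of-the-envelope calculation (numbers assumed for the example, not taken from the paper): a job of $10^5$ iterations at $500$ ms per iteration runs for roughly $13.9$ hours, so a $19\%$ per-iteration speedup saves about

$$0.19 \times 10^5 \times 0.5\,\text{s} \approx 9.5\times 10^3\,\text{s} \approx 2.6\ \text{hours}$$

of training time on that single job.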

Theoretical and Practical Implications

TicTac's approach to scheduling emerges as a notable development for both academia and industry. Theoretically, the work emphasizes the importance of communication-computation overlap, a dimension often underestimated in performance optimization studies. The paper could also influence future research toward more sophisticated scheduling algorithms, possibly ones that incorporate real-time network variations.

Practically, TicTac's adoption can lead to considerable reductions in training times, especially in large-scale AI workloads that distribute models across multiple devices. This improvement translates to cost savings and energy efficiency, key concerns in cloud computing environments.

Future Directions

While TicTac demonstrates promising results, it opens several avenues for further exploration:

  • Incorporation of dynamic network conditions and real-time adjustments in the scheduling heuristics.
  • Extension of the scheduling mechanism to other distributed training paradigms, such as all-reduce-based frameworks like Horovod and decentralized architectures.
  • Exploration of multi-resource scheduling that synergizes network optimization with other computational resources like memory and disk I/O.

In conclusion, the paper presents a compelling case for the necessity and efficacy of communication scheduling in distributed deep learning systems. The work sets a foundation that subsequent studies could build upon, aiming towards more adaptive and efficient distributed training solutions in increasingly complex and resource-intensive AI applications.