- The paper's main contribution is a scalable synchronous SGD framework for distributed training that achieves up to 90-fold speedup on multi-node clusters.
- The methodology leverages balance equations, cache blocking, and SIMD vectorization strategies to enhance CPU-based training efficiency.
- The framework offers a viable alternative to GPU-based systems for efficiently training CNNs and ASR networks on CPU clusters, paving the way for cost-effective large-scale deep learning.
Distributed Deep Learning Using Synchronous Stochastic Gradient Descent
The paper "Distributed Deep Learning Using Synchronous Stochastic Gradient Descent" presents a comprehensive paper and implementation framework intended for distributed training of deep neural networks (DNNs), particularly focusing on Convolutional Neural Networks (CNNs) and Automatic Speech Recognition (ASR) networks. The primary contribution lies in enhancing synchronous Stochastic Gradient Descent (SGD) scaling across multiple nodes without modifying hyperparameters, applying compression techniques, or altering algorithmic procedures.
Core Findings
The authors present a scaling analysis of deep learning workloads built around CNNs and demonstrate strong training throughput. Specifically, they report 90-fold scaling on a 128-node cluster for the VGG-A CNN with a minibatch size of 512. Other significant results include 53-fold and 42-fold scaling for the VGG-A and OverFeat-FAST networks, respectively, on a 64-node setup. The paper also reports 6.5-fold scaling on a smaller 16-node cluster for a 7-layer ASR DNN, showing that the approach generalizes beyond CNNs.
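For context, dividing each reported speedup by its node count gives the implied parallel efficiency relative to ideal linear scaling; the snippet below simply does that arithmetic (the efficiency framing is mine, not the paper's reporting convention).

```python
# Parallel efficiency implied by the reported speedups (speedup / node count).
reported = {
    "VGG-A, 128 nodes":          (90.0, 128),
    "VGG-A, 64 nodes":           (53.0, 64),
    "OverFeat-FAST, 64 nodes":   (42.0, 64),
    "7-layer ASR DNN, 16 nodes": (6.5, 16),
}
for name, (speedup, nodes) in reported.items():
    print(f"{name}: {speedup / nodes:.0%} of linear scaling")
# VGG-A on 128 nodes, for example, works out to roughly 70% of linear.
```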
The paper also reports strong single-node efficiency, achieving approximately 90% efficiency for convolutional layers and 70% for fully connected layers on Intel Xeon processors. These results are attributed to a systematic treatment of balance equations, careful use of CPU caches, memory bandwidth optimization, SIMD vectorization, and register blocking.
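The balance-equation idea can be illustrated with a back-of-the-envelope check: a layer can approach compute peak only if its arithmetic intensity (FLOPs per byte moved to and from memory) exceeds the machine's compute-to-bandwidth ratio; otherwise it is bandwidth-bound and blocking is needed to raise its effective intensity. The sketch below uses placeholder hardware numbers and a hypothetical layer shape, not the paper's model or measurements.

```python
# Hedged sketch of a compute-vs-bandwidth "balance" check for one conv layer.
# Machine numbers and the layer shape are placeholders, not the paper's figures.
PEAK_FLOPS = 2.4e12      # hypothetical per-node peak, FLOP/s
PEAK_BW    = 68e9        # hypothetical memory bandwidth, bytes/s
machine_balance = PEAK_FLOPS / PEAK_BW   # FLOPs the machine can do per byte moved

def conv_layer_intensity(c_in, c_out, k, h, w, minibatch, bytes_per_elem=4):
    """Arithmetic intensity (FLOPs per byte) of a direct convolution,
    assuming inputs, outputs, and weights each cross memory exactly once."""
    flops = 2.0 * minibatch * c_in * c_out * k * k * h * w
    traffic = bytes_per_elem * (minibatch * (c_in + c_out) * h * w
                                + c_in * c_out * k * k)
    return flops / traffic

ai = conv_layer_intensity(c_in=64, c_out=128, k=3, h=56, w=56, minibatch=32)
bound = "compute-bound" if ai > machine_balance else "bandwidth-bound"
print(f"arithmetic intensity ~{ai:.0f} FLOP/byte vs machine balance "
      f"~{machine_balance:.0f} -> layer is {bound}")
```

Convolutional layers typically land well above the machine balance (hence the ~90% efficiency), while fully connected layers have much lower reuse per byte, which is consistent with the lower 70% figure.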
Practical and Theoretical Implications
The paper has important implications for designing and optimizing large-scale neural network training. The analysis and results offer valuable insight into efficient CPU-based training of deep neural networks and suggest a potential shift from GPU-centric solutions toward CPU-based distributed systems, which are generally more accessible and cost-effective, provided the reported scaling holds beyond the workloads studied here.
The theoretical contributions include careful balance equations for computation and communication, along with a potential hybrid approach combining data and model parallelism. The detailed investigation of cache blocking and of work partitioning across nodes also points to directions for further study of the scaling limits of data parallelism.
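One way to see where pure data parallelism runs out of steam is a simple cost model: per-node compute time shrinks as nodes are added, while the gradient allreduce cost stays roughly constant, so communication eventually dominates. Every constant below is an illustrative assumption rather than a measurement from the paper, and no communication/computation overlap is modeled.

```python
# Toy cost model for the data-parallel scaling limit of a VGG-A-like network.
# All constants are assumptions for illustration, not values from the paper.
FLOPS_PER_SAMPLE = 3 * 7.6e9        # assumed fwd+bwd FLOPs per image
NODE_FLOPS       = 2.4e12           # assumed sustained FLOP/s per node
MODEL_BYTES      = 133e6 * 4        # ~133M fp32 parameters
NET_BW           = 6e9              # assumed effective allreduce bandwidth, bytes/s
GLOBAL_MINIBATCH = 512

def step_time(nodes):
    compute = (GLOBAL_MINIBATCH / nodes) * FLOPS_PER_SAMPLE / NODE_FLOPS
    comm = 2 * MODEL_BYTES / NET_BW   # a ring allreduce moves ~2x the model size
    return compute, comm

single_node = step_time(1)[0]
for nodes in (16, 64, 128, 256):
    compute, comm = step_time(nodes)
    speedup = single_node / (compute + comm)   # pessimistic: no overlap assumed
    print(f"{nodes:3d} nodes: compute {compute*1e3:7.1f} ms, "
          f"comm {comm*1e3:5.1f} ms, speedup ~{speedup:5.1f}x")
```

Under these no-overlap assumptions communication dominates well before 128 nodes; techniques such as overlapping communication with backpropagation and applying model parallelism to communication-heavy layers, as in the hybrid scheme discussed above, are what push this limit out.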
Future Directions
Given the strong reported results, future work could explore distributed deep learning beyond the CNN and ASR architectures studied here, assess the framework's adaptability to emerging hardware such as AI accelerators and novel interconnects, and examine how increased model complexity affects training efficiency under this scaling framework.
The paper serves as a compelling reference for exploring advanced parallelism strategies in deep learning, particularly those using synchronous SGD, and may inspire further investigation into multi-node optimizations and frameworks.