Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training (1712.01887v3)

Published 5 Dec 2017 in cs.CV, cs.DC, cs.LG, and stat.ML

Abstract: Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile. Code is available at: https://github.com/synxlin/deep-gradient-compression.

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

In the paper "Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training," the authors address a critical bottleneck in large-scale distributed training: the significant communication bandwidth required for gradient exchange. Synchronous SGD, a widely used method for distributed training, often incurs high communication costs that can dwarf the computational savings obtained by parallelizing processing across multiple nodes. This issue is exacerbated in federated learning scenarios, where mobile devices with unreliable and slow network connections participate in distributed training.

Key Techniques

The authors introduce Deep Gradient Compression (DGC), a framework that drastically reduces communication bandwidth without sacrificing model accuracy. DGC employs several key techniques (a simplified sketch follows the list):

  1. Gradient Sparsification: Only significant gradients (ones with magnitudes above a threshold) are transmitted, reducing the volume of data exchanged.
  2. Momentum Correction: Accumulates momentum (velocity) locally rather than the raw gradient, so that sparse, infrequent updates preserve the convergence behavior of dense momentum SGD.
  3. Local Gradient Clipping: Prevents the explosion of gradients by clipping them before local accumulation.
  4. Momentum Factor Masking: Prevents stale momentum by masking momentum updates for delayed gradients.
  5. Warm-up Training: Gradually increases gradient sparsity at early training stages to adaptively handle the rapid changes in network parameters.
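
A minimal single-worker sketch of how these techniques combine is given below, using NumPy. It is a simplification under stated assumptions, not the authors' implementation: the paper additionally scales the clipping threshold with the number of workers and uses efficient top-k selection and sparse encoding, while `clip_norm`, the warm-up constants, and the toy usage here are illustrative choices.

```python
import numpy as np

def dgc_step(grad, velocity, residual, sparsity=0.999,
             momentum=0.9, clip_norm=1.0):
    """One simplified worker-side Deep Gradient Compression step.

    grad     -- this iteration's local gradient
    velocity -- local momentum accumulator (momentum correction)
    residual -- locally accumulated values not yet transmitted
    Returns (sparse_update_to_send, velocity, residual).
    """
    # Local gradient clipping: rescale the gradient before it enters the
    # accumulators to avoid exploding accumulated gradients.
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)

    # Momentum correction: accumulate velocity locally, then add the velocity
    # (not the raw gradient) to the residual.
    velocity = momentum * velocity + grad
    residual = residual + velocity

    # Gradient sparsification: transmit only the top-k largest-magnitude entries.
    k = max(1, int(residual.size * (1.0 - sparsity)))
    threshold = np.partition(np.abs(residual).ravel(), -k)[-k]
    mask = np.abs(residual) >= threshold
    sparse_update = np.where(mask, residual, 0.0)

    # Momentum factor masking: zero momentum and residual at transmitted
    # positions so stale momentum stops pushing directions already sent.
    velocity = np.where(mask, 0.0, velocity)
    residual = np.where(mask, 0.0, residual)
    return sparse_update, velocity, residual


def warmup_sparsity(epoch, warmup_epochs=4, final_sparsity=0.999):
    # Warm-up training: ramp sparsity exponentially over the first epochs
    # (75%, 93.75%, 98.44%, 99.6%) before settling at the final value.
    if epoch >= warmup_epochs:
        return final_sparsity
    return 1.0 - 0.25 ** (epoch + 1)


# Toy usage: per-worker state persists across iterations.
grad = np.random.randn(10_000)
velocity, residual = np.zeros_like(grad), np.zeros_like(grad)
update, velocity, residual = dgc_step(grad, velocity, residual,
                                      sparsity=warmup_sparsity(epoch=0))
print(f"nonzero entries sent: {np.count_nonzero(update)} of {update.size}")
```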

Experimental Verification

The authors validate DGC across several neural network architectures (CNNs and RNNs) and tasks:

  • Image Classification: Using ResNet-110 on Cifar10, and AlexNet and ResNet-50 on ImageNet.
  • Language Modeling: Using a 2-layer LSTM on the Penn Treebank dataset.
  • Speech Recognition: Using DeepSpeech on AN4 and Librispeech corpora.

The experiments demonstrate that DGC achieves gradient compression ratios ranging from 270x to 600x without compromising model accuracy. For instance, DGC reduces the gradient size of ResNet-50 from 97MB to 0.35MB and of DeepSpeech from 488MB to 0.74MB. The detailed results are summarized as follows:

Image Classification

  • Cifar10: DGC preserved or even slightly improved the accuracy across different batch sizes and scales.
  • ImageNet: DGC achieves 597x compression on AlexNet and 277x on ResNet-50 while maintaining the training accuracy close to the baseline.

Language Modeling

  • Penn Treebank: DGC maintained the perplexity close to the baseline while achieving a 462x compression rate.

Speech Recognition

  • LibriSpeech: DGC showed slight improvements in word error rate on both clean and noisy speech datasets while achieving 608x compression.
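
As a rough cross-check of the quoted figures, dividing the dense gradient size by the compressed size (using the sizes from the abstract) recovers ratios in the reported range; the exact per-model ratios in the paper also depend on how sparse indices and values are encoded, so this division is only approximate.

```python
# Compression ratio ~= dense gradient size / compressed gradient size,
# using the sizes quoted in the abstract (approximate cross-check only).
for name, dense_mb, compressed_mb in [("ResNet-50", 97.0, 0.35),
                                      ("DeepSpeech", 488.0, 0.74)]:
    print(f"{name}: ~{dense_mb / compressed_mb:.0f}x compression")
```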

Implications and Future Directions

The results suggest significant implications for both practical and theoretical aspects of distributed training:

  1. Scalability: By reducing the required communication bandwidth, DGC enables training across a larger number of nodes even on inexpensive, low-bandwidth networks, such as 1Gbps Ethernet.
  2. Accessibility: DGC facilitates federated learning on mobile devices by mitigating the adverse effects of limited and intermittent network connectivity.
  3. Efficiency: DGC considerably reduces the operational costs associated with bandwidth and improves the speedup and scalability of distributed training frameworks.
  4. Robustness: By preserving training accuracy across a wide range of compression ratios and architectures, DGC establishes a robust groundwork for further research into efficient communication strategies in distributed machine learning systems.

Conclusion

The paper puts forth Deep Gradient Compression as an efficient, scalable method for reducing the communication bandwidth in distributed training without accuracy loss. The extensive empirical validation across various neural network models and datasets demonstrates that DGC can achieve compression ratios of up to 600x, significantly alleviating the communication bottleneck. Future research may investigate applying DGC broadly across diverse neural architectures and further optimize its implementation for specific hardware and network configurations.

Authors (5)
  1. Yujun Lin (23 papers)
  2. Song Han (155 papers)
  3. Huizi Mao (13 papers)
  4. Yu Wang (939 papers)
  5. William J. Dally (21 papers)
Citations (1,323)