AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training (1712.02679v1)

Published 7 Dec 2017 in cs.LG and stat.ML

Abstract: Highly distributed training of Deep Neural Networks (DNNs) on future compute platforms (offering 100's of TeraOps/s of computational capacity) is expected to be severely communication constrained. To overcome this limitation, new gradient compression techniques are needed that are computationally friendly, applicable to a wide variety of layers seen in Deep Neural Networks and adaptable to variations in network architectures as well as their hyper-parameters. In this paper we introduce a novel technique - the Adaptive Residual Gradient Compression (AdaComp) scheme. AdaComp is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity. We show excellent results on a wide spectrum of state of the art Deep Learning models in multiple domains (vision, speech, language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers (SGD with momentum, Adam) and network parameters (number of learners, minibatch-size etc.). Exploiting both sparsity and quantization, we demonstrate end-to-end compression rates of ~200X for fully-connected and recurrent layers, and ~40X for convolutional layers, without any noticeable degradation in model accuracies.

Authors (6)
  1. Chia-Yu Chen (7 papers)
  2. Jungwook Choi (28 papers)
  3. Daniel Brand (4 papers)
  4. Ankur Agrawal (10 papers)
  5. Wei Zhang (1489 papers)
  6. Kailash Gopalakrishnan (12 papers)
Citations (168)

Summary

Overview of AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training

The paper "AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training" introduces a novel approach to enhance the efficiency of distributed deep learning (DL) training by addressing communication bottlenecks through an innovative gradient compression technique. The authors propose the AdaComp scheme, which adaptively compresses gradient residues to mitigate the communication constraints prevalent in highly distributed systems.

The central proposition of this research is the AdaComp technique, which dynamically adjusts the compression rate of gradient residues based on local activity. This method achieves significant compression without degrading the model's accuracy. The authors demonstrate the efficacy of AdaComp across a variety of DL models, datasets, and optimization methods, showcasing its robustness and universality.
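A minimal sketch of this localized selection, under assumed details, is shown below: gradient residues accumulate locally, the latest gradient is added once more as a self-adjustment, and within each fixed-length bin only entries whose adjusted magnitude reaches the bin's maximum residue are marked for transmission, while everything else remains in the residue. The function name, the default bin size, and the NumPy packaging are illustrative choices rather than the paper's reference implementation; in the actual scheme the selected values would additionally be quantized before exchange.

```python
import numpy as np

def adacomp_select(residue, gradient, bin_size=50):
    """Bin-local residual-gradient selection in the spirit of AdaComp (sketch).

    residue:  accumulated, not-yet-sent gradient residues (1-D float array)
    gradient: latest minibatch gradient for the same parameters
    bin_size: the single tunable bin length (50 is an illustrative default)
    """
    residue = residue + gradient          # fold the new gradient into the residue
    boosted = residue + gradient          # self-adjust: re-emphasize recent activity
    send_idx = []
    for start in range(0, residue.size, bin_size):
        chunk = slice(start, min(start + bin_size, residue.size))
        local_max = np.abs(residue[chunk]).max()
        # keep entries whose boosted magnitude reaches the bin-local maximum
        picked = start + np.nonzero(np.abs(boosted[chunk]) >= local_max)[0]
        send_idx.extend(picked.tolist())
    send_idx = np.asarray(send_idx, dtype=np.int64)
    sent_values = residue[send_idx]       # these would be quantized and exchanged
    residue[send_idx] = 0.0               # clear transmitted entries; keep the rest
    return send_idx, sent_values, residue

# Example: three training steps over one layer's flattened gradients
rng = np.random.default_rng(0)
res = np.zeros(10_000)
for step in range(3):
    grad = rng.normal(scale=1e-3, size=res.shape)
    idx, vals, res = adacomp_select(res, grad)
    print(f"step {step}: sending {idx.size} of {res.size} entries")
```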

Key Contributions

  1. Compression Scheme Evaluation: The paper provides a critical evaluation of existing gradient compression methods, highlighting their limitations in handling the diversity seen in typical neural networks. The authors point out that prior schemes largely focus on fully-connected (FC) layers and fall short when applied to a mix of layer types that include convolutional and recurrent layers.
  2. Adaptive Compression Technique: AdaComp employs localized selection of gradient residues, automatically tuning the compression rate by analyzing activity at a local level. This adaptability leads to compression rates of approximately 200× for FC and Long Short-Term Memory (LSTM) layers, and about 40× for convolutional layers; a rough illustration of how sparsity and quantization combine into such rates follows this list.
  3. Empirical Validation: The paper elaborates on empirical results obtained from testing AdaComp on diverse neural architectures (CNNs, DNNs, LSTMs), datasets (such as MNIST, CIFAR10, ImageNet), and optimizers (SGD with momentum, Adam). These experiments confirm that AdaComp maintains model accuracy while drastically reducing communication overhead.
  4. Optimization and System Agnosticism: AdaComp is shown to be agnostic to specific internal DL optimizers and system configurations, allowing flexibility across different training contexts. The adaptation is primarily driven by localized selection mechanisms and relies on only one hyper-parameter for achieving high compression rates.
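To see how sparsity and quantization multiply into the reported end-to-end rates, the back-of-the-envelope arithmetic below reproduces the roughly 200× figure under assumed numbers; the surviving fraction and the per-survivor bit budget are illustrative placeholders, not values taken from the paper.

```python
dense_bits_per_param = 32       # an FP32 gradient entry for every parameter
survivor_fraction = 1 / 250     # assumed fraction of entries picked by localized selection
bits_per_survivor = 40          # assumed cost per sparse entry: index plus quantized value

compression = dense_bits_per_param / (survivor_fraction * bits_per_survivor)
print(f"~{compression:.0f}x end-to-end compression")  # prints ~200x
```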

Implications and Future Directions

The implications of AdaComp's successful compression extend to both theoretical advances in compression strategies and practical applications in distributed deep learning frameworks. The adaptive nature of AdaComp addresses the critical balance between computational throughput and communication bandwidth, especially vital as the scale and complexity of DL models continue to grow.

Future work could extend AdaComp to a broader range of neural network architectures, including emerging transformer models. Research could also further refine the balance between compression efficiency and computational cost, and examine how the approach scales on next-generation DL accelerators.

Conclusion

Overall, this research provides a substantial contribution to the field of distributed DL training by addressing a pertinent issue of communication constraints through an innovative and adaptive compression technique. AdaComp's ability to handle diverse conditions across various architectures and datasets presents a promising advancement in efficient DL training methodologies.