
1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed (2102.02888v2)

Published 4 Feb 2021 in cs.LG and cs.DC

Abstract: Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance (non-linear term) becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). Experiments on up to 256 GPUs show that 1-bit Adam enables up to $3.3\times$ higher throughput for BERT-Large pre-training and up to $2.9\times$ higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
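
The key finding above can be made concrete with a minimal, single-worker PyTorch sketch. The function name `one_bit_adam_step`, the `warmup_steps` cutoff, and the state layout are illustrative assumptions rather than the DeepSpeed implementation; bias correction is omitted for brevity.

```python
import torch

def one_bit_adam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, warmup_steps=1000):
    """Hypothetical sketch of the two-phase idea: vanilla Adam during
    warmup, then a frozen variance term used as a fixed preconditioner."""
    beta1, beta2 = betas
    state["t"] += 1
    m, v = state["m"], state["v"]

    # The momentum term is linear in the gradient and is updated in both phases.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)

    # The non-linear variance term is only updated during the warmup phase;
    # afterwards it stays fixed and acts as a constant preconditioner.
    if state["t"] <= warmup_steps:
        v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Adam-style parameter update with the (possibly frozen) preconditioner.
    param.data.addcdiv_(m, v.sqrt() + eps, value=-lr)
```

Once the variance term is frozen, the remaining update is linear in the gradient, just like momentum SGD, which is precisely the structure that existing error-compensation results require and what makes 1-bit compression of the communicated term possible.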

Authors (9)
  1. Hanlin Tang (34 papers)
  2. Shaoduo Gan (9 papers)
  3. Ammar Ahmad Awan (15 papers)
  4. Samyam Rajbhandari (21 papers)
  5. Conglong Li (15 papers)
  6. Xiangru Lian (18 papers)
  7. Ji Liu (285 papers)
  8. Ce Zhang (215 papers)
  9. Yuxiong He (59 papers)
Citations (77)

Summary

Overview of "APMSqueeze: A Communication Efficient ADAM Preconditioned Momentum SGD Algorithm"

The paper introduces APMSqueeze, an approach that improves the communication efficiency of the ADAM optimization algorithm through gradient compression. Recognizing that ADAM is not directly compatible with existing gradient compression techniques, the authors propose a method that retains ADAM's convergence speed while significantly reducing communication overhead.

Key Contributions

  1. Problem Identification: The paper addresses a crucial bottleneck in distributed training with the ADAM optimizer, highlighting its incompatibility with existing gradient compression techniques. This issue often impedes the efficient scaling of sophisticated optimization algorithms in distributed settings.
  2. Algorithmic Innovation: The authors propose APMSqueeze, which leverages an error-compensated approach to compress gradients (see the sketch after this list). By integrating ADAM preconditioning with momentum-based SGD, the algorithm matches ADAM's convergence across training epochs while reducing communication costs by four to eightfold when training large models such as BERT.
  3. Theoretical Analysis: Rigorous theoretical foundations are provided, ensuring that the proposed algorithm exhibits an asymptotic convergence rate matching that of the uncompressed counterpart. The analysis demonstrates the linear speedup effect, attributing it to the effective error compensation strategy employed in the communication compression.
  4. Empirical Validation: Comprehensive experiments underscore the algorithm's robustness and efficiency. APMSqueeze achieves comparable convergence and accuracy results to uncompressed ADAM while reaching up to 32 times communication compression and reducing training times substantially across various machine learning models such as BERT-Base, BERT-Large, ResNet, and DCGAN.
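
As referenced in the second contribution above, the communication step can be illustrated with a short sketch of error-compensated 1-bit compression ahead of an all-reduce. The helper name `compressed_allreduce`, the per-tensor scaling, and the unpacked sign exchange are simplifying assumptions, not the paper's production system design.

```python
import torch
import torch.distributed as dist

def compressed_allreduce(tensor, error_buf):
    """Illustrative error-compensated 1-bit all-reduce; assumes an already
    initialized torch.distributed process group."""
    corrected = tensor + error_buf            # add back last step's compression error
    scale = corrected.abs().mean()            # one floating-point scalar per tensor
    quantized = scale * corrected.sign()      # 1-bit signs plus that scale
    error_buf.copy_(corrected - quantized)    # keep the residual for the next step

    # A real implementation would pack and exchange the sign bits; here the
    # dequantized tensor is simply averaged across workers for clarity.
    dist.all_reduce(quantized, op=dist.ReduceOp.SUM)
    quantized /= dist.get_world_size()
    return quantized
```

The error buffer is what makes the scheme error-compensated: whatever the 1-bit quantizer drops in one step is fed back into the next step, so compression noise does not accumulate unchecked.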

Methodological Insights

  • Error Compensation Strategy: The core of the APMSqueeze algorithm is its novel error-compensated gradient compression mechanism. This ensures that the loss of information due to gradient compression does not derail convergence, which is crucial for maintaining performance parity with ADAM.
  • Convergence Analysis: The paper carefully dissects the conditions under which compressed gradient updates can achieve convergence, providing valuable insights into the behavior of error accumulation and its mitigation; a schematic form of the resulting bound is sketched after this list.
  • Implementation Details: APMSqueeze is tested on distributed systems, utilizing efficient communication frameworks compatible with high-performance computing environments, showing significant throughput improvements.
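
For the convergence analysis above, the flavor of guarantee targeted by such error-compensated methods can be written schematically as follows; this is only the shape of a linear-speedup bound under standard smoothness and bounded-variance assumptions, not the paper's precise statement or constants.

```latex
% Schematic non-convex bound for n workers, T iterations, and
% stochastic-gradient variance sigma^2 (constants and compression
% error terms omitted):
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left\|\nabla f(x_t)\right\|^2
\;\lesssim\; \frac{\sigma}{\sqrt{nT}} \;+\; \frac{1}{T}
```

The leading σ/√(nT) term matches the uncompressed baseline and is what "linear speedup" refers to: for a fixed iteration budget, adding workers tightens the bound at the same rate as it would without compression.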

Practical Implications and Future Work

The impact of APMSqueeze is multifaceted, with immediate implications for distributed training, especially in resource-constrained environments. By addressing communication overhead, the method enables more efficient use of bandwidth, making large-scale training feasible on less advanced infrastructure.

Future research avenues may explore:

  • Extending the APMSqueeze framework to other popular optimization algorithms beyond ADAM.
  • Investigating the impact of varying error compensation strategies and compression levels on different architectures and datasets.
  • Exploring the applicability in other areas like reinforcement learning, as suggested by the authors.

The paper represents a significant stride towards optimizing distributed training, offering insights that could galvanize further innovation in communication-efficient machine learning algorithms.