signSGD with Majority Vote is Communication Efficient And Fault Tolerant (1810.05291v3)

Published 11 Oct 2018 in cs.DC, cs.AI, and cs.LG

Abstract: Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.

Authors (4)
  1. Jeremy Bernstein (25 papers)
  2. Jiawei Zhao (30 papers)
  3. Kamyar Azizzadenesheli (92 papers)
  4. Anima Anandkumar (236 papers)
Citations (46)

Summary

signSGD with Majority Vote: Communication Efficiency and Fault Tolerance

The paper "signSGD with Majority Vote is Communication Efficient and Fault Tolerant" explores a streamlined approach to distributed optimization for neural networks that addresses both communication efficiency and robustness to network faults. The authors focus on a method called signSGD with majority vote, where workers communicate only the sign of gradient vectors, ensuring both reduced communication overhead and resilience to adversarial attacks.

Key Contributions

  1. Algorithm Overview: The core idea behind signSGD with majority vote is straightforward: each worker sends the element-wise sign of its stochastic gradient to a parameter server, and the server aggregates these sign vectors by majority vote to determine the update direction. Since each worker transmits one bit per dimension instead of a 32-bit float, communication per iteration is reduced by a factor of 32 relative to full-precision distributed SGD (a minimal sketch of the voting step appears after this list).
  2. Convergence Guarantees: The authors establish convergence rates for signSGD in both the large-batch and mini-batch settings. Under realistic noise assumptions, notably that the gradient noise is unimodal and symmetric, they prove convergence at a rate comparable to that of SGD. This matches the practical intuition that such noise is common in mini-batch training, where the Central Limit Theorem pushes the gradient noise toward an approximately Gaussian, and hence unimodal and symmetric, distribution.
  3. Fault Tolerance: The majority vote mechanism bounds the influence of any single worker's gradient, yielding a simple form of Byzantine fault tolerance. The paper shows that, unlike SGD, majority vote remains robust when up to 50% of workers behave adversarially, with the adversary class including, as special cases, workers that invert or randomize their gradient estimates.
  4. Empirical Validation: Experiments with a PyTorch implementation demonstrate the method's efficiency and robustness in practice. On ImageNet with ResNet-50 across 15 AWS p3.2xlarge machines, the framework, with the parameter server housed entirely on one machine, reduced training time by 25% compared to distributed SGD benchmarked against NCCL, albeit with some loss in generalization accuracy.
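
As a concrete illustration of the voting step and its robustness (items 1 and 3 above), below is a minimal NumPy sketch of one majority-vote iteration with simulated sign-inverting adversaries. This is illustrative pseudocode for the aggregation logic, not the authors' PyTorch/NCCL system; all function names and parameters are chosen here for exposition.

```python
import numpy as np

def worker_sign_gradient(grad, adversarial=False):
    # Each worker transmits only the element-wise sign of its stochastic
    # gradient (1 bit per dimension instead of a 32-bit float). An
    # adversarial worker here inverts its sign vector before sending it.
    s = np.sign(grad)
    return -s if adversarial else s

def majority_vote_update(x, worker_grads, lr, adversarial_mask):
    # Server-side step: sum the received sign vectors and apply the sign
    # of the total, i.e. an element-wise majority vote, as the update.
    votes = sum(
        worker_sign_gradient(g, adv)
        for g, adv in zip(worker_grads, adversarial_mask)
    )
    return x - lr * np.sign(votes)

# Toy run: 7 workers estimate the gradient of f(x) = ||x||^2 / 2 with
# additive noise; 3 of them (fewer than half) flip their signs.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
mask = [True, True, True, False, False, False, False]
for _ in range(200):
    grads = [x + 0.1 * rng.normal(size=x.shape) for _ in range(7)]
    x = majority_vote_update(x, grads, lr=0.02, adversarial_mask=mask)
print(x)  # the honest majority still drives x close to the minimum at 0
```

In this toy run the iterates settle into a small neighbourhood of the optimum whose size scales with the step size, despite three sign-inverting workers; the paper's actual system performs the same sign compression and voting over real network communication rather than in a single process.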

Implications

From a technical standpoint, signSGD with majority vote could significantly influence the design of distributed machine learning systems, especially in environments where communication bandwidth is constrained. The fault tolerance of the approach also makes it relevant for settings with unreliable network connections or partially untrusted workers.

The theoretical framework not only provides insight into the behavior of sign-based optimization methods but also shows that their guarantees extend beyond convex problems, placing them firmly within the non-convex optimization setting typical of deep learning.

Future Work

There are several promising directions for further exploration. One is optimizing the parameter server to make communication more efficient and scalable. The theoretical exploration indicates that adjusting per-worker mini-batch sizes could enhance model performance in practical scenarios, potentially mitigating generalization gaps observed in experiments. Additionally, the connection between signSGD and model compression—given its tendency to direct weights toward specific values—could inspire novel approaches to model size reduction and efficient storage.

In conclusion, "signSGD with Majority Vote is Communication Efficient and Fault Tolerant" makes a compelling case for sign-based methods in distributed optimization: it offers convergence guarantees with substantially reduced communication and tolerance to adversarial workers. These properties make it a significant contribution to optimization strategies for distributed, large-scale machine learning systems.
