signSGD: Compressed Optimisation for Non-Convex Problems (1802.04434v3)

Published 13 Feb 2018 in cs.LG, cs.DC, and math.OC

Abstract: Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative $\ell_1/\ell_2$ geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD .

Authors (4)
  1. Jeremy Bernstein (25 papers)
  2. Yu-Xiang Wang (124 papers)
  3. Kamyar Azizzadenesheli (92 papers)
  4. Anima Anandkumar (236 papers)
Citations (975)

Summary

signSGD: Compressed Optimization for Non-Convex Problems

The paper "signSGD: Compressed Optimization for Non-Convex Problems" by Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar introduces a novel method of gradient compression aimed at improving communication efficiency in distributed deep learning. The core idea is to transmit only the sign of each stochastic gradient, significantly reducing the volume of data exchanged between distributed computation nodes. This method maintains the convergence properties of traditional SGD under specific geometric conditions, thereby demonstrating its potential for practical implementation in large-scale machine learning tasks.

Introduction and Motivation

The necessity of distributing the training workload of large neural networks across multiple workers brings about substantial communication overhead due to gradient exchanges. The common parameter-server framework, which entails sending gradients of every model parameter between workers and the server, can be dramatically optimized by gradient compression. The paper contends that the signSGD algorithm, transmitting only the sign of each gradient component, can achieve both effective gradient compression and convergence rates comparable to SGD.

Key Algorithms and Theoretical Contributions

The paper introduces several key algorithms and accompanying theoretical proofs supporting their efficacy:

1. signSGD Algorithm:

  • Algorithm Overview: At each iteration, the algorithm updates parameters using the sign of the stochastic gradient rather than the gradient itself:

    $x_{k+1} \leftarrow x_k - \delta \,\mathrm{sign}(\tilde{g}_k)$
  • Theoretical Justification: The asymptotic bias introduced by using the sign is theoretically analyzed, revealing that the method's effectiveness is determined by the relative geometry of gradients, noise, and curvature.

2. Distributed Training with Majority Vote:

  • Algorithm Overview: In a distributed setting, each worker sends the sign of its stochastic gradient to a parameter server, which aggregates these signs via a majority vote. This reduces worker-server communication to 1 bit per parameter in both directions:

    $x_{k+1} \leftarrow x_k - \delta \,\mathrm{sign}\left(\sum_{m=1}^{M} \mathrm{sign}(\tilde{g}_m)\right)$
  • Variance Reduction: Using a theorem of Gauss (applied to unimodal, symmetric gradient noise, a condition the central limit theorem helps justify for minibatch gradients), the paper shows that aggregating gradient signs from $M$ workers achieves the same variance reduction as full-precision distributed SGD. A minimal code sketch of both update rules follows this list.
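
To make the two update rules above concrete, here is a minimal NumPy sketch of a single-worker signSGD step and a distributed majority-vote step. The toy quadratic objective, the simulated per-worker gradient noise, and the function names are illustrative assumptions, not code from the authors' repository (https://github.com/jxbz/signSGD).

```python
import numpy as np

def signsgd_step(x, grad, lr):
    """Single-worker signSGD: move each coordinate by -lr times the sign
    of its stochastic gradient (np.sign returns 0 for exact zeros)."""
    return x - lr * np.sign(grad)

def majority_vote_step(x, worker_grads, lr):
    """Distributed signSGD with majority vote: each worker sends only the
    sign of its stochastic gradient; the server sums the signs and sends
    back the sign of that sum (1 bit per parameter in each direction)."""
    signs = np.sign(worker_grads)          # shape (M, d), entries in {-1, 0, +1}
    vote = np.sign(signs.sum(axis=0))      # elementwise majority vote
    return x - lr * vote

# Toy usage on f(x) = 0.5 * ||x||^2, whose true gradient is x.
rng = np.random.default_rng(0)
d, M, lr = 10, 5, 0.01
x = rng.normal(size=d)
for _ in range(100):
    # Simulated noisy minibatch gradients, one row per worker.
    worker_grads = x + rng.normal(scale=0.5, size=(M, d))
    x = majority_vote_step(x, worker_grads, lr)
print(np.linalg.norm(x))  # should have decreased toward 0
```

Note that, per the abstract, the paper's large-scale Imagenet experiments rely on the momentum counterpart Signum rather than the plain update shown here.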

Theoretical Implications

The paper provides detailed convergence analysis under the common assumptions in non-convex optimization:

Assumption 1: The objective function has a lower bound.

Assumption 2: The gradients are smooth with respect to an $\ell_1$ norm.

Assumption 3: The stochastic gradient oracle has bounded variance per coordinate.
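
For reference, these assumptions can be written out as follows; the notation ($f_*$ for the lower bound, $L_1, \dots, L_d$ for the per-coordinate smoothness constants, $\sigma_i$ for the per-coordinate noise scales) is paraphrased from the paper's setup, and the exact forms should be checked against the original text.

```latex
% Assumption 1: lower bound on the objective.
% Assumption 2: coordinate-wise (l1-type) smoothness.
% Assumption 3: unbiased stochastic gradients with bounded per-coordinate variance.
\begin{align}
  f(x) &\geq f_* \quad \text{for all } x, \\
  \bigl| f(y) - f(x) - \nabla f(x)^{\top}(y - x) \bigr|
    &\leq \tfrac{1}{2} \sum_{i} L_i (y_i - x_i)^2 \quad \text{for all } x, y, \\
  \mathbb{E}\bigl[\tilde{g}(x)\bigr] = \nabla f(x), \qquad
  \mathbb{E}\bigl[(\tilde{g}(x)_i - \nabla f(x)_i)^2\bigr] &\leq \sigma_i^2 .
\end{align}
```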

The authors establish conditions under which signSGD delivers comparable, and sometimes superior, performance to standard SGD. Specifically, when gradients are dense relative to the noise and curvature (in the relative $\ell_1/\ell_2$ sense used in the paper), the theoretical rate of signSGD matches or improves on that of SGD; moreover, because each coordinate's update is bounded in magnitude by $\delta$, the method is robust to individual coordinates with unusually high-variance gradient components.

Empirical Evaluation

The empirical results reinforce the theoretical findings. On a variety of tasks and datasets including CIFAR-10 and Imagenet, signSGD and its momentum-enhanced version, Signum, deliver competitive performance relative to Adam and standard SGD. Specifically:

  • On the CIFAR-10 dataset, both signSGD and Signum achieve comparable convergence rates to SGD.
  • On the larger and more complex Imagenet dataset, Signum matches the performance of Adam, demonstrating its viability for training deep models with compressed communication.
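
For context on the Signum results above, Signum is signSGD's momentum counterpart: it keeps an exponential moving average of the stochastic gradients and steps along its sign. The sketch below paraphrases that update; the hyperparameter names and the exact form of the momentum average are assumptions based on the paper's description, not the reference implementation.

```python
import numpy as np

def signum_step(x, m, grad, lr=1e-4, beta=0.9):
    """One Signum step (sketch): refresh the momentum buffer m with the new
    stochastic gradient, then move each coordinate by -lr * sign(m)."""
    m = beta * m + (1.0 - beta) * grad   # exponential moving average of gradients
    x = x - lr * np.sign(m)              # sign-based update, as in signSGD
    return x, m
```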

Future Implications and Research Directions

The practical implications of these findings extend to the design of distributed algorithms for deep learning, suggesting that sign-based methods can enable more efficient usage of communication bandwidth and thereby speed up training times. Future work may involve:

  • Exploring the integration of signSGD with other forms of gradient sparsification and quantization.
  • Developing adaptive heuristics to dynamically switch between SGD and signSGD based on run-time geometry analysis of gradients.
  • Extending the theoretical framework to accommodate other common optimizers in deep learning, such as variants of Nesterov's accelerated gradient methods.

In conclusion, the paper presents compelling theoretical and empirical support for sign-based gradient methods as a means to enhance communication efficiency in distributed deep learning. The convergence guarantees under specific geometrical conditions make signSGD a promising technique for optimizing large-scale neural networks, inviting further research and practical adoption.
