
Taming Momentum in a Distributed Asynchronous Environment (1907.11612v3)

Published 26 Jul 2019 in cs.LG, cs.DC, and stat.ML

Abstract: Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness - the main difficulty in scaling stochastic gradient descent to large clusters. Momentum, which is often used to accelerate convergence and escape local minima, exacerbates gradient staleness, thereby hindering convergence. We propose DANA: a novel technique for asynchronous distributed SGD with momentum that mitigates gradient staleness by computing the gradient on an estimated future position of the model's parameters. Thereby, we show for the first time that momentum can be fully incorporated in asynchronous training with almost no ramifications to final accuracy. Our evaluation on the CIFAR and ImageNet datasets shows that DANA outperforms existing methods in both final accuracy and convergence speed, while scaling up to a total batch size of 16K on 64 asynchronous workers.
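
To make the core idea from the abstract concrete, below is a minimal sketch of computing a gradient at an estimated future position of the parameters rather than at the stale copy a worker last received. This is an illustration of the general technique, not the authors' exact DANA update rule; the helper names (`estimate_future_position`, `worker_step`, `master_apply`) and the `expected_staleness` parameter are assumptions introduced for this example.

```python
import numpy as np

def estimate_future_position(theta, velocity, momentum, expected_staleness):
    """Extrapolate the parameters forward by repeatedly applying the momentum term.

    Assumption: future gradients are unknown, so the drift is approximated by
    momentum-only decay of the current velocity over `expected_staleness` steps.
    """
    theta_hat = theta.copy()
    v = velocity.copy()
    for _ in range(expected_staleness):
        v = momentum * v
        theta_hat += v
    return theta_hat

def worker_step(theta_stale, velocity, grad_fn, momentum=0.9, expected_staleness=4):
    """Asynchronous worker: evaluate the gradient at the estimated future position."""
    theta_hat = estimate_future_position(theta_stale, velocity, momentum, expected_staleness)
    return grad_fn(theta_hat)

def master_apply(theta, velocity, grad, lr=0.1, momentum=0.9):
    """Parameter server: standard SGD-with-momentum update using the returned gradient."""
    velocity[:] = momentum * velocity - lr * grad
    theta += velocity
    return theta, velocity
```

The lookahead is in the same spirit as Nesterov momentum, which also evaluates the gradient at a point shifted by the velocity; here the shift additionally tries to account for how far the shared parameters will have moved by the time a delayed gradient arrives. How DANA actually estimates that future position is specified in the paper itself, not in this sketch.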

Authors (4)
  1. Ido Hakimi (9 papers)
  2. Saar Barkai (2 papers)
  3. Moshe Gabel (6 papers)
  4. Assaf Schuster (17 papers)
Citations (21)
