DropCompute: simple and more robust distributed synchronous training via compute variance reduction (2306.10598v2)

Published 18 Jun 2023 in cs.LG

Abstract: Background: Distributed training is essential for large-scale training of deep neural networks (DNNs). The dominant methods for large-scale DNN training are synchronous (e.g., All-Reduce), but these require waiting for all workers in each step. Thus, these methods are limited by the delays caused by straggling workers. Results: We study a typical scenario in which workers straggle due to variability in compute time. We find an analytical relation between compute time properties and the scalability limitations caused by such straggling workers. With these findings, we propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training. This method can be integrated with the widely used All-Reduce. Our findings are validated on large-scale training tasks using 200 Gaudi accelerators.
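The abstract does not spell out the mechanism, but the core idea of bounding per-worker compute time before a synchronous All-Reduce can be illustrated in a few lines. The following is a minimal PyTorch-style sketch, not the authors' implementation: the `train_step` function, the `compute_budget_s` threshold, and the micro-batch loop are illustrative assumptions; the only part tied directly to the abstract is that gradients still go through the standard All-Reduce.

```python
# Illustrative sketch (not the paper's code) of a per-worker compute-time
# cutoff in the spirit of DropCompute: accumulate gradients over micro-batches,
# stop early once a local compute budget is exceeded, then join the usual
# synchronous All-Reduce with the other workers.
import time
import torch
import torch.distributed as dist


def train_step(model, micro_batches, loss_fn, compute_budget_s):
    """One synchronous step with an assumed per-worker compute budget (seconds)."""
    for p in model.parameters():
        p.grad = None

    start = time.monotonic()
    completed = 0
    for x, y in micro_batches:
        loss = loss_fn(model(x), y)
        loss.backward()  # gradients accumulate across micro-batches
        completed += 1
        if time.monotonic() - start > compute_budget_s:
            break  # drop the remaining micro-batches (straggler cutoff)

    # Standard synchronous gradient exchange; normalization here is simplified
    # and does not account for how many micro-batches each worker completed.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    return completed
```

In a real system the budget would presumably be derived from the compute-time statistics the paper analyzes, and gradient normalization would account for the number of micro-batches each worker actually contributed; both are left out of this sketch for brevity.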

Authors (8)
  1. Niv Giladi (7 papers)
  2. Shahar Gottlieb (3 papers)
  3. Moran Shkolnik (5 papers)
  4. Asaf Karnieli (5 papers)
  5. Ron Banner (20 papers)
  6. Elad Hoffer (23 papers)
  7. Kfir Yehuda Levy (10 papers)
  8. Daniel Soudry (76 papers)
Citations (1)
