On the Utility of Gradient Compression in Distributed Training Systems (2103.00543v3)

Published 28 Feb 2021 in cs.DC and cs.LG

Abstract: A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent work proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 different setups. Surprisingly, we observe that only in 6 cases out of more than 200, gradient compression methods provide speedup over optimized synchronous data-parallel training in the typical data-center setting. We conduct an extensive investigation to identify the root causes of this phenomenon, and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy, in order for them to provide a meaningful end-to-end speedup.
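
To make the abstract's subject concrete, here is a minimal sketch of top-k gradient sparsification, a representative gradient compression scheme of the kind compared against optimized synchronous data-parallel SGD. This is an illustrative assumption, not the paper's implementation; the function names and the 1% compression ratio are chosen for the example.

```python
# Sketch (assumed, not from the paper): top-k gradient sparsification,
# one common gradient compression scheme for data-parallel training.
import torch


def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude fraction `ratio` of gradient entries."""
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    values = flat[indices]  # signed values at the selected positions
    return values, indices  # this sparse payload is what a worker would send


def topk_decompress(values: torch.Tensor, indices: torch.Tensor, shape):
    """Rebuild a dense (mostly zero) gradient from the sparse payload."""
    dense = torch.zeros(shape, dtype=values.dtype)
    dense.reshape(-1)[indices] = values
    return dense


if __name__ == "__main__":
    g = torch.randn(1024, 1024)
    vals, idx = topk_compress(g, ratio=0.01)
    g_hat = topk_decompress(vals, idx, g.shape)
    # Only ~1% of entries are communicated; practical schemes typically feed
    # the dropped residual (g - g_hat) back into the next step (error feedback).
    print(f"sent {vals.numel()} of {g.numel()} entries")
```

Even with aggressive sparsification like this, the paper's finding is that compression and decompression overheads, together with well-overlapped all-reduce in data-center networks, leave little room for end-to-end speedup in most of the evaluated setups.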

Authors (4)
  1. Saurabh Agarwal (19 papers)
  2. Hongyi Wang (62 papers)
  3. Shivaram Venkataraman (48 papers)
  4. Dimitris Papailiopoulos (59 papers)
Citations (41)
