Communication-efficient distributed SGD with Sketching (1903.04488v3)

Published 12 Mar 2019 in cs.LG, cs.DC, math.OC, and stat.ML

Abstract: Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce Sketched SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. We show that Sketched SGD has favorable convergence rates on several classes of functions. When considering all communication -- both of gradients and of updated model weights -- Sketched SGD reduces the amount of communication required compared to other gradient compression methods from $\mathcal{O}(d)$ or $\mathcal{O}(W)$ to $\mathcal{O}(\log d)$, where $d$ is the number of model parameters and $W$ is the number of workers participating in training. We run experiments on a transformer model, an LSTM, and a residual network, demonstrating up to a 40x reduction in total communication cost with no loss in final model performance. We also show experimentally that Sketched SGD scales to at least 256 workers without increasing communication cost or degrading model performance.

Authors (6)
  1. Nikita Ivkin (12 papers)
  2. Daniel Rothchild (11 papers)
  3. Enayat Ullah (15 papers)
  4. Vladimir Braverman (99 papers)
  5. Ion Stoica (177 papers)
  6. Raman Arora (46 papers)
Citations (188)

Summary

Communication-efficient Distributed SGD with Sketching

The paper "Communication-efficient Distributed SGD with Sketching" addresses a significant challenge in large-scale distributed training of neural networks—excessive communication costs that often surpass local computation time. The authors propose a novel algorithm, \ssgd, which utilizes sketching techniques to alleviate the bandwidth bottleneck typically encountered in distributed Stochastic Gradient Descent (SGD).

Overview

Distributed training strategies must cope with the substantial communication overhead of transmitting full gradient information among network nodes. This paper incorporates sketching methods, well established for handling data efficiently in streaming and sub-linear algorithms, into the distributed SGD paradigm. Sketched SGD transmits compressed representations, or "sketches," of the gradients instead of the full gradient data, substantially decreasing the amount of information communicated during training.
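
As a rough illustration of the underlying idea, the minimal NumPy Count Sketch below shows how a high-dimensional gradient with a few large coordinates can be compressed into a small table and queried approximately. This is a simplified sketch under assumed toy sizes, not the paper's full algorithm, which involves additional machinery (e.g. recovering heavy gradient coordinates and accumulating compression error).

```python
# Minimal Count Sketch illustration (toy sizes, not the paper's full algorithm):
# a d-dimensional gradient is hashed into a small rows x cols table with random
# signs, and individual coordinates are estimated by a median over rows.
import numpy as np

class CountSketch:
    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.rows, self.cols, self.dim = rows, cols, dim
        self.buckets = rng.integers(0, cols, size=(rows, dim))   # bucket hash per row
        self.signs = rng.choice([-1.0, 1.0], size=(rows, dim))   # sign hash per row
        self.table = np.zeros((rows, cols))

    def accumulate(self, vec):
        # Scatter-add each signed coordinate into its bucket, once per row.
        for r in range(self.rows):
            np.add.at(self.table[r], self.buckets[r], self.signs[r] * vec)

    def estimate(self, idx):
        # Median of the per-row signed bucket values approximates vec[idx].
        vals = self.table[np.arange(self.rows), self.buckets[:, idx]] * self.signs[:, idx]
        return float(np.median(vals))

d = 10_000                                   # model dimension (toy size)
grad = np.zeros(d)
grad[[3, 42, 777]] = [5.0, -4.0, 3.0]        # a few heavy coordinates
sketch = CountSketch(rows=5, cols=256, dim=d)
sketch.accumulate(grad)
print(sketch.estimate(42))                   # close to -4.0; table has 5*256 entries << d
```

In the distributed setting, each worker would send only its small sketch table rather than the full d-dimensional gradient.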

Technical Contributions

The key contribution of Sketched SGD lies in its ability to maintain favorable convergence rates across multiple classes of optimization functions while dramatically reducing communication load. Specifically, it cuts communication complexity from $\mathcal{O}(d)$ or $\mathcal{O}(W)$, where $d$ is the number of model parameters and $W$ the number of workers, to $\mathcal{O}(\log d)$. Empirical results demonstrate that Sketched SGD achieves up to a 40-fold reduction in total communication cost without degrading final model performance. Experiments spanning diverse architectures, including transformers, LSTMs, and residual networks, provide robust validation of the algorithm's efficacy.
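
To make the scaling concrete, the back-of-the-envelope comparison below uses assumed, illustrative numbers (a 60M-parameter model, float32 values, a 5 x 10,000 sketch table), none of which come from the paper. The paper's reported end-to-end savings of up to 40x are smaller than this raw per-gradient ratio because they account for all communication, including updated model weights.

```python
# Back-of-the-envelope per-iteration upload per worker (illustrative sizes,
# not numbers reported in the paper).
d = 60_000_000                      # assumed model size: ~60M parameters
bytes_per_value = 4                 # float32

full_gradient_mb = d * bytes_per_value / 1e6        # O(d) communication
sketch_values = 5 * 10_000                          # rows * cols; grows like O(log d)
sketch_mb = sketch_values * bytes_per_value / 1e6

print(f"full gradient: {full_gradient_mb:8.1f} MB")  # ~240 MB
print(f"sketch:        {sketch_mb:8.1f} MB")         # ~0.2 MB
print(f"raw ratio:     {full_gradient_mb / sketch_mb:.0f}x")
```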

Notably, Sketched SGD scales effectively to at least 256 workers, exhibiting no increase in communication cost and no deterioration in model performance as more workers participate, which highlights its potential for massively parallel implementations.
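
One property that helps explain this scaling behavior is that Count Sketch is linear: the server can merge the workers' fixed-size tables by simple addition, so aggregation cost does not grow with the model dimension. The sketch below reuses the illustrative CountSketch class from the earlier example; the worker count and gradient values are again assumed toy numbers.

```python
# Linearity/mergeability of the sketch (reuses the illustrative CountSketch
# class defined above; all sizes are toy values).
import numpy as np

d, n_workers = 10_000, 4
rng = np.random.default_rng(1)
worker_grads = [0.01 * rng.standard_normal(d) for _ in range(n_workers)]
worker_grads[0][7] = 10.0                    # one heavy coordinate on worker 0

# Each worker sketches locally with shared hash seeds...
sketches = []
for g in worker_grads:
    s = CountSketch(rows=5, cols=256, dim=d, seed=0)
    s.accumulate(g)
    sketches.append(s)

# ...and the server aggregates by summing the small tables elementwise.
merged = CountSketch(rows=5, cols=256, dim=d, seed=0)
merged.table = sum(s.table for s in sketches)

true_sum = sum(worker_grads)
print(merged.estimate(7), true_sum[7])       # both close to 10.0
```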

Implications and Future Directions

The practical implications of this research are significant, with immediate applications in federated learning setups, collaborative filtering for recommendation systems, and distributed deep learning on edge devices. This communication efficiency makes Sketched SGD a viable candidate for environments constrained by hardware limitations or network speeds.

Theoretically, the introduction of sketching methods into SGD opens avenues for further exploration into alternative compression techniques within distributed machine learning. Future developments may focus on enhancing the precision of sketches, reducing their complexity further, and integrating them into other machine-learning paradigms beyond SGD.

Conclusion

In summary, "Communication-efficient Distributed SGD with Sketching" offers a profound advancement in optimizing communication costs in distributed neural network training. By leveraging the capabilities of sketching techniques, the authors provide a scalable and efficient solution that maintains robust model performance while significantly minimizing communication overhead—a leap forward in achieving efficient distributed learning frameworks.