Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication (1902.00340v1)

Published 1 Feb 2019 in cs.LG, cs.DC, cs.DS, math.OC, and stat.ML

Abstract: We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over $n$ machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by $\omega \leq 1$ ($\omega=1$ meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, $\mathcal{O}(1/(nT))$, is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time $\mathcal{O}(1/(\delta^2\omega) \log (1/\epsilon))$ for accuracy $\epsilon > 0$. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for $\omega > 0$ and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective state-of-the-art baselines and CHOCO-SGD can reduce communication by at least two orders of magnitudes.

Citations (473)

Summary

  • The paper presents Choco-SGD, a decentralized SGD method that matches centralized convergence rates even with compressed updates.
  • The paper develops Choco-Gossip, an algorithm achieving a linear consensus convergence rate while handling arbitrary compressed messages.
  • Experimental results confirm that both algorithms notably cut communication overhead and enhance scalability in decentralized learning systems.

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

This paper addresses decentralized stochastic optimization, which arises when the objective function is distributed across multiple machines that can communicate only with their neighbors on a fixed network topology. In this setting, compressing the exchanged model updates, for example by quantization or sparsification, can significantly reduce the communication burden.
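
The compression operators are only named in the abstract (quantization, sparsification) and characterized by a quality parameter $\omega \leq 1$. As an illustration, the following Python sketch shows two standard sparsifiers that fit this framework; the function names and the identification $\omega = k/d$ are illustrative assumptions, not definitions taken from the paper.

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of x and zero out the rest.

    A biased (contractive) compressor: ||top_k(x) - x||^2 <= (1 - k/d) ||x||^2,
    i.e. compression quality omega = k/d in the paper's notation (illustrative).
    """
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def rand_k(x: np.ndarray, k: int, rng=None) -> np.ndarray:
    """Keep k uniformly random coordinates of x (unbiased after rescaling by d/k;
    kept unscaled here so it is contractive like top_k)."""
    rng = rng or np.random.default_rng()
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out
```

Smaller $k$ means cheaper messages but a smaller $\omega$, which, per the rates quoted below, only slows the higher-order terms of convergence.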

Main Contributions

The paper presents three key contributions:

  1. Choco-SGD Algorithm: A novel gossip-based stochastic gradient descent (SGD) method for decentralized optimization. For strongly convex objectives it converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$, where $T$ denotes the number of iterations, $\delta$ the eigengap of the connectivity matrix, and $\omega$ the quality of compression. Notably, the leading term $\mathcal{O}(1/(nT))$ matches the centralized baseline with exact communication; network topology and compression affect only the higher-order term, which decays faster as $T$ grows. (A sketch of a Choco-style update appears after this list.)
  2. Choco-Gossip Algorithm: A gossip algorithm for the average consensus problem that converges linearly, reaching accuracy $\epsilon > 0$ in time $\mathcal{O}(1/(\delta^2\omega) \log (1/\epsilon))$. To the authors' knowledge, it is the first gossip scheme that supports arbitrary compressed messages ($\omega > 0$) while still converging to the exact average; previous compressed schemes either required increasingly precise quantization or converged only to a neighborhood of the optimal solution.
  3. Experimental Validation: Experiments demonstrate that Choco-SGD and Choco-Gossip outperform the respective state-of-the-art baselines, with Choco-SGD reducing communication by at least two orders of magnitude while maintaining strong convergence behavior.
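
The summary does not reproduce the update rules themselves. The sketch below (referenced in item 1) shows one plausible way to implement the interaction between local SGD steps and compressed gossip that the abstract describes, simulating all $n$ nodes in a single process; the function name, the auxiliary variables `X_hat`, and the exact form of the consensus correction are reconstructions for illustration, not the paper's own pseudocode.

```python
import numpy as np

def choco_sgd_step(X, X_hat, W, grad, compress, lr, gamma):
    """One synchronous round of a Choco-SGD-style update (centralized simulation).

    X        : (n, d) array of current local models x_i
    X_hat    : (n, d) array of public estimates that neighbors hold of each x_i
    W        : (n, n) symmetric, doubly stochastic mixing matrix of the graph
    grad     : callable grad(i, x) -> stochastic gradient of f_i at x
    compress : callable Q(v) -> compressed version of the vector v
    lr, gamma: SGD step size and consensus step size
    """
    n = X.shape[0]
    # 1) local stochastic gradient step on each node's own data
    X_half = X - lr * np.stack([grad(i, X[i]) for i in range(n)])
    # 2) each node compresses the change of its public estimate and broadcasts it
    Q = np.stack([compress(X_half[i] - X_hat[i]) for i in range(n)])
    X_hat_new = X_hat + Q
    # 3) gossip correction: nodes move toward a weighted average of the
    #    *public* (compressed) estimates, never exchanging full models
    X_new = X_half + gamma * (W - np.eye(n)) @ X_hat_new
    return X_new, X_hat_new
```

Setting `lr = 0` (dropping step 1) reduces this to a compressed consensus iteration in the spirit of Choco-Gossip: the only quantities transmitted are the compressed differences `compress(X_half[i] - X_hat[i])`, which is where the reported communication savings come from.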

Implications and Future Directions

The advancements in Choco-SGD and Choco-Gossip notably improve the efficiency and scalability of decentralized learning systems. By mitigating the communication bottleneck, the algorithms enable more scalable and fault-tolerant computation without a central coordinator. This opens pathways for more effective deployment in large data centers and for on-device training where data must remain decentralized for privacy reasons.

The results prompt further exploration in several areas:

  • Extending to Non-Convex Problems: While the current work focuses on convex optimization, extending these techniques to non-convex domains, such as those encountered in deep learning, could unlock further potential.
  • Enhancing Compression Techniques: Exploring more advanced compression strategies or adaptive methods could improve efficiency and applicability in diverse network settings or with varying data distributions.
  • Real-World Applications: Applying these algorithms to real-world distributed systems, such as federated learning across multiple organizations, could be transformative for industries that handle sensitive data.

In summary, the methodologies presented in the paper significantly advance the state of decentralized optimization, particularly in reducing communication overhead while maintaining robust convergence properties. These developments lay the groundwork for more efficient and scalable distributed learning systems, further influencing future research and practical applications in artificial intelligence.