
Federated Optimization of Smooth Loss Functions (2201.01954v2)

Published 6 Jan 2022 in cs.LG, math.OC, math.ST, stat.ML, and stat.TH

Abstract: In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $\eta$-Hölder smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting sub-dominant factors), where $\phi\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.
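To make the two-phase structure described above concrete, here is a minimal, illustrative sketch: a short communication phase in which the server queries every client's gradient at a few pivot points and fits interpolation weights, followed by inexact gradient descent run entirely at the server. This is not the paper's algorithm; the pivot selection, the least-squares weight fitting, and all names (`federated_low_rank_gd`, `num_pivots`, `clients`) are assumptions made for illustration only.

```python
import numpy as np


def federated_low_rank_gd(clients, p, num_pivots=5, rounds=100, lr=0.1):
    """Sketch of server-side inexact GD driven by a low-rank gradient surrogate.

    clients    : list of callables; clients[i](theta) returns client i's local
                 gradient at parameter theta as a length-p array.
    p          : parameter dimension.
    num_pivots : number of pivot parameters queried in the communication phase
                 (a stand-in for the approximate rank).
    """
    m = len(clients)
    rng = np.random.default_rng(0)

    # Communication phase: each client reports its gradient at a few pivots.
    pivots = [rng.standard_normal(p) for _ in range(num_pivots)]
    # G[i, j, :] = gradient of client i at pivot j, shape (m, num_pivots, p).
    G = np.stack([np.stack([clients[i](t) for t in pivots]) for i in range(m)])

    # Augment the pivot matrix with a row of ones so the fitted weights
    # (approximately) sum to one, i.e. they act as affine interpolation weights.
    P_aug = np.vstack([np.stack(pivots, axis=1), np.ones((1, num_pivots))])

    def approx_full_gradient(theta):
        # Weights w solve P_aug w ~= (theta, 1) in least squares; the surrogate
        # gradient is the w-weighted combination of each client's pivot
        # gradients, averaged over clients. This reconstruction is exact only
        # when gradients are affine in theta -- a crude stand-in for the
        # paper's smoothness-in-data argument.
        rhs = np.append(theta, 1.0)
        w, *_ = np.linalg.lstsq(P_aug, rhs, rcond=None)
        return np.mean(G.transpose(0, 2, 1) @ w, axis=0)

    # Local phase: inexact gradient descent at the server, with no further
    # client communication.
    theta = np.zeros(p)
    for _ in range(rounds):
        theta = theta - lr * approx_full_gradient(theta)
    return theta


# Toy usage: m quadratic clients f_i(theta) = 0.5 * ||theta - c_i||^2, whose
# exact local gradient is theta - c_i, so the minimizer is the mean of the c_i.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    centers = [rng.standard_normal(3) for _ in range(4)]
    clients = [lambda th, c=c: th - c for c in centers]
    theta_hat = federated_low_rank_gd(clients, p=3)
    print(theta_hat, np.mean(centers, axis=0))
```

In the paper's analysis, the low-rank structure comes from smoothness of the loss in the data and the learned weights approximate clients' gradients far more carefully than this least-squares surrogate; the sketch only conveys the workflow of a few communication rounds followed by server-only inexact gradient descent.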
