Random Scaling and Momentum for Non-smooth Non-convex Optimization (2405.09742v1)

Published 16 May 2024 in cs.LG and math.OC

Abstract: Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.


Summary

  • The paper introduces a random scaling technique in SGDM that broadens its theoretical convergence guarantees to non-smooth, non-convex loss functions.
  • It develops the Exponentiated O2NC framework, achieving optimal convergence rates with relaxed stationarity conditions.
  • Experiments on CIFAR-10 with ResNet-18 confirm that SGDM with random scaling matches the performance of standard SGDM in practice.

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Introduction

Deep learning models often require optimizing highly irregular loss functions that can be both non-convex and non-smooth. Typical training algorithms, like Stochastic Gradient Descent with Momentum (SGDM), rely on the assumptions of either convexity or smoothness for their theoretical guarantees. This paper presents a simple yet powerful modification to SGDM: scaling the update at each iteration with an exponentially distributed random scalar. This modification extends SGDM's theoretical convergence guarantees to cases where the loss function is neither convex nor smooth.

Not only does this paper provide a comprehensive theoretical framework for such algorithms, but it also intriguingly reveals that the commonly used SGDM can be adapted for non-convex and non-smooth scenarios with just a minor tweak.
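To make the modification concrete, here is a minimal sketch of a single SGDM step whose update is scaled by an Exp(1) random variable. The hyperparameter names and the exact placement of the momentum average are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sgdm_step_random_scaling(w, m, grad, lr=0.1, beta=0.9, rng=None):
    """One SGDM step whose update is scaled by an Exp(1) random scalar.

    w, m, grad: parameter vector, momentum buffer, and stochastic gradient
    (NumPy arrays of the same shape). Hyperparameter names and the form of
    the momentum average are assumptions for this sketch.
    """
    rng = rng or np.random.default_rng()
    m = beta * m + (1.0 - beta) * grad      # exponential moving average of gradients
    s = rng.exponential(scale=1.0)          # s ~ Exp(1), so E[s] = 1
    w = w - lr * s * m                      # scale the whole update by s
    return w, m
```

Since the exponential scalar has mean 1, the scaled update equals the standard SGDM update in expectation; the randomness only perturbs the step size.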

Key Contributions

New Notion of Stationarity

The paper introduces a new notion of stationarity tailored to non-smooth, non-convex objectives. This relaxed version of the Goldstein stationary point allows for more flexible algorithm designs, bridging the gap between theory and practical implementations. Roughly, a point is considered stationary if gradients in a random neighborhood of the point are small on average, with the size of that neighborhood controlled by an expected squared distance rather than a hard radius.
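For orientation, the classical Goldstein notion and the flavor of the relaxation can be written roughly as follows; the second display is a paraphrase of the idea described above, not the paper's verbatim definition.

```latex
% Goldstein (\delta,\epsilon)-stationarity: some convex combination of
% gradients taken within a hard \delta-ball around x is small.
\mathrm{dist}\bigl(0,\ \partial_\delta f(x)\bigr) \le \epsilon,
\qquad
\partial_\delta f(x) := \mathrm{conv}\bigl\{\nabla f(y) : \|y - x\| \le \delta\bigr\}.

% Relaxed notion (paraphrase): the hard ball is replaced by a random
% perturbation u whose size is controlled only in expectation,
\bigl\|\,\mathbb{E}[\nabla f(x + u)]\,\bigr\| \le \epsilon
\quad \text{for some } u \text{ with } \mathbb{E}\|u\|^2 \le \delta^2 .
```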

Exponentiated Online-to-non-convex Conversion (O2NC)

The authors extend the original O2NC framework with several key improvements (a rough sketch of the resulting loop appears after this list):

  • Unconstrained Iterates: The algorithm doesn't require constraining iterates within a small ball, allowing for larger updates when far from a stationary point.
  • Evaluation at Actual Iterates: Unlike the original O2NC, gradients are evaluated at the actual iterates rather than an intermediate variable, simplifying implementation and reducing memory usage.
  • Exponentially Weighted Gradients: Gradients are amplified using an exponential factor, prioritizing more recent gradients and improving convergence rates.
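
The sketch below shows what such a conversion loop could look like. The online learner interface (predict/update), the toy online gradient descent learner, and the exact form of the exponential weighting are illustrative assumptions, not the paper's pseudocode.

```python
import numpy as np

class OnlineGradientDescent:
    """Minimal unconstrained online learner used only for this sketch."""
    def __init__(self, dim, lr=0.01):
        self.delta = np.zeros(dim)
        self.lr = lr
    def predict(self):
        return self.delta
    def update(self, loss_grad):
        # linear loss <loss_grad, delta>, so take a gradient step on delta
        self.delta = self.delta - self.lr * loss_grad

def exponentiated_o2nc(grad_oracle, learner, x0, T, beta=0.99, rng=None):
    """Illustrative online-to-non-convex conversion loop with the three
    modifications above: unconstrained iterates, gradients evaluated at the
    actual iterates, and exponentially weighted feedback to the learner.
    The learner interface and the weighting scheme are assumptions."""
    rng = rng or np.random.default_rng()
    x = x0.copy()
    iterates = []
    for t in range(1, T + 1):
        delta = learner.predict()          # unconstrained OCO update
        s = rng.exponential(scale=1.0)     # random scaling, s ~ Exp(1)
        x = x + s * delta                  # move to the next iterate
        g = grad_oracle(x)                 # gradient at the actual iterate
        learner.update(g * beta ** (-t))   # weight recent gradients more heavily
        iterates.append(x.copy())
    return iterates
```

For example, `exponentiated_o2nc(lambda x: 2 * x, OnlineGradientDescent(3), np.ones(3), T=100)` runs the loop on f(x) = ||x||². According to the paper, instantiating the online learner appropriately makes this conversion coincide with SGDM whose updates are randomly scaled, which is how the main result is obtained.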

Theoretical Results

The paper provides strong theoretical guarantees for the proposed method:

  • Optimal Rates: When the objective is smooth, the algorithm achieves the optimal rate of $O(\epsilon^{-4})$ iterations to find an $\epsilon$-stationary point. For second-order smooth objectives, the optimal rate improves to $O(\epsilon^{-7/2})$.
  • Relaxed Convergence: Using a relaxed criterion for stationarity, the Exponentiated O2NC framework achieves optimal convergence guarantees for non-smooth, non-convex problems.
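
In standard notation, the smooth-case guarantee takes the familiar form below; this is a paraphrase, not the paper's exact theorem statement.

```latex
% Smooth case: the method returns a point \bar{x} satisfying
\mathbb{E}\bigl\|\nabla F(\bar{x})\bigr\| \le \epsilon
\quad\text{after}\quad T = O(\epsilon^{-4})
\ \text{stochastic gradient evaluations,}
% improving to T = O(\epsilon^{-7/2}) when F also has a Lipschitz Hessian.
```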

Convergence Analysis

The analysis also connects the framework directly back to SGDM:

  • Equivalence to SGDM: The convergence bounds derived in the paper show that, when instantiated with a suitable online learner, the algorithm coincides with SGDM whose updates are scaled by an exponential random variable, and it achieves optimal convergence rates in both the smooth and second-order smooth settings.

Practical Implications

The implications of this research are notable for the practical training of neural networks:

  • Versatility: The algorithm's ability to handle non-smooth non-convex objectives broadens its applicability, as many deep learning architectures (e.g., ReLU, max pooling) introduce non-smoothness.
  • Robustness: The empirical results confirm that the modified SGDM performs comparably to standard SGDM, so the added randomness does not compromise performance in practice.

Speculation on Future Developments

Building on this framework, potential future developments in AI could involve:

  • Adapting Other Optimization Algorithms: Similar modifications can potentially be made to other optimization algorithms to extend their theoretical guarantees to more complex loss landscapes.
  • Hybrid Methods: Combining this approach with other techniques, such as adaptive learning rates or advanced momentum strategies, might yield even more efficient and robust optimization methods.

Experimental Validation

The theory is backed by experiments on CIFAR-10 with ResNet-18:

  • Comparable Performance: The SGDM with random scaling shows nearly identical performance to the standard SGDM in terms of train loss, train accuracy, test loss, and test accuracy.
  • Consistency: Across multiple runs, the modified SGDM consistently performs well, further establishing its reliability and efficacy.

Here is a summary of the experimental results comparing SGDM with and without random scaling:

| Random Scaling | No | Yes |
|---|---|---|
| Train loss (×10⁻⁴) | 9.82 ± 0.21 | 9.55 ± 0.37 |
| Train accuracy (%) | 100.0 ± 0.0 | 100.0 ± 0.0 |
| Test loss (×10⁻²) | 21.6 ± 0.1 | 22.0 ± 0.4 |
| Test accuracy (%) | 94.6 ± 0.1 | 94.4 ± 0.2 |

Conclusion

This paper bridges a significant theoretical gap in optimization for deep learning by demonstrating that a minor modification to a well-known algorithm can extend its effectiveness to non-convex, non-smooth scenarios. The proposed Exponentiated O2NC framework not only achieves optimal convergence guarantees but also closely resembles standard SGDM, making it highly practical for real-world applications.

This work opens doors for future research in optimization, particularly in developing algorithms that maintain theoretical guarantees while being straightforward and efficient in practice.
