Random Scaling and Momentum for Non-smooth Non-convex Optimization (2405.09742v1)

Published 16 May 2024 in cs.LG and math.OC

Abstract: Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.


Summary

  • The paper introduces a random scaling technique in SGDM that broadens its theoretical convergence guarantees to non-smooth, non-convex loss functions.
  • It develops the Exponentiated O2NC framework, achieving optimal convergence rates with relaxed stationarity conditions.
  • Experiments on CIFAR-10 with ResNet-18 confirm that SGDM with random scaling matches the performance of standard SGDM in practice.

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Introduction

Deep learning models often require optimizing highly irregular loss functions that can be both non-convex and non-smooth. Typical training algorithms, like Stochastic Gradient Descent with Momentum (SGDM), rely on the assumptions of either convexity or smoothness for their theoretical guarantees. This paper presents a simple yet powerful modification to SGDM: scaling the update at each iteration with an exponentially distributed random scalar. This modification extends SGDM's theoretical convergence guarantees to cases where the loss function is neither convex nor smooth.

Not only does this paper provide a comprehensive theoretical framework for such algorithms, but it also intriguingly reveals that the commonly used SGDM can be adapted for non-convex and non-smooth scenarios with just a minor tweak.
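To make the modification concrete, here is a minimal sketch of a single SGDM step whose update is scaled by an Exp(1) random variable. The hyperparameter names and the exact placement of the momentum average are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sgdm_step_random_scaling(w, m, grad, lr=0.1, beta=0.9, rng=None):
    """One SGDM step whose update is scaled by an Exp(1) random scalar.

    w, m, grad: parameter vector, momentum buffer, and stochastic gradient
    (NumPy arrays of the same shape). Hyperparameter names and the form of
    the momentum average are assumptions for this sketch.
    """
    rng = rng or np.random.default_rng()
    m = beta * m + (1.0 - beta) * grad      # exponential moving average of gradients
    s = rng.exponential(scale=1.0)          # s ~ Exp(1), so E[s] = 1
    w = w - lr * s * m                      # scale the whole update by s
    return w, m
```

Since the exponential scalar has mean 1, the scaled update equals the standard SGDM update in expectation; the randomness only perturbs the step size.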

Key Contributions

New Notion of Stationarity

The paper introduces a new notion of stationarity tailored to non-smooth, non-convex objectives. This relaxed version of the Goldstein stationary point allows for more flexible algorithm designs, bridging the gap between theory and practical implementations. Roughly, a point is considered stationary if gradients in a random neighborhood of the point are small on average, with the size of that neighborhood controlled by an expected squared distance rather than a hard radius.
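For orientation, the classical Goldstein notion and the flavor of the relaxation can be written roughly as follows; the second display is a paraphrase of the idea described above, not the paper's verbatim definition.

```latex
% Goldstein (\delta,\epsilon)-stationarity: some convex combination of
% gradients taken within a hard \delta-ball around x is small.
\mathrm{dist}\bigl(0,\ \partial_\delta f(x)\bigr) \le \epsilon,
\qquad
\partial_\delta f(x) := \mathrm{conv}\bigl\{\nabla f(y) : \|y - x\| \le \delta\bigr\}.

% Relaxed notion (paraphrase): the hard ball is replaced by a random
% perturbation u whose size is controlled only in expectation,
\bigl\|\,\mathbb{E}[\nabla f(x + u)]\,\bigr\| \le \epsilon
\quad \text{for some } u \text{ with } \mathbb{E}\|u\|^2 \le \delta^2 .
```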

Exponentiated Online-to-non-convex Conversion (O2NC)

The authors extend the original O2NC framework with several key improvements (a rough sketch of the resulting loop appears after this list):

  • Unconstrained Iterates: The algorithm doesn't require constraining iterates within a small ball, allowing for larger updates when far from a stationary point.
  • Evaluation at Actual Iterates: Unlike the original O2NC, gradients are evaluated at the actual iterates rather than an intermediate variable, simplifying implementation and reducing memory usage.
  • Exponentially Weighted Gradients: Gradients are amplified using an exponential factor, prioritizing more recent gradients and improving convergence rates.
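
The sketch below shows what such a conversion loop could look like. The online learner interface (predict/update), the toy online gradient descent learner, and the exact form of the exponential weighting are illustrative assumptions, not the paper's pseudocode.

```python
import numpy as np

class OnlineGradientDescent:
    """Minimal unconstrained online learner used only for this sketch."""
    def __init__(self, dim, lr=0.01):
        self.delta = np.zeros(dim)
        self.lr = lr
    def predict(self):
        return self.delta
    def update(self, loss_grad):
        # linear loss <loss_grad, delta>, so take a gradient step on delta
        self.delta = self.delta - self.lr * loss_grad

def exponentiated_o2nc(grad_oracle, learner, x0, T, beta=0.99, rng=None):
    """Illustrative online-to-non-convex conversion loop with the three
    modifications above: unconstrained iterates, gradients evaluated at the
    actual iterates, and exponentially weighted feedback to the learner.
    The learner interface and the weighting scheme are assumptions."""
    rng = rng or np.random.default_rng()
    x = x0.copy()
    iterates = []
    for t in range(1, T + 1):
        delta = learner.predict()          # unconstrained OCO update
        s = rng.exponential(scale=1.0)     # random scaling, s ~ Exp(1)
        x = x + s * delta                  # move to the next iterate
        g = grad_oracle(x)                 # gradient at the actual iterate
        learner.update(g * beta ** (-t))   # weight recent gradients more heavily
        iterates.append(x.copy())
    return iterates
```

For example, `exponentiated_o2nc(lambda x: 2 * x, OnlineGradientDescent(3), np.ones(3), T=100)` runs the loop on f(x) = ||x||². According to the paper, instantiating the online learner appropriately makes this conversion coincide with SGDM whose updates are randomly scaled, which is how the main result is obtained.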

Theoretical Results

The paper provides strong theoretical guarantees for the proposed method:

  • Optimal Rates: When the objective is smooth, the algorithm achieves the optimal rate of $O(\epsilon^{-4})$ iterations to find an $\epsilon$-stationary point. For second-order smooth objectives, the optimal rate improves to $O(\epsilon^{-7/2})$.
  • Relaxed Convergence: Using a relaxed criterion for stationarity, the Exponentiated O2NC framework achieves optimal convergence guarantees for non-smooth, non-convex problems.
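
In standard notation, the smooth-case guarantee takes the familiar form below; this is a paraphrase, not the paper's exact theorem statement.

```latex
% Smooth case: the method returns a point \bar{x} satisfying
\mathbb{E}\bigl\|\nabla F(\bar{x})\bigr\| \le \epsilon
\quad\text{after}\quad T = O(\epsilon^{-4})
\ \text{stochastic gradient evaluations,}
% improving to T = O(\epsilon^{-7/2}) when F also has a Lipschitz Hessian.
```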

Convergence Analysis

The analysis also connects the framework directly back to SGDM:

  • Equivalence to SGDM: The convergence bounds derived in the paper show that, when instantiated with a suitable online learner, the algorithm coincides with SGDM whose updates are scaled by an exponential random variable, and it achieves optimal convergence rates in both the smooth and second-order smooth settings.

Practical Implications

The implications of this research are notable for the practical training of neural networks:

  • Versatility: The algorithm's ability to handle non-smooth non-convex objectives broadens its applicability, as many deep learning architectures (e.g., ReLU, max pooling) introduce non-smoothness.
  • Robustness: The empirical results confirm that the modified SGDM performs comparably to standard SGDM, so the added randomness does not compromise performance in practice.

Speculation on Future Developments

Building on this framework, potential future developments in AI could involve:

  • Adapting Other Optimization Algorithms: Similar modifications can potentially be made to other optimization algorithms to extend their theoretical guarantees to more complex loss landscapes.
  • Hybrid Methods: Combining this approach with other techniques, such as adaptive learning rates or advanced momentum strategies, might yield even more efficient and robust optimization methods.

Experimental Validation

The theory is backed by experiments on CIFAR-10 with ResNet-18:

  • Comparable Performance: The SGDM with random scaling shows nearly identical performance to the standard SGDM in terms of train loss, train accuracy, test loss, and test accuracy.
  • Consistency: Across multiple runs, the modified SGDM consistently performs well, further establishing its reliability and efficacy.

Here is a summary of the experimental results comparing SGDM with and without random scaling:

| Random Scaling | No | Yes |
|---|---|---|
| Train loss (×10⁻⁴) | 9.82 ± 0.21 | 9.55 ± 0.37 |
| Train accuracy (%) | 100.0 ± 0.0 | 100.0 ± 0.0 |
| Test loss (×10⁻²) | 21.6 ± 0.1 | 22.0 ± 0.4 |
| Test accuracy (%) | 94.6 ± 0.1 | 94.4 ± 0.2 |

Conclusion

This paper bridges a significant theoretical gap in optimization for deep learning by demonstrating that a minor modification to a well-known algorithm can extend its effectiveness to non-convex, non-smooth scenarios. The proposed Exponentiated O2NC framework not only achieves optimal convergence guarantees but also closely resembles standard SGDM, making it highly practical for real-world applications.

This work opens doors for future research in optimization, particularly in developing algorithms that maintain theoretical guarantees while being straightforward and efficient in practice.
