Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-Convex Stochastic Optimization under Relaxed Smoothness
The paper "Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness" addresses a fundamental question in optimization for deep learning: can adaptive gradient methods, such as AdaGrad-type algorithms, converge without incurring higher-order polynomial dependence on problem parameters under relaxed smoothness conditions? The authors, Crawshaw and Liu, establish complexity lower bounds for adaptive optimization algorithms in non-convex stochastic optimization under relaxed smoothness.
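For context, relaxed smoothness in this line of work typically refers to the $(L_0, L_1)$-smoothness condition introduced by Zhang et al. in the study of gradient clipping (we assume here that the paper adopts this standard definition):

$$\left\| \nabla^2 f(x) \right\| \le L_0 + L_1 \left\| \nabla f(x) \right\|,$$

which allows the local curvature to grow with the gradient norm, in contrast to classical $L$-smoothness, where $\|\nabla^2 f(x)\| \le L$ holds uniformly. The constant $L_1$ appearing in the paper's lower bounds is the coefficient of this gradient-dependent term.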
Key Contributions
The paper provides pivotal insights into the intrinsic complexity limitations of adaptive algorithms under relaxed smoothness conditions. Key contributions include:
Lower Bound Demonstration for Decorrelated AdaGrad-Norm: A cornerstone of the paper's contributions is the establishment of a complexity lower bound of $\Omega \left( \Delta^2 L_1^2 \sigma^2 \epsilon^{-4} \right)$. This result underscores that Decorrelated AdaGrad-Norm incurs a quadratic dependence on problem parameters such as the initial optimality gap $\Delta$ and the relaxed-smoothness constant $L_1$, preventing it from recovering the optimal convergence complexities achievable in standard smooth settings. This finding signifies a fundamental difficulty these algorithms face in the relaxed smooth setting compared to classical smoothness conditions.
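AdaGrad-Norm maintains a single scalar accumulator of squared gradient norms. The Python sketch below illustrates the update and, as we read the paper, the distinction drawn by the "Decorrelated" variant: its step size is computed from past gradients only, so it is statistically independent of the current stochastic gradient. The function name and default hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1.0, steps=100, decorrelated=True):
    """Sketch of AdaGrad-Norm. With decorrelated=True, the step size is
    computed before the current gradient enters the accumulator, so it is
    independent of the current stochastic noise (our reading of the
    'Decorrelated' variant analyzed in the paper)."""
    x = np.asarray(x0, dtype=float)
    b2 = b0 ** 2  # running sum of squared gradient norms (plus b0^2)
    for _ in range(steps):
        g = grad(x)
        if decorrelated:
            step = eta / np.sqrt(b2)   # accumulator excludes the current g
            b2 += np.dot(g, g)
        else:
            b2 += np.dot(g, g)         # standard AdaGrad-Norm ordering
            step = eta / np.sqrt(b2)
        x = x - step * g
    return x
```

On a simple quadratic (e.g. `grad = lambda x: x`), both variants drive the iterate toward the stationary point; the lower bound concerns worst-case relaxed-smooth stochastic instances, not benign ones like this.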
Analysis of AdaGrad Variants: The paper examines both Decorrelated AdaGrad and the original AdaGrad algorithm. For Decorrelated AdaGrad, it establishes a lower bound with a significant dependence on the variance of the stochastic gradients, reinforcing the stringency of complexity bounds under relaxed smoothness. The analysis of the original AdaGrad exhibits similar dependencies, albeit with a different factor related to the stabilization constant, highlighting the nuanced complexities these algorithms incur.
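For reference, the original coordinate-wise AdaGrad update, including the stabilization constant that the paper's AdaGrad lower bound depends on, can be sketched as follows (a minimal illustration with arbitrary hyperparameters, not the paper's exact formulation):

```python
import numpy as np

def adagrad(grad, x0, eta=0.5, eps=1e-8, steps=500):
    """Original (coordinate-wise) AdaGrad. `eps` is the stabilization
    constant added to the denominator; the paper's lower bound for the
    original AdaGrad involves a factor related to this constant."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # per-coordinate sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        v += g * g
        x = x - eta * g / (np.sqrt(v) + eps)  # per-coordinate step sizes
    return x
```

Unlike AdaGrad-Norm's single scalar accumulator, each coordinate here gets its own effective step size, which is why the two variants are analyzed separately.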
Examination of Adaptive SGD with Single-Step Updates: The analysis extends to adaptive SGD variants whose step sizes depend only on the current gradient information. The findings reveal a near-quadratic complexity dependence for these algorithms as well, including on the stochastic gradient variance and the initial optimality gap, underscoring the difficulty relaxed smoothness poses even for this simpler adaptive family.
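A single-step adaptive scheme sets its step size from the current gradient alone, with no accumulated history. One common instance is a normalized-SGD-style update; the sketch below is our illustrative example, and the exact algorithm family analyzed in the paper may differ.

```python
import numpy as np

def single_step_adaptive_sgd(grad, x0, eta=0.5, beta=1.0, steps=200):
    """Adaptive SGD whose step size depends only on the current stochastic
    gradient (a 'single-step' update in our reading of the paper's
    terminology). The normalization eta / (beta + ||g||) is one common
    instance, resembling normalized or clipped SGD."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        # Step size uses only the current g; no accumulator is kept.
        x = x - eta / (beta + np.linalg.norm(g)) * g
    return x
```

Because the step size never shrinks with accumulated history, such methods behave quite differently from AdaGrad-type accumulators, yet the paper shows they face comparably steep complexity dependencies under relaxed smoothness.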
Theoretical and Practical Implications
The insights derived from this study carry both theoretical and practical implications:
Theoretical Implications: The complexity analyses in the paper deepen our understanding of the limitations of adaptive gradient algorithms under relaxed smoothness. The bounds make explicit the relationship between algorithmic structure and problem-specific parameters, informing the design of future optimization algorithms that may circumvent these dependencies.
Practical Implications: For training deep learning models and large-scale language models, the results help explain why certain adaptive methods may underperform their theoretically predicted convergence rates on complex neural network architectures. Understanding these limitations is crucial for guiding the selection of optimization techniques in real-world applications and for research into more efficient variants.
Future Directions
Future research may build upon these findings by exploring alternative adaptive methods such as Adam or RMSProp under relaxed smoothness conditions, examining potential avenues for reducing complexity dependencies, and integrating empirical evaluations to complement theoretical insights. Another direction could focus on refining the assumptions and frameworks to align more closely with practical deep learning scenarios, potentially integrating structural aspects of neural networks more deeply into the optimization landscape.
In conclusion, the paper by Crawshaw and Liu provides a foundational understanding of complexity bounds in adaptive optimization methods under relaxed smooth conditions, elucidating challenges and guiding future innovations in optimization for machine learning.