Toward a Unified Theory of Gradient Descent under Generalized Smoothness (2412.11773v2)

Published 16 Dec 2024 in math.OC

Abstract: We study the classical optimization problem $\min_{x \in \mathbb{R}^d} f(x)$ and analyze the gradient descent (GD) method in both nonconvex and convex settings. It is well-known that, under the $L$-smoothness assumption ($\|\nabla^2 f(x)\| \leq L$), the optimal point minimizing the quadratic upper bound $f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2$ is $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ with step size $\gamma_k = \frac{1}{L}$. Surprisingly, a similar result can be derived under the $\ell$-generalized smoothness assumption ($\|\nabla^2 f(x)\| \leq \ell(\|\nabla f(x)\|)$). In this case, we derive the step size $$\gamma_k = \int_{0}^{1} \frac{dv}{\ell(\|\nabla f(x_k)\| + \|\nabla f(x_k)\| v)}.$$ Using this step size rule, we improve upon existing theoretical convergence rates and obtain new results in several previously unexplored setups.
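To make the step size rule concrete, the following is a minimal Python sketch that approximates the integral with a composite midpoint rule. It is an illustration under stated assumptions, not the authors' implementation: the helper name `step_size`, the number of quadrature nodes, the constants `L`, `L0`, `L1`, and the test function $f(x) = \cosh(x)$ are all choices made here. The sketch checks two things: that a constant $\ell \equiv L$ recovers the classical $\gamma_k = 1/L$, and that for the affine choice $\ell(s) = L_0 + L_1 s$ the quadrature agrees with the closed form $\gamma_k = \frac{1}{L_1 g} \ln\frac{L_0 + 2 L_1 g}{L_0 + L_1 g}$, where $g = \|\nabla f(x_k)\|$.

```python
import math


def step_size(grad_norm, ell, n=1000):
    """Approximate gamma = int_0^1 dv / ell(g + g*v) by the composite midpoint rule.

    grad_norm: the gradient norm g = ||grad f(x_k)||.
    ell: a positive function of the gradient norm (assumed given by the user).
    n: number of midpoint nodes (an arbitrary choice for this sketch).
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        v = (i + 0.5) * h
        total += 1.0 / ell(grad_norm * (1.0 + v))
    return total * h


# Case 1: classical L-smoothness, ell(s) = L, so gamma should equal 1/L.
L = 4.0
gamma_const = step_size(3.0, lambda s: L)

# Case 2: affine ell(s) = L0 + L1*s (illustrative constants), with a closed form.
L0, L1, g = 1.0, 2.0, 3.0
gamma_num = step_size(g, lambda s: L0 + L1 * s)
gamma_closed = math.log((L0 + 2 * L1 * g) / (L0 + L1 * g)) / (L1 * g)

# Case 3: GD on f(x) = cosh(x), where f'' = cosh(x) <= 1 + |sinh(x)| = 1 + |f'(x)|,
# so ell(s) = 1 + s is a valid choice for this particular function.
x = 3.0
values = [math.cosh(x)]
for _ in range(50):
    grad = math.sinh(x)
    gamma = step_size(abs(grad), lambda s: 1.0 + s)
    x = x - gamma * grad  # x_{k+1} = x_k - gamma_k * grad f(x_k)
    values.append(math.cosh(x))

print(gamma_const, gamma_num, gamma_closed, x)
```

In the third case the step size adapts to the gradient norm: it is small where the gradient (and hence the curvature bound) is large, and it approaches 1 near the minimizer, where $\ell$ is close to 1. A fixed step $1/L$ cannot do this for $\cosh$, whose Hessian is unbounded. An adaptive quadrature routine could replace the midpoint rule when $\ell$ varies sharply.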