Faster Gradient-Free Algorithms for Nonsmooth Nonconvex Stochastic Optimization (2301.06428v3)
Abstract: We consider the optimization problem of the form $\min_{x \in \mathbb{R}^d} f(x) \triangleq \mathbb{E}_{\xi} [F(x; \xi)]$, where the component $F(x;\xi)$ is $L$-mean-squared Lipschitz but possibly nonconvex and nonsmooth. The recently proposed gradient-free method requires at most $\mathcal{O}(L^4 d^{3/2} \epsilon^{-4} + \Delta L^3 d^{3/2} \delta^{-1} \epsilon^{-4})$ stochastic zeroth-order oracle calls to find a $(\delta,\epsilon)$-Goldstein stationary point of the objective function, where $\Delta = f(x_0) - \inf_{x \in \mathbb{R}^d} f(x)$ and $x_0$ is the initial point of the algorithm. This paper proposes a more efficient algorithm using stochastic recursive gradient estimators, which improves the complexity to $\mathcal{O}(L^3 d^{3/2} \epsilon^{-3} + \Delta L^2 d^{3/2} \delta^{-1} \epsilon^{-3})$.
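For context, a point $x$ is a $(\delta,\epsilon)$-Goldstein stationary point if the Goldstein $\delta$-subdifferential $\partial_\delta f(x) = \mathrm{conv}\big(\cup_{\|y-x\|\le\delta} \partial f(y)\big)$ contains an element of norm at most $\epsilon$. The following is a minimal sketch of the two ingredients the abstract combines: a two-point zeroth-order estimator of the gradient of a randomized smoothing of $f$, and a SARAH/SPIDER-style recursive update of that estimator. It is an illustration under stated assumptions, not the paper's exact procedure; the helper names (`zo_recursive_descent`, `two_point_estimator`, `unit_direction`) and all step sizes, batch sizes, and smoothing radii below are hypothetical.

```python
import numpy as np


def unit_direction(rng, d):
    # Uniform random direction on the unit sphere in R^d.
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)


def two_point_estimator(F, x, xi, u, mu):
    # Two-point zeroth-order estimate along a unit direction u:
    #   (d / (2*mu)) * (F(x + mu*u; xi) - F(x - mu*u; xi)) * u.
    # In expectation this is the gradient of a randomized smoothing of f.
    d = x.shape[0]
    return (d / (2.0 * mu)) * (F(x + mu * u, xi) - F(x - mu * u, xi)) * u


def zo_recursive_descent(F, sample_xi, x0, mu, eta, T, q, b_big, b_small, seed=0):
    # Sketch: zeroth-order descent with a SARAH/SPIDER-style recursive estimator v.
    # Every q iterations v is refreshed from a large mini-batch; in between it is
    # updated with coupled differences that reuse the same direction u and sample
    # xi at x_t and x_{t-1}, which is what keeps the estimator variance small.
    rng = np.random.default_rng(seed)
    x_prev, x = x0.copy(), x0.copy()
    v = np.zeros_like(x0)
    for t in range(T):
        if t % q == 0:
            v = np.mean(
                [two_point_estimator(F, x, sample_xi(rng),
                                     unit_direction(rng, x.size), mu)
                 for _ in range(b_big)], axis=0)
        else:
            diff = np.zeros_like(x0)
            for _ in range(b_small):
                xi = sample_xi(rng)
                u = unit_direction(rng, x.size)
                diff += (two_point_estimator(F, x, xi, u, mu)
                         - two_point_estimator(F, x_prev, xi, u, mu))
            v = v + diff / b_small
        x_prev, x = x, x - eta * v
    return x


if __name__ == "__main__":
    # Toy nonsmooth stochastic objective: f(x) = E_xi ||x - xi||_1, xi ~ N(0, I).
    F = lambda x, xi: np.sum(np.abs(x - xi))
    sample_xi = lambda rng: rng.standard_normal(10)
    out = zo_recursive_descent(F, sample_xi, x0=5.0 * np.ones(10), mu=0.05,
                               eta=0.01, T=2000, q=50, b_big=64, b_small=8)
    print("final iterate:", out)
```

The coupled differences are the point of the recursive construction: because each inner term evaluates the two-point estimator at $x_t$ and $x_{t-1}$ with the same $(\xi, u)$ pair, its variance scales with $\|x_t - x_{t-1}\|^2$ rather than with the raw estimator variance, which is how the dependence on $\epsilon$ improves from $\epsilon^{-4}$ to $\epsilon^{-3}$ in the abstract's bound.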