Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization (2404.01200v1)
Abstract: Distributionally robust optimization (DRO) is a powerful framework for training models that are robust to data distribution shifts. This paper focuses on constrained DRO, which admits an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss functions and exclude the practical and challenging case of non-convex loss functions, e.g., neural networks. This paper develops a stochastic algorithm, together with its performance analysis, for non-convex constrained DRO. The per-iteration computational complexity of our stochastic algorithm is independent of the overall dataset size, making it suitable for large-scale applications. We focus on uncertainty sets defined by the general Cressie-Read family of divergences, which includes the $\chi^2$-divergence as a special case. We prove that our algorithm finds an $\epsilon$-stationary point with a computational complexity of $\mathcal{O}(\epsilon^{-3k_*-5})$, where $k_*$ is the parameter of the Cressie-Read divergence. Numerical results indicate that our method outperforms existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO.
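To make the objective concrete, the sketch below evaluates a constrained Cressie-Read DRO loss on a mini-batch via the well-known dual reformulation (Duchi & Namkoong, 2018): $\sup_{Q: D_{f_k}(Q\|P)\le\rho} \mathbb{E}_Q[\ell] = \inf_{\eta}\, c_k(\rho)\,\mathbb{E}_P[(\ell-\eta)_+^{k_*}]^{1/k_*} + \eta$, where $k_* = k/(k-1)$ and $c_k(\rho) = (1 + k(k-1)\rho)^{1/k}$. This is an illustration of the objective only, not the algorithm analyzed in the paper; the grid search over the dual variable $\eta$, the choice $k = 2$ (the $\chi^2$ case), the radius $\rho$, and the synthetic batch losses are assumptions made for the example.

```python
# Minimal sketch of the constrained Cressie-Read DRO loss via its dual form
# (Duchi & Namkoong, 2018). Illustration only -- NOT the paper's algorithm.
# The values of k, rho, and the synthetic losses are assumptions.
import numpy as np

def cressie_read_dual(losses, eta, k, rho):
    """Dual objective c_k(rho) * E[(l - eta)_+^{k*}]^{1/k*} + eta,
    with k* = k / (k - 1). For k = 2 this is the chi^2-divergence case."""
    k_star = k / (k - 1.0)
    c_k = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)
    moment = np.mean(np.maximum(losses - eta, 0.0) ** k_star)
    return c_k * moment ** (1.0 / k_star) + eta

def robust_loss(losses, k=2.0, rho=0.1, n_grid=200):
    """Approximate the robust loss by a 1-D grid search over eta
    (the dual objective is convex in eta, so a grid suffices here)."""
    etas = np.linspace(losses.min() - 1.0, losses.max(), n_grid)
    return min(cressie_read_dual(losses, e, k, rho) for e in etas)

# The cost of one evaluation depends only on the mini-batch size, not on
# the full dataset size -- the property the abstract emphasizes.
rng = np.random.default_rng(0)
batch_losses = rng.exponential(scale=1.0, size=128)  # stand-in per-sample losses
print("ERM loss:", batch_losses.mean())
print("DRO loss:", robust_loss(batch_losses, k=2.0, rho=0.1))
```

The robust loss upper-bounds the empirical mean and grows with $\rho$; in a training loop one would backpropagate through this mini-batch estimate instead of the average loss.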