Analysis of Kernel Mirror Prox for Measure Optimization (2403.00147v1)
Abstract: By choosing a suitable function space as the dual to the non-negative measure cone, we study in a unified framework a class of functional saddle-point optimization problems, which we term Mixed Functional Nash Equilibrium (MFNE) problems and which underlie several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. When the function space is chosen as a reproducing kernel Hilbert space (RKHS), we model the saddle-point optimization dynamics as an interacting Fisher-Rao-RKHS gradient flow. As its discrete-time counterpart, we propose a primal-dual kernel mirror prox (KMP) algorithm, which combines a dual gradient step in the RKHS with a primal entropic mirror prox step. We then provide a unified convergence analysis of KMP in an infinite-dimensional setting for this class of MFNE problems, establishing a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case, where $N$ is the number of iterations. As a case study, we apply our analysis to DRO, providing algorithmic guarantees for robustness and convergence.
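The primal-dual structure described above (an entropic mirror prox step on the measure, paired with a gradient step in the RKHS for the dual function) can be illustrated with a small numerical sketch. This is not the paper's algorithm verbatim: it assumes a measure supported on n fixed points, parameterizes the dual RKHS function as f = Σ_j α_j k(x_j, ·), and uses the toy saddle objective L(p, α) = pᵀKα − (λ/2) αᵀKα (minimized over the simplex in p, maximized over α); the function and variable names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, bandwidth=0.5):
    """Gram matrix K_ij = exp(-(x_i - x_j)^2 / bandwidth)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / bandwidth)

def kmp(K, lam=1.0, eta=0.2, n_iter=2000):
    """Toy kernel mirror prox on L(p, a) = p^T K a - (lam/2) a^T K a."""
    n = K.shape[0]
    p = np.full(n, 1.0 / n)   # primal: probability vector on the simplex
    alpha = np.zeros(n)       # dual: RKHS coefficients of f
    for _ in range(n_iter):
        # -- extrapolation step, gradients at the current point --
        g_p = K @ alpha                           # d/dp L(p, alpha)
        p_half = p * np.exp(-eta * g_p)           # entropic mirror step
        p_half /= p_half.sum()
        # RKHS gradient of L in f is sum_i p_i k(x_i,.) - lam f,
        # i.e. the coefficient direction p - lam * alpha:
        alpha_half = alpha + eta * (p - lam * alpha)
        # -- update step, gradients at the extrapolated point,
        #    prox-centered at the same (p, alpha) --
        g_p = K @ alpha_half
        p_new = p * np.exp(-eta * g_p)
        p_new /= p_new.sum()
        alpha = alpha + eta * (p_half - lam * alpha_half)
        p = p_new
    return p, alpha
```

At the saddle point of this toy objective the dual coefficients satisfy λα* = p*, so one can sanity-check convergence by comparing λα with p after the run; the multiplicative-then-renormalize primal update is exactly the entropy-regularized (Fisher-Rao-style) mirror step, which keeps p strictly positive on the simplex at every iteration.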