Minimizing $f$-Divergences by Interpolating Velocity Fields (2305.15577v3)
Abstract: Many machine learning problems can be seen as approximating a \textit{target} distribution using a \textit{particle} distribution by minimizing their statistical discrepancy. Wasserstein Gradient Flow can move particles along a path that minimizes the $f$-divergence between the target and particle distributions. To move particles, we need to calculate the corresponding velocity fields, which are derived from a density ratio function between these two distributions. Previous works estimated such density ratio functions and then differentiated the estimated ratios. These approaches may suffer from overfitting, leading to less accurate estimates of the velocity fields. Inspired by non-parametric curve fitting, we directly estimate these velocity fields using interpolation techniques. We prove that our estimators are consistent under mild conditions. We validate their effectiveness using novel applications to domain adaptation and missing data imputation.
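To make the interpolation idea concrete, here is a minimal, self-contained sketch, not the paper's estimator: it smooths a hypothetical noisy per-particle velocity signal with a Nadaraya-Watson kernel smoother and then moves the particles with an explicit Euler step. The Gaussian kernel, the bandwidth, the toy velocity signal, and all function names are assumptions introduced purely for illustration.

```python
# Illustrative sketch only: a Nadaraya-Watson kernel smoother of the kind the
# interpolation view builds on, applied to a *hypothetical* noisy per-particle
# velocity signal. Kernel choice, bandwidth, and the toy velocity targets are
# assumptions for illustration, not the paper's method.
import numpy as np

def nadaraya_watson(x_query, x_data, y_data, bandwidth=0.5):
    """Kernel-weighted average of y_data evaluated at x_query (Gaussian kernel)."""
    # Pairwise squared distances between query points and data points.
    d2 = np.sum((x_query[:, None, :] - x_data[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))        # kernel weights
    w /= w.sum(axis=1, keepdims=True) + 1e-12       # normalize per query point
    return w @ y_data                               # smoothed estimate

# Toy 1-D example. For the KL(q||p) gradient flow the velocity field is
# v(x) = grad log p(x) - grad log q(x); here we only *pretend* to have noisy
# per-particle evaluations `v_noisy` (a placeholder for a ratio-based estimate)
# and smooth them before taking a particle step.
rng = np.random.default_rng(0)
particles = rng.normal(loc=-2.0, scale=1.0, size=(200, 1))  # current particles (q)
target = rng.normal(loc=2.0, scale=1.0, size=(200, 1))      # target samples (p)

step = 0.1
for _ in range(50):
    # Hypothetical noisy velocity signal pointing toward the target mean.
    v_noisy = (target.mean(axis=0) - particles) + rng.normal(scale=0.5, size=particles.shape)
    v_smooth = nadaraya_watson(particles, particles, v_noisy, bandwidth=0.8)
    particles = particles + step * v_smooth                  # explicit Euler update

print("particle mean after flow:", particles.mean())
```

The smoothing step is what replaces "estimate the ratio, then differentiate": the velocity values themselves are regressed on particle locations, in the spirit of Nadaraya (1964) and Watson (1964), before they are used to transport the particles.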