Optimising Distributions with Natural Gradient Surrogates (2310.11837v2)
Abstract: Natural gradient methods have been used to optimise the parameters of probability distributions in a variety of settings, often resulting in fast-converging procedures. Unfortunately, for many distributions of interest, computing the natural gradient poses a number of challenges. In this work we propose a novel technique for tackling such issues, which involves reframing the optimisation as one with respect to the parameters of a surrogate distribution, for which computing the natural gradient is easy. We give several examples of existing methods that can be interpreted as applying this technique, and propose a new method for applying it to a wide variety of problems. Our method expands the set of distributions that can be efficiently targeted with natural gradients. Furthermore, it is fast, easy to understand, simple to implement using standard autodiff software, and does not require lengthy model-specific derivations. We demonstrate our method on maximum likelihood estimation and variational inference tasks.
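The abstract describes the recipe only in words: optimise the target's loss, but precondition updates with the (easy) Fisher information of a surrogate distribution. As a rough illustration, and not the authors' code, the JAX sketch below applies this idea to maximum likelihood for a log-normal target using a Gaussian surrogate over the log-data; the choice of target and surrogate, the (mu, log sigma) parameterisation, and all function names are assumptions made for this sketch.

```python
import jax
import jax.numpy as jnp

# Toy positive-valued observations (illustrative only).
data = jnp.array([0.8, 1.5, 2.3, 0.4, 3.1])

def target_nll(lam):
    # Mean negative log-likelihood of a log-normal target,
    # parameterised by lam = (mu, log_sigma).
    mu, log_sigma = lam
    z = jnp.log(data)
    return jnp.mean(jnp.log(data) + log_sigma
                    + 0.5 * (z - mu) ** 2 * jnp.exp(-2.0 * log_sigma))

def surrogate_fisher(lam):
    # Fisher information of the Gaussian surrogate N(mu, sigma^2)
    # in (mu, log_sigma) coordinates: diag(1 / sigma^2, 2).
    # Closed-form, i.e. "easy" -- which is the point of the surrogate.
    _, log_sigma = lam
    return jnp.diag(jnp.array([jnp.exp(-2.0 * log_sigma), 2.0]))

@jax.jit
def step(lam, lr=0.5):
    g = jax.grad(target_nll)(lam)                       # Euclidean gradient
    nat_g = jnp.linalg.solve(surrogate_fisher(lam), g)  # precondition by F^{-1}
    return lam - lr * nat_g

lam = jnp.zeros(2)  # initial (mu, log_sigma)
for _ in range(50):
    lam = step(lam)
print(lam[0], jnp.exp(lam[1]))  # fitted mu and sigma
```

For this toy pairing the surrogate Fisher happens to coincide with the target's exact Fisher (a log-normal is a Gaussian on log-data), so the sketch reduces to exact natural gradient descent; in general the surrogate only needs to be convenient, not exact. For exponential-family surrogates the same step can also be taken without materialising the Fisher matrix, via the duality between natural and mean parameters.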
- S.-i. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 1998.
- A. Azzalini. The Skew-Normal and Related Families. Cambridge University Press, 2013.
- Exact natural gradient in deep linear networks and its application to the nonlinear case. In Advances in Neural Information Processing Systems, 2018.
- D. P. Bertsekas. Nonlinear Programming. Journal of the Operational Research Society, 1997.
- J. Blackard. Covertype. UCI Machine Learning Repository, 1998.
- Efficient and modular implicit differentiation. In Advances in Neural Information Processing Systems, 2022.
- JAX: composable transformations of Python+NumPy programs, 2018.
- LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.
- B. Christianson. Reverse accumulation and attractive fixed points. Optimization Methods & Software, 1994.
- P. R. Dewick and S. Liu. Copula modelling to analyse financial data. Journal of Risk and Financial Management, 2022.
- R. A. Fisher. The negative binomial distribution. Annals of Eugenics, 1941.
- Elliptical copulas: applicability and limitations. Statistics & Probability Letters, 2003.
- Fisher-Legendre (FishLeg) optimization of deep neural networks. In International Conference on Learning Representations, 2023.
- Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In Advances in Neural Information Processing Systems, 2018.
- W. C. Guenther. A simple approximation to the negative binomial (and regular binomial). Technometrics, 1972.
- Distributed Bayesian learning with stochastic natural gradient expectation propagation and the posterior server. Journal of Machine Learning Research, 2017.
- Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems, 2012.
- Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
- Maximum likelihood estimation of the correlation parameters for elliptical copulas. arXiv:1412.6316, 2014.
- T. Heskes. On “Natural” Learning and Pruning in Multilayered Perceptrons. Neural Computation, 2000.
- Stochastic variational inference. Journal of Machine Learning Research, 2013.
- S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, 2001.
- Epidemiological impacts of the NHS COVID-19 app in England and Wales throughout its first year. Nature Communications, 2023.
- M. Khan and W. Lin. Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. In International Conference on Artificial Intelligence and Statistics, 2017.
- Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In International Conference on Machine Learning, 2018.
- Variational adaptive-Newton method. In NeurIPS Workshop on Advances in Approximate Bayesian Inference, 2017.
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
- D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
- Limitations of the empirical Fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems, 2019.
- Gradient descent only converges to minimizers. In Conference on Learning Theory, 2016.
- Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In International Conference on Machine Learning, 2019.
- Superspreading and the effect of individual variation on disease emergence. Nature, 2005.
- J. O. Lloyd-Smith. Maximum likelihood estimation of the negative binomial dispersion parameter for highly overdispersed data, with applications to infectious diseases. PLOS ONE, 2007.
- J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015.
- T. Minka. Estimating a Dirichlet distribution. Technical report, Microsoft, September 2000.
- T. Minka. Estimating a gamma distribution. Technical report, Microsoft, April 2002.
- K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
- J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, USA, 2nd edition, 2006.
- Information-geometric optimization algorithms: A unifying picture via invariance principles. Journal of Machine Learning Research, 2017.
- Factors associated with length of stay in hospital among the elderly patients using count regression models. Medical Journal of the Islamic Republic of Iran, 2021.
- Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2017.
- Y. Ren and D. Goldfarb. Efficient subsampled Gauss-Newton and natural gradient methods for training neural networks. arXiv:1906.02353, 2019.
- B. Roe. MiniBooNE particle identification. UCI Machine Learning Repository, 2010.
- Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, 2007.
- Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In International Conference on Artificial Intelligence and Statistics, 2018.
- M.-A. Sato. Online model selection based on the variational Bayes. Neural Computation, 2001.
- Late-phase second-order training. In Has it Trained Yet? NeurIPS Workshop, 2022.
- M. Wainwright and M. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 2008.