
Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs (2403.13748v2)

Published 20 Mar 2024 in stat.ML, cs.LG, and stat.CO

Abstract: Given an intractable distribution $p$, the problem of variational inference (VI) is to find the best approximation from some more tractable family $Q$. Commonly, one chooses $Q$ to be a family of factorized distributions (i.e., the mean-field assumption), even though $p$ itself does not factorize. We show that this mismatch leads to an impossibility theorem: if $p$ does not factorize, then any factorized approximation $q \in Q$ can correctly estimate at most one of the following three measures of uncertainty: (i) the marginal variances, (ii) the marginal precisions, or (iii) the generalized variance (which can be related to the entropy). In practice, the best variational approximation in $Q$ is found by minimizing some divergence $D(q,p)$ between distributions, and so we ask: how does the choice of divergence determine which measure of uncertainty, if any, is correctly estimated by VI? We consider the classic Kullback-Leibler divergences, the more general Rényi divergences, and a score-based divergence which compares $\nabla \log p$ and $\nabla \log q$. We provide a thorough theoretical analysis in the setting where $p$ is a Gaussian and $q$ is a (factorized) Gaussian. We show that all the considered divergences can be ordered based on the estimates of uncertainty they yield as objective functions for VI. Finally, we empirically evaluate the validity of this ordering when the target distribution $p$ is not Gaussian.

Authors (3)
  1. Charles C. Margossian (20 papers)
  2. Loucas Pillaud-Vivien (19 papers)
  3. Lawrence K. Saul (9 papers)

Summary

  • The paper presents an impossibility theorem showing that factorized Gaussian VI cannot simultaneously match a target’s variance, precision, and entropy.
  • It compares divergences such as reverse KL, forward KL, Rényi, and score-based, revealing how each emphasizes different aspects of the target distribution.
  • Numerical experiments validate the divergence ordering and show that, for each target, a unique Rényi parameter α yields an approximation that matches the target's entropy.

An Analysis of Divergences for Variational Inference with Diagonal Gaussian Approximations

Introduction

Variational inference (VI) is a mainstay of computational statistics and machine learning, offering a tractable means of approximating complex distributions. At the core of VI lies the choice of divergence to minimize between the approximating distribution and the target distribution. While the Kullback-Leibler (KL) divergence enjoys widespread use, alternative divergences have been proposed, yet their comparative behavior remains under-explored. This paper investigates how the choice of divergence influences the efficacy of VI, specifically when approximating a Gaussian distribution with a Gaussian whose covariance matrix is diagonal. We examine the KL divergences, Rényi divergences, and a score-based divergence, showing how each divergence's properties translate into the behavior of the resulting VI approximations.

Impossibility Theorem for FG-VI

At the heart of our analysis is an impossibility theorem delineating the inherent trade-offs of factorized Gaussian variational inference (FG-VI). The theorem shows that no factorized approximation can simultaneously match the target distribution's marginal variances, marginal precisions, and entropy. As a consequence, the choice of divergence tailors the VI approximation toward capturing specific characteristics of the target distribution while inherently compromising on others.
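
To make this concrete in the Gaussian setting analyzed in the paper, suppose the target is $p = \mathcal{N}(\mu, \Sigma)$ and the approximation is $q = \mathcal{N}(\nu, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2))$. Matching the marginal variances requires $\sigma_i^2 = \Sigma_{ii}$; matching the marginal precisions requires $\sigma_i^{-2} = (\Sigma^{-1})_{ii}$; and matching the generalized variance (equivalently, the entropy) requires $\prod_i \sigma_i^2 = \det \Sigma$. When $\Sigma$ has nonzero off-diagonal entries, these requirements are mutually incompatible: $1/(\Sigma^{-1})_{ii}$ is the conditional variance of coordinate $i$ given the others, which is strictly smaller than the marginal variance $\Sigma_{ii}$, and Hadamard's inequality gives $\det \Sigma < \prod_i \Sigma_{ii}$. Hence at most one of the three measures can be matched; this is only a sketch of the Gaussian case, and the paper states the theorem more generally.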

Divergence Analysis and Ordering

The paper then examines the optimizations induced by different divergences in the Gaussian setting. The reverse KL divergence is shown to match the target's marginal precisions while underestimating its marginal variances and entropy. Conversely, the forward KL divergence matches the target's marginal variances at the cost of the precisions and overestimates the entropy. The Rényi divergences, parameterized by $\alpha$, interpolate between these behaviors, offering a tuning knob between variance and precision matching; for one specific value of $\alpha$, the approximation matches the target's entropy exactly.
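
In the Gaussian case, the reverse- and forward-KL solutions have simple closed forms, and a few lines of code suffice to see the resulting under- and over-estimation. The snippet below is a minimal sketch, not code from the paper, for a bivariate target with correlation 0.8; the variable names are illustrative.

```python
# Sketch: closed-form factorized-Gaussian fits to a correlated 2-D Gaussian target,
# illustrating the reverse-KL / forward-KL trade-off described above.
import numpy as np

rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])           # target covariance (unit marginal variances)
Prec = np.linalg.inv(Sigma)              # target precision matrix

# Reverse KL, min_q KL(q || p): the optimal diagonal q matches the marginal
# precisions, so its variances are 1 / (Sigma^{-1})_{ii}.
var_reverse = 1.0 / np.diag(Prec)        # [0.36, 0.36] -> underestimates

# Forward KL, min_q KL(p || q): the optimal diagonal q matches the marginal variances.
var_forward = np.diag(Sigma)             # [1.0, 1.0] -> matches

def diag_gaussian_entropy(variances):
    """Differential entropy of a diagonal Gaussian with the given variances."""
    d = len(variances)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.sum(np.log(variances)))

target_entropy = 0.5 * (2 * np.log(2 * np.pi * np.e) + np.log(np.linalg.det(Sigma)))

print("target marginal variances :", np.diag(Sigma))
print("reverse-KL variances      :", var_reverse)
print("forward-KL variances      :", var_forward)
print("entropy (target / reverse / forward):",
      target_entropy,
      diag_gaussian_entropy(var_reverse),   # below the target entropy
      diag_gaussian_entropy(var_forward))   # above the target entropy
```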

The score-based divergences introduce a distinct perspective, where the optimization problem transforms into a quadratic program. Notably, these divergences can predict marginal variances that are zero or infinite, a phenomenon we term "variational collapse," marking a stark departure from the KL and Rényi frameworks.
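
For intuition only, the sketch below fits a diagonal Gaussian by numerically minimizing a generic Fisher-type score criterion, $\mathbb{E}_q\|\nabla \log p - \nabla \log q\|^2$. This is a simplified stand-in, not the specific weighted score-based divergence (or its quadratic-program formulation) analyzed in the paper, and it is not claimed to reproduce variational collapse.

```python
# Sketch (assumption): fit q = N(0, diag(s)) to p = N(0, Sigma) by minimizing the
# Fisher-type objective E_q || grad log p(x) - grad log q(x) ||^2, which for
# Gaussians equals tr(A^2 D) with A = D^{-1} - Sigma^{-1} and D = diag(s).
import numpy as np
from scipy.optimize import minimize

rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
Prec = np.linalg.inv(Sigma)

def score_objective(log_s):
    s = np.exp(log_s)                    # optimize in log space to keep s > 0
    A = np.diag(1.0 / s) - Prec
    D = np.diag(s)
    return np.trace(A @ A @ D)

res = minimize(score_objective, x0=np.zeros(2), method="L-BFGS-B")

# Compare with the closed-form KL fits from the previous snippet; for this
# particular objective and target, the fitted variances shrink below even the
# reverse-KL values.
print("score-based fit variances:", np.exp(res.x))
print("reverse-KL variances     :", 1.0 / np.diag(Prec))
print("forward-KL variances     :", np.diag(Sigma))
```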

Building on these insights, the paper establishes a comprehensive ordering of the divergences by the marginal variances they estimate, which extends naturally to orderings by precision and entropy. This hierarchy can guide the selection of a divergence according to the inferential goals of VI.

Empirical Validation and Entropy Matching

The theoretical insights are juxtaposed with numerical experiments on several models, ranging from Gaussian targets to hierarchical and time-series models. These experiments confirm the divergence ordering in specific cases, while also highlighting the prevalence of the entropy-based ordering when approximating non-Gaussian targets. Remarkably, for each target there exists a unique Rényi parameter $\alpha$ at which the approximation matches the target's entropy, although finding this $\alpha$ is non-trivial.

Conclusion

This paper makes strides in unpacking the effects of divergence choice on the outcomes of VI, particularly in the setting of factorized Gaussian approximations. The impossibility theorem and the systematic ordering of divergences offer a conceptual framework for anticipating the behavior of VI approximations, helping practitioners align their divergence choice with their inferential objectives. Future work may explore these trade-offs in richer variational families or under different model assumptions, paving the way for more nuanced applications of VI.