- The paper introduces an optimization-centric view of Bayesian inference by framing Bayes' rule as an infinite-dimensional optimization problem.
- It proposes the Rule of Three, which combines a loss function, a divergence, and a feasible space to derive generalized, more robust posterior distributions.
- Generalized Variational Inference (GVI) makes this framework practical for complex models by producing posteriors that are less sensitive to model mis-specification and computational constraints.
Generalized Variational Inference: Some New Perspectives on Bayesian Posteriors
The paper "Generalized Variational Inference: Three arguments for deriving new Posteriors" by Jeremias Knoblauch, Jack Jewson, and Theo Damoulas proposes a novel approach to Bayesian inference, addressing limitations of standard methodologies when applied to modern statistical and machine learning problems. This paper critically examines the assumptions underlying traditional Bayesian methods and explores the implications of these assumptions in the context of large-scale inference variables and models.
Motivation and Background
The paper is driven by the realization that the classical Bayesian framework, while powerful in its probabilistic foundations, is often misaligned with the realities of contemporary statistical applications. Standard Bayesian inference implicitly assumes a well-specified prior, an accurate likelihood model, and unlimited computational resources. These assumptions are routinely violated in practice, especially in complex machine learning models where computational constraints, mis-specified likelihoods, and non-informative or default priors are commonplace.
Key Contributions
- Optimization-centric View: The authors propose an optimization-focused perspective on Bayesian inference, presenting Bayes' rule as an infinite-dimensional optimization problem. This is a notable step toward reshaping Bayesian inference as part of regularized optimization, akin to empirical risk minimization but with a probabilistic interpretation.
- Rule of Three (RoT): The paper introduces the Rule of Three, a generalized form of posterior distribution. This framework is defined by three arguments: a loss function, a divergence (to regularize deviation from the prior), and a feasible space for optimization. This formulation explicitly acknowledges and addresses the computational and conceptual constraints of conventional Bayesian posteriors.
- Generalized Variational Inference (GVI): Building on the RoT, GVI restricts the feasible space to a tractable, parameterized family, focusing on practical applicability. The resulting GVI posteriors offer a flexible yet robust mechanism for producing posterior distributions that are less sensitive to traditional Bayesian pitfalls such as model or prior mis-specification. The three defining objects are sketched below.
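To make the three arguments concrete, the objectives below paraphrase the paper's formulation in standard notation (x_{1:n} the data, π the prior, 𝒫(Θ) the set of all probability measures on the parameter space); the exact symbols may differ slightly from the paper.

```latex
% Bayes' rule recovered as an infinite-dimensional optimization problem
% (loss = negative log likelihood, divergence = KL, feasible set = all of P(Theta)):
q^{*}_{\mathrm{B}}
  = \operatorname*{arg\,min}_{q \in \mathcal{P}(\Theta)}
    \Big\{ \mathbb{E}_{\theta \sim q}\Big[ -\textstyle\sum_{i=1}^{n} \log p(x_i \mid \theta) \Big]
           + \mathrm{KL}(q \,\|\, \pi) \Big\}

% Rule of Three P(\ell, D, \Pi): each of the three ingredients may be swapped out.
q^{*}
  = \operatorname*{arg\,min}_{q \in \Pi}
    \Big\{ \mathbb{E}_{\theta \sim q}\Big[ \textstyle\sum_{i=1}^{n} \ell(\theta, x_i) \Big]
           + D(q \,\|\, \pi) \Big\}

% GVI: restrict \Pi to a tractable parameterized family \mathcal{Q}
% (e.g. mean-field Gaussians) and optimize over its parameters.
```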
Theoretical Implications
The paper highlights several theoretical insights:
- Modularity and Flexibility: The RoT framework is modular, allowing for targeted adjustments to improve robustness against model and prior mis-specification while retaining computational feasibility.
- Consistency and Bounds: The paper establishes GVI posterior consistency under certain conditions, meaning the GVI posterior concentrates appropriately as the sample size grows. The authors also interpret certain GVI objectives as lower-bound approximations to the Bayesian evidence, in line with Bayesian predictive accuracy in practice; the standard variational special case is sketched after this list.
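One way to see the evidence connection, written in standard variational-inference notation rather than as a quotation of the paper: choosing the negative log likelihood as the loss, the KL divergence, and a variational family 𝒬 recovers standard VI, whose negated objective is the familiar evidence lower bound (ELBO).

```latex
% Special case P(-\log p(\cdot \mid \theta), \mathrm{KL}, \mathcal{Q}) = standard VI:
\operatorname*{arg\,min}_{q \in \mathcal{Q}}
  \Big\{ \mathbb{E}_{q}\big[ -\log p(x_{1:n} \mid \theta) \big] + \mathrm{KL}(q \,\|\, \pi) \Big\}
  = \operatorname*{arg\,max}_{q \in \mathcal{Q}} \, \mathrm{ELBO}(q)

% The ELBO lower-bounds the log evidence, which is the link to predictive accuracy:
\mathrm{ELBO}(q)
  = \mathbb{E}_{q}\big[ \log p(x_{1:n} \mid \theta) \big] - \mathrm{KL}(q \,\|\, \pi)
  \le \log p(x_{1:n})
```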
Applications
- Bayesian Neural Networks (BNN): In contexts where priors might be default or poorly specified, GVI offers a way to construct posteriors that are less sensitive to prior inaccuracies, significantly improving predictive performance by focusing posterior mass around empirical risk minimizers.
- Deep Gaussian Processes (DGP): The flexibility of GVI in swapping the loss function for robust scoring rules demonstrates its utility for modern inference problems in which traditional likelihood models inadequately capture data heterogeneity or outliers; an example of such a robust loss is sketched after this list.
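As an illustration of the kind of robust scoring rule mentioned above, one common parameterization of the density-power (β) loss is given below, following Basu et al.'s density power divergence; the constants and indexing in the paper may differ. As β → 0 it recovers the negative log likelihood up to additive constants, while β > 0 bounds the influence of observations the model deems unlikely, which is the source of robustness to outliers.

```latex
% Per-observation density-power (beta) loss replacing -log p(x_i | theta):
\ell_{\beta}(\theta, x_i)
  = -\Big(1 + \tfrac{1}{\beta}\Big)\, p(x_i \mid \theta)^{\beta}
    + \int p(z \mid \theta)^{1+\beta}\, dz, \qquad \beta > 0

% For a Gaussian likelihood N(x; mu, sigma^2) the integral is available in closed form:
\int \mathcal{N}(z; \mu, \sigma^{2})^{1+\beta}\, dz
  = (1+\beta)^{-1/2}\, (2\pi\sigma^{2})^{-\beta/2}
```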
Practical Implications
The authors implement GVI with standard variational machinery that adapts to both computational constraints and model complexity, reporting improvements in prediction error and log likelihood across several datasets. Implementation details, including sensitivity to hyperparameters and empirical validation against conventional methods, are critical for applying these techniques effectively; a minimal sketch of such an objective follows.
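The sketch below is a minimal, illustrative Monte Carlo estimate of a GVI-style objective, assuming a mean-field Gaussian variational family, a Gaussian prior, and the KL divergence as regularizer; the function names (gvi_objective, nll_loss) and the toy data are hypothetical and do not mirror the authors' code. With the negative log likelihood plugged in it collapses to standard VI; robustness comes from swapping loss_fn for a robust scoring rule (such as the β-loss above) or the KL term for another divergence.

```python
import numpy as np

def kl_mean_field_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2) )."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
        - 0.5
    )

def gvi_objective(mu_q, rho_q, x, loss_fn, mu_p, sigma_p, n_samples=32, rng=None):
    """Monte Carlo estimate of E_q[ sum_i loss(theta, x_i) ] + KL(q || prior),
    with q a mean-field Gaussian whose scales are softplus(rho_q)."""
    rng = rng or np.random.default_rng(0)
    sigma_q = np.log1p(np.exp(rho_q))           # softplus keeps scales positive
    eps = rng.standard_normal((n_samples, mu_q.size))
    thetas = mu_q + sigma_q * eps               # reparameterized draws theta ~ q
    expected_loss = np.mean([np.sum(loss_fn(theta, x)) for theta in thetas])
    return expected_loss + kl_mean_field_gaussian(mu_q, sigma_q, mu_p, sigma_p)

def nll_loss(theta, x, obs_sigma=1.0):
    """Per-observation negative log likelihood of N(x; theta[0], obs_sigma^2)."""
    return 0.5 * ((x - theta[0]) / obs_sigma) ** 2 + 0.5 * np.log(2 * np.pi * obs_sigma**2)

# Toy data: a location model with two gross outliers.
x = np.concatenate([np.random.default_rng(1).normal(2.0, 1.0, 100), [15.0, 18.0]])
value = gvi_objective(np.zeros(1), np.zeros(1), x, nll_loss,
                      mu_p=np.zeros(1), sigma_p=np.ones(1))
print(f"GVI objective at the initial variational parameters: {value:.2f}")
```

In practice the variational parameters (mu_q, rho_q) would be optimized with stochastic gradients via automatic differentiation; the point of the sketch is only the shape of the objective: an expected loss plus a divergence from the prior, minimized over a restricted family.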
Future Directions
The paper opens several avenues for future inquiry, such as exploring connections with PAC-Bayes and information theory, improving hyperparameter tuning for better predictive performance and uncertainty quantification, and characterizing posterior contraction rates under various divergences.
This paper argues for a reinterpretation of Bayesian machine learning, proposing a framework that retains probabilistic rigor while explicitly accounting for model mis-specification and finite computation. It challenges researchers to rethink Bayesian updating in light of modern computational and inferential constraints, urging a shift toward more adaptable, transparent, and computationally feasible methods.