
Stochastic Gradient Variational Inference with Price's Gradient Estimator from Bures-Wasserstein to Parameter Space

Published 21 Feb 2026 in stat.ML, cs.LG, math.OC, and stat.CO | (2602.18718v1)

Abstract: For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) perform gradient descent in measure space (Bures-Wasserstein space) and parameter space, respectively. Previously, for the Gaussian variational family, convergence guarantees for WVI have shown superiority over existing results for black-box VI with the reparametrization gradient, suggesting the measure space approach might provide some unique benefits. In this work, however, we close this gap by obtaining identical state-of-the-art iteration complexity guarantees for both. In particular, we identify that WVI's superiority stems from the specific gradient estimator it uses, which BBVI can also leverage with minor modifications. The estimator in question is usually associated with Price's theorem and utilizes second-order information (Hessians) of the target log-density. We will refer to this as Price's gradient. On the flip side, WVI can be made more widely applicable by using the reparametrization gradient, which requires only gradients of the log-density. We empirically demonstrate that the use of Price's gradient is the major source of performance improvement.

Summary

  • The paper demonstrates that the superior performance of Wasserstein VI is driven by Price's Hessian-based gradient estimator rather than the underlying geometry.
  • The authors derive theoretical iteration complexity bounds showing that both measure-space and parameter-space methods achieve equivalent performance with Price's estimator.
  • Empirical evaluations confirm that employing Price's gradient significantly reduces variance compared to reparametrization gradients, enhancing robust Gaussian VI.

Stochastic Gradient Variational Inference Using Price's Gradient: Theoretical and Practical Unification of Bures-Wasserstein and Parameter Space Approaches

Overview

The paper "Stochastic Gradient Variational Inference with Price's Gradient Estimator from Bures-Wasserstein to Parameter Space" (2602.18718) investigates the interplay between two major approaches for variational inference (VI) in the setting where the target density $\pi$ is only accessible up to an unnormalized log-density. Stochastic gradient methods are the canonical tool for such problems. The first approach, Wasserstein VI (WVI), formulates VI as a gradient descent in measure space (notably the Bures-Wasserstein geometry for Gaussian variational families); the second, black-box VI (BBVI), works entirely in Euclidean parameter space. Prior results favored WVI due to superior iteration complexity bounds when employing Hessian (second-order) information for Gaussians. This paper closes this complexity gap, demonstrating that the observed theoretical and empirical advantages for WVI result primarily from its use of a superior gradient estimator—specifically, one derived from Price's theorem that leverages Hessian information—rather than any intrinsic benefit of the Wasserstein geometry. Notably, the authors provide explicit complexity results for both approaches using both Price-style and reparametrization gradients, and deliver strong empirical evidence that Price's estimator is the dominant factor in performance differences.

Theoretical Foundations

Modern VI seeks to approximate a target $\pi$ by minimizing the variational free energy $\mathcal{F}(q) = \mathcal{E}(q) + \mathcal{H}(q)$, with the energy $\mathcal{E}(q) = \int U(z)\, q(dz)$ and entropy term $\mathcal{H}$. The population version of VI is traditionally addressed with gradient descent using stochastic estimators, often either in a parameterized family (e.g., the Gaussian family as a Bures-Wasserstein space) or in the parameter space $\Lambda$ directly. A critical aspect of tractability for these methods is the estimation of intractable expectations with stochastic gradients.

Gradient Estimators: Reparametrization vs. Price

BBVI typically estimates gradients via the reparametrization trick, which utilizes only first-order information and is applicable to a broad class of variational families. In contrast, WVI for Gaussians has utilized a Hessian-based estimator derived from Price's theorem, offering lower variance due to use of second-order information. The Hessian-based estimator (Price's gradient) is:

  • For the covariance $\Sigma$ of a Gaussian: $\nabla_\Sigma\, \mathbb{E}_q f = \frac{1}{2}\, \mathbb{E}_q[\nabla^2 f]$ (Price's theorem).
  • For the natural parameterization, this equates to the "Stein" identity linking first and second derivatives of $U$.
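As an illustration (not taken from the paper), Price's identity can be checked by Monte Carlo in one dimension, where the derivative of $\mathbb{E}_q f$ with respect to the variance is available in closed form for the test function $f(z) = z^4$:

```python
import numpy as np

# Monte Carlo check of Price's identity  d/ds E_q[f] = (1/2) E_q[f''(z)]
# for a 1-D Gaussian q = N(m, s) and the test function f(z) = z^4.
# Illustrative sketch only; m, s, and f are arbitrary choices.
rng = np.random.default_rng(0)
m, s = 0.5, 2.0                      # mean and variance of q
n = 200_000

z = m + np.sqrt(s) * rng.standard_normal(n)

# Price's estimator: f''(z) = 12 z^2, so average (1/2) * 12 z^2 over samples.
price_grad = 0.5 * np.mean(12.0 * z**2)

# Closed form: E[z^4] = m^4 + 6 m^2 s + 3 s^2, hence d/ds E[z^4] = 6 m^2 + 6 s.
exact_grad = 6.0 * m**2 + 6.0 * s

print(f"Price estimate: {price_grad:.3f}, exact: {exact_grad:.3f}")
```

The estimator is unbiased, so the Monte Carlo average converges to the closed-form derivative as the sample size grows.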

The paper highlights that the theoretical iteration complexity gap observed between WVI and BBVI with standard first-order gradients vanishes when both use the same (Hessian-based) Price estimator.

Complexity Results

Strong convexity and smoothness are assumed for the log-density $U$, yielding $\mu$-strong convexity and $L$-smoothness in the energy. Prior work gave WVI superior non-asymptotic rates, but this was based on comparing Hessian-based gradients for WVI to first-order gradients for BBVI.

This work demonstrates that, with Price's estimator, both SPBWGD (stochastic proximal Bures-Wasserstein gradient descent) and SPGD (stochastic proximal gradient descent in Euclidean parameter space) attain matching iteration complexities:

$$\mathcal{O}\!\left( \frac{d\kappa}{\epsilon} + \frac{\sqrt{d}\,\kappa^{3/2}\log(\kappa \Delta^2)}{\sqrt{\epsilon}} + \kappa^2 \log\!\left(\frac{\Delta^2}{\epsilon}\right) \right)$$

where $\kappa = L/\mu$ and $\Delta$ measures the initialization gap. This improves earlier bounds for both approaches, showing the observed differences are due to estimator variance, not geometry. Using only reparametrization gradients, convergence is worse by factors tied to the ill-conditioning and the trace of the optimal covariance.

The analysis carefully relates gradient estimator variance, Bregman divergences, and measure geometry, and shows that both the analysis methods and the central contraction/variance properties are essentially identical modulo the gradient estimator involved.

Algorithmic and Practical Insights

Algorithmic Adaptability

A key theoretical contribution is showing that both measure-based WVI and parameter-based BBVI can use either estimator:

  • WVI can use the reparametrization gradient, extending it to situations with only first-order access to UU.
  • BBVI can use Price's gradient, provided Hessian access is available, closing the gap in complexity and performance.

This removes previous limitations on applicability and indicates that estimator selection—rather than choice of geometry—drives the iteration complexity for Gaussian VI.
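The variance gap between the two estimators can be seen directly in the toy 1-D setting above (an assumed setup for illustration, not the paper's experiments): for $q = \mathcal{N}(m, s)$ and $f(z) = z^4$, both estimators of $\tfrac{d}{ds}\,\mathbb{E}_q f$ are unbiased, but their per-sample variances differ sharply.

```python
import numpy as np

# Compare per-sample variance of the reparametrization and Price estimators
# of d/ds E_q[f] for q = N(m, s) in 1-D with f(z) = z^4 (illustrative sketch).
rng = np.random.default_rng(1)
m, s = 0.5, 2.0
n = 200_000

eps = rng.standard_normal(n)
z = m + np.sqrt(s) * eps

# Reparametrization: z = m + sqrt(s) * eps, so dz/ds = eps / (2 sqrt(s)),
# giving the estimator f'(z) * dz/ds = 4 z^3 * eps / (2 sqrt(s)).
reparam = 4.0 * z**3 * eps / (2.0 * np.sqrt(s))

# Price: (1/2) f''(z) = 6 z^2.
price = 6.0 * z**2

print(f"means:     reparam {reparam.mean():.2f}, price {price.mean():.2f}")
print(f"variances: reparam {reparam.var():.1f}, price {price.var():.1f}")
```

Both sample means agree with the closed-form gradient $6m^2 + 6s$, while the Price estimator shows a much smaller variance, mirroring the paper's claim that the Hessian-based estimator is the main source of the performance gap.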

Empirical Evaluation

Extensive empirical tests on high-dimensional Bayesian benchmarks confirm the analysis:

  • Both SPGD and SPBWGD achieve similar performance when both use Price's gradient, except for special pathological cases due to the way noise is handled in matrix updates.
  • Use of the reparametrization gradient in either SPGD or SPBWGD leads to poor performance relative to Price's estimator, requiring much smaller step sizes for convergence and delivering higher free energies at any fixed step count.
  • The performance gap between measure-based and parameter-based schemes is much smaller than between either using reparametrization and Hessian-based gradients.

The authors argue that empirically, the main performance lever in practical Gaussian VI is the variance reduction from the Hessian-based estimator, not the geometry of the update itself.

Implications and Future Research Directions

These results are significant for both theoretical understanding of VI dynamics and practical development of scalable inference systems:

  • Estimator quality is primary: For Gaussian families, the structural and empirical advantages attributed to second-order Wasserstein methods in the literature can be recovered by adopting the matched Hessian-based estimator in parameter space.
  • Generalization to other families: While the focus is Gaussians, alongside the Bures-Wasserstein manifold structure, open questions remain for fully nonparametric cases, flexible normalizing flows, and hierarchical or non-conjugate models. The impact of estimator variance in these broader settings warrants further investigation.
  • Natural gradient and information geometry: The study points out that natural-gradient VI may offer further speed-ups due to Riemannian preconditioning; full theoretical analysis in nonconjugate cases is still pending and represents a promising direction.
  • Numerical and computational cost: The Hessian-based estimators have cubic computational scaling in $d$ (due to full Hessian-vector operations) compared with quadratic for first-order estimators—important for very high-dimensional finite models or latent variable models with structure. Optimal trade-offs between estimator variance and computational cost, possibly via Hessian-vector products or low-rank approximations, deserve further study.
  • Robustness considerations: Nontrivial numerical stability issues (e.g., with matrix square roots and Cholesky decompositions in WVI) might make parameter-space schemes operationally preferable in practical software, even when the iteration complexity is equivalent.
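One way to avoid forming the full $d \times d$ Hessian, as the cost discussion above suggests, is a matrix-free Hessian-vector product. A minimal finite-difference sketch (illustrative; the toy quadratic $U$ and step size are assumptions, not from the paper):

```python
import numpy as np

# Matrix-free Hessian-vector product via central differences of the gradient:
# two gradient evaluations instead of materializing the d x d Hessian.
def hvp(grad_U, z, v, h=1e-5):
    """Approximate (nabla^2 U(z)) @ v without forming the Hessian."""
    return (grad_U(z + h * v) - grad_U(z - h * v)) / (2.0 * h)

rng = np.random.default_rng(2)
d = 50
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)          # positive-definite toy Hessian
grad_U = lambda z: A @ z             # gradient of U(z) = 0.5 z^T A z

z = rng.standard_normal(d)
v = rng.standard_normal(d)
approx = hvp(grad_U, z, v)
exact = A @ v                        # ground truth for the quadratic toy case
print(f"max HVP error: {np.max(np.abs(approx - exact)):.2e}")
```

In practice one would use exact HVPs from automatic differentiation rather than finite differences, but the cost profile is the same: each product costs on the order of a gradient evaluation.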

Conclusion

This work provides a definitive account of the equivalence, in both theory and practice, of measure-space (Wasserstein) and parameter-space (Euclidean) approaches to stochastic gradient VI for Gaussian variational families, conditional on the use of the same gradient estimator. The main finding is that the iteration complexity and practical performance differences observed in previous comparisons derive from the use of Price's Hessian-based gradient estimator, not from any intrinsic geometric property of the underlying optimization space. This unifies the theoretical analyses for both approaches, demonstrates how to obtain optimal iteration complexity previously found only in Wasserstein VI in BBVI, and significantly informs the future development of stochastic variational inference algorithms (2602.18718).
