Variational Lower Bound and Negative ELBO
- The variational lower bound (ELBO) and its negation, the negative ELBO, are core objectives whose gap to the log evidence equals the KL divergence between approximate and true posteriors in probabilistic models.
- It is decomposed into KL divergence and reconstruction terms, enabling clear rate–distortion trade-offs in variational autoencoders and related models.
- Recent advances leverage gradient estimators, semi-implicit methods, and geometric interpretations to enhance optimization, diagnostics, and convergence analysis.
A variational lower bound, typically called the Evidence Lower Bound (ELBO), is a central objective in variational inference for probabilistic models, particularly in variational autoencoders (VAEs) and Bayesian latent variable models. The negative ELBO—often referred to as the variational free energy—serves as a minimization target and quantifies the gap between an approximate posterior and the true posterior. In modern research, the development and analysis of negative ELBO-based objectives have become essential for model design, gradient estimator theory, and understanding information-theoretic properties of generative models.
1. Formal Definition and Standard Decomposition
The ELBO for observed data $x$ and latent variable $z$, under a model $p_\theta(x, z)$ and variational distribution $q_\phi(z)$ (or $q_\phi(z \mid x)$ for amortized inference), is defined as
$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x, z) - \log q_\phi(z)\big].$$
Its negative form, minimized in practice, is
$$\mathcal{F}(\theta, \phi) = -\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z)}\big[\log q_\phi(z) - \log p_\theta(x, z)\big].$$
Minimizing $\mathcal{F}$ is equivalent to minimizing the Kullback–Leibler divergence to the true posterior:
$$\mathcal{F}(\theta, \phi) = \mathrm{KL}\big(q_\phi(z) \,\|\, p_\theta(z \mid x)\big) - \log p_\theta(x),$$
where $\log p_\theta(x)$ is constant with respect to $\phi$ (Yin et al., 2018, Chérief-Abdellatif, 2018). In amortized settings (e.g., VAEs), one commonly writes
$$-\mathcal{L}(\theta, \phi; x) = \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) - \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big].$$
This decomposition tracks rate (KL term) and distortion (reconstruction term) as in rate–distortion theory (Alemi et al., 2017, Lastras, 2019).
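As a concrete illustration, the amortized decomposition can be evaluated in closed form for a diagonal-Gaussian encoder with a standard normal prior and a Bernoulli decoder. A minimal NumPy sketch (function names are illustrative, not from any of the cited works):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def bernoulli_log_lik(x, logits):
    """log p(x | z) for a Bernoulli decoder: x*l - log(1 + e^l) per pixel."""
    return np.sum(x * logits - np.logaddexp(0.0, logits))

def negative_elbo(x, mu, logvar, logits):
    """Rate (KL term) plus distortion (negative reconstruction log-likelihood)."""
    rate = gaussian_kl(mu, logvar)
    distortion = -bernoulli_log_lik(x, logits)
    return rate + distortion
```

When the encoder matches the prior exactly (`mu = 0`, `logvar = 0`), the rate term vanishes and the objective reduces to pure distortion.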
2. Information-Theoretic Interpretations and Structural Decomposition
At stationary points, for a wide class of models, the ELBO can be expressed as a sum (and difference) of entropies; for VAEs,
$$\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N} H\big[q_\phi(z \mid x^{(n)})\big] \;-\; H\big[p_\theta(z)\big] \;-\; \frac{1}{N}\sum_{n=1}^{N} \mathbb{E}_{q_\phi(z \mid x^{(n)})}\, H\big[p_\theta(x \mid z)\big].$$
For standard exponential family models (notably Gaussian VAEs), this reduces to a tractable, closed-form expression in terms of parameterized variances and means (Damm et al., 2020, Lücke et al., 2022). This entropy-sum characterization enables efficient, variance-free evaluation of the ELBO at convergence, and provides principled diagnostics for phenomena such as posterior collapse. The negative ELBO, $-\mathcal{L}$, thus admits an interpretation as the aggregate mismatch and compression cost imposed by approximate inference.
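For a Gaussian VAE with diagonal covariances, each of the three entropies is available in closed form, so the entropy-sum expression can be evaluated without sampling. The sketch below uses hypothetical function names and assumes a decoder with constant variance, so its entropy does not depend on $z$ and the inner expectation drops out:

```python
import numpy as np

def gaussian_entropy(logvar):
    """Entropy of a diagonal Gaussian: 0.5 * sum(log(2*pi*e) + logvar)."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e) + logvar)

def entropy_sum_elbo(enc_logvars, prior_logvar, dec_logvar):
    """Variance-free ELBO surrogate at a stationary point:
    mean encoder entropy minus prior entropy minus decoder entropy."""
    mean_enc = np.mean([gaussian_entropy(lv) for lv in enc_logvars])
    return mean_enc - gaussian_entropy(prior_logvar) - gaussian_entropy(dec_logvar)
```

Note that this equality is claimed by the cited works only at stationary points of training; away from convergence the expression is a diagnostic, not the ELBO itself.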
3. Advanced Variational Families and Sandwich Bounds
Generative models increasingly employ variational families for which the marginal density $q_\phi(z)$ is intractable. Semi-implicit variational inference (SIVI) constructs two-level mixtures
$$q_\phi(z) = \int q(z \mid \psi)\, q_\phi(\psi)\, d\psi,$$
where both $q(z \mid \psi)$ and $q_\phi(\psi)$ are reparameterizable but the mixture is not necessarily tractable (Yin et al., 2018). The SIVI framework defines Monte Carlo-based lower and upper bounds that sandwich the true ELBO,
$$\underline{\mathcal{L}}_K \;\le\; \mathcal{L} \;\le\; \overline{\mathcal{L}}_K,$$
where $K$ is the number of auxiliary samples. Both bounds converge monotonically to the true ELBO as $K \to \infty$. SIVI provides an unbiased surrogate objective whose gradient can be estimated stochastically and which, for finite $K$, is always a valid lower bound. This approach generalizes to doubly semi-implicit settings, where both prior and variational distributions are semi-implicit mixtures, preserving the sandwich property (Molchanov et al., 2018).
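The lower-bound construction can be sketched numerically in a toy model chosen so that the semi-implicit marginal happens to be tractable, which lets the bound be compared against a direct ELBO estimate (all distributions here are illustrative: standard normal prior, Gaussian likelihood, Gaussian mixing distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0          # observed datum
tau = 0.5        # conditional std of q(z | psi)

def log_norm(z, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (z - m)**2 / (2 * s**2)

def log_joint(z):
    # p(z) = N(0, 1), p(x | z) = N(z, 1)
    return log_norm(z, 0.0, 1.0) + log_norm(x, z, 1.0)

def sivi_lower_bound(K, M=50_000):
    """Monte Carlo estimate of the SIVI lower bound with K auxiliary samples:
    the intractable q(z) is replaced by a (K+1)-sample mixture estimate that
    includes the psi actually used to generate z."""
    psi0 = rng.standard_normal(M)                 # psi ~ q(psi) = N(0, 1)
    z = psi0 + tau * rng.standard_normal(M)       # z ~ q(z | psi0)
    psis = rng.standard_normal((K + 1, M))
    psis[0] = psi0                                # keep the generating psi
    log_q = log_norm(z[None, :], psis, tau)
    log_q_mix = np.logaddexp.reduce(log_q, axis=0) - np.log(K + 1)
    return np.mean(log_joint(z) - log_q_mix)

def elbo_exact_q(M=50_000):
    """Reference ELBO using the (here tractable) marginal q(z) = N(0, 1 + tau^2)."""
    s = np.sqrt(1 + tau**2)
    z = s * rng.standard_normal(M)
    return np.mean(log_joint(z) - log_norm(z, 0.0, s))
```

Increasing `K` tightens `sivi_lower_bound(K)` toward `elbo_exact_q()` from below, matching the monotone convergence stated above.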
4. Gradient Estimators and Optimization Strategies
Gradient-based optimization of the negative ELBO typically employs two classes of estimators:
- Score-function (REINFORCE) estimators use the identity
$$\nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\big[f(z)\, \nabla_\phi \log q_\phi(z)\big],$$
but suffer from high variance (Dib, 2020).
- Reparameterization (pathwise) estimators: for $z = g_\phi(\epsilon)$ with $\epsilon \sim p(\epsilon)$, they leverage
$$\nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{p(\epsilon)}\big[\nabla_\phi f(g_\phi(\epsilon))\big].$$
When available, reparameterization yields lower-variance, unbiased estimates.
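The variance gap between the two estimators is easy to see on a toy objective with a known gradient. The sketch below (the target $\mathbb{E}[z^2]$ is chosen purely for illustration) estimates $\nabla_\mu\, \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2] = 2\mu$ both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 100_000
f = lambda z: z**2                  # true gradient w.r.t. mu is 2*mu

# Score-function estimator: f(z) * d/dmu log N(z; mu, 1) = f(z) * (z - mu)
z = mu + rng.standard_normal(n)
score_grads = f(z) * (z - mu)

# Reparameterization: z = mu + eps, so d/dmu f(mu + eps) = 2 * (mu + eps)
eps = rng.standard_normal(n)
reparam_grads = 2.0 * (mu + eps)

print(score_grads.mean(), reparam_grads.mean())  # both close to 2*mu = 3.0
print(score_grads.var(), reparam_grads.var())    # reparam variance is far smaller
```

Both estimators are unbiased, but the score-function estimator's variance here is an order of magnitude larger, which is the practical motivation for preferring the pathwise form when $f$ and $g_\phi$ are differentiable.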
Variance reduction can be achieved by deterministic quasi-Monte Carlo or quantization schemes; for instance, Quantized Variational Inference (QVI) replaces Monte Carlo with optimal cubature over quantized support points, yielding zero-variance (but biased) gradients. The bias decays polynomially with the number of quantization points, and Richardson extrapolation can further reduce bias (Dib, 2020).
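QVI proper uses optimal quantization of the variational support; the underlying idea of replacing random samples with deterministic nodes can be illustrated more simply with ordinary Gauss-Hermite quadrature for a Gaussian variational family (a simplification for exposition, not the QVI cubature rule):

```python
import numpy as np

def gauss_hermite_expectation(f, mu, sigma, n_points=16):
    """Deterministic (zero-variance, biased for general f) estimate of
    E_{z ~ N(mu, sigma^2)}[f(z)] via Gauss-Hermite quadrature:
    substitute z = mu + sqrt(2)*sigma*t into the Hermite weight integral."""
    t, w = np.polynomial.hermite.hermgauss(n_points)
    return np.sum(w * f(mu + np.sqrt(2.0) * sigma * t)) / np.sqrt(np.pi)
```

Because the nodes are fixed, repeated evaluations give identical gradients (zero variance); the residual error is a deterministic bias that shrinks as the number of nodes grows, mirroring the bias-for-variance trade described above.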
For the VR-IWAE class of bounds, the choice of estimator (reparameterized or doubly-reparameterized) affects how the signal-to-noise ratio (SNR) scales with the number of importance samples $N$ and with the model class (Daudel et al., 2024). In high-dimensional regimes, importance-weight collapse may nullify SNR gains unless $N$ grows exponentially with the latent dimension.
5. Extensions: Rate–Distortion, Thermodynamic, and Discrete Variants
Interpreting the negative ELBO through the lens of rate–distortion theory, minimizing $-\mathcal{L}$ corresponds to minimizing the sum $R + D$, where the rate $R = \mathbb{E}\big[\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\big]$ measures information encoding cost and the distortion $D = -\mathbb{E}\big[\log p_\theta(x \mid z)\big]$ measures reconstruction error (Lastras, 2019, Alemi et al., 2017). This framework clarifies trade-offs and motivates alternative objectives, e.g., enforcing minimum mutual information or rate lower bounds (free bits) to prevent posterior collapse (Alemi et al., 2017). Thermodynamic Variational Objectives (TVO) further generalize the ELBO via path integration over interpolations between the variational posterior and the model joint, yielding tighter bounds via Riemann-sum approximations (Masrani et al., 2019).
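The free-bits modification acts directly on the rate term: each latent dimension contributes at least a fixed floor to the KL, removing the incentive to collapse that dimension onto the prior. A minimal sketch for a diagonal-Gaussian encoder (the 0.5-nat floor is an arbitrary illustrative choice):

```python
import numpy as np

def free_bits_objective(mu, logvar, recon_log_lik, free_bits=0.5):
    """Negative ELBO with a per-dimension rate floor ('free bits').
    Dimensions whose KL to the N(0, I) prior falls below the floor
    contribute the floor instead of their actual KL."""
    kl_per_dim = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)
    rate = np.sum(np.maximum(kl_per_dim, free_bits))
    return rate - recon_log_lik
```

With a fully collapsed encoder (`mu = 0`, `logvar = 0`) the rate no longer vanishes but sits at `d * free_bits`, so gradient pressure shifts entirely to the distortion term.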
For graphical models with discrete latent variables, entropy and expectations under expressive distributions (e.g., selective-SPNs) can be computed exactly, circumventing the limitations of sampling-based estimators and enabling direct optimization of the negative ELBO (Shih et al., 2020).
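The benefit of exact expectations is easiest to see in a toy model whose discrete latent can be enumerated outright (selective-SPNs generalize this to structured, high-dimensional supports; the three-state example below is purely illustrative):

```python
import numpy as np

# Toy discrete latent model: z in {0, 1, 2}, one fixed observation x.
log_prior = np.log(np.array([0.5, 0.3, 0.2]))   # log p(z)
log_lik = np.log(np.array([0.1, 0.6, 0.3]))     # log p(x | z)

def exact_elbo(log_q):
    """ELBO computed by exact enumeration over z, with no sampling noise."""
    q = np.exp(log_q)
    return np.sum(q * (log_prior + log_lik - log_q))

# Sanity check: with q equal to the true posterior, ELBO = log p(x) exactly.
log_px = np.log(np.sum(np.exp(log_prior + log_lik)))
log_post = log_prior + log_lik - log_px
```

Because the expectation is a finite sum, the gradient of `exact_elbo` is also exact, so none of the estimator-variance machinery of the previous section is needed.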
6. Practical Implementation and Model Selection
During training, the negative ELBO serves as the objective for stochastic gradient methods. Its minimization encourages the variational family to approach the true posterior while maximizing the marginal likelihood lower bound. The variance properties of the chosen gradient estimator, the tractability of the variational family, and the potential for bound tightness (e.g., via importance weighting or surrogate bounds) directly impact practical learning dynamics (Yin et al., 2018, Dib, 2020, Daudel et al., 2024).
For model selection, penalized ELBO approaches have been shown to yield consistent estimators even under model misspecification, provided suitable prior mass conditions (Chérief-Abdellatif, 2018). Closed-form entropy decompositions enable more efficient post-training model diagnostics and facilitate interpretable control over different components of the inference objective (Damm et al., 2020, Lygerakis et al., 2024).
7. Geometric, Asymptotic, and Theoretical Perspectives
Recent work situates the negative ELBO as a Bregman divergence, specifically the divergence $D_A(\cdot, \cdot)$ generated by the exponential-family log-partition function $A$. This geometric perspective underpins rigorous convergence bounds for gradient-based algorithms, with convergence rates governed by spectral properties of the Fisher information matrix (Bohara et al., 17 Oct 2025).
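The underlying identity, that the KL divergence between exponential-family members equals a Bregman divergence of the log-partition function with swapped arguments, $\mathrm{KL}(p_{\lambda_1} \,\|\, p_{\lambda_2}) = D_A(\lambda_2, \lambda_1)$, can be verified numerically in the simplest case, the Bernoulli family (a minimal check, not the cited paper's full construction):

```python
import numpy as np

def A(lam):
    """Log-partition function of the Bernoulli family in natural parameters."""
    return np.logaddexp(0.0, lam)

def grad_A(lam):
    """Mean parameter: sigma(lam)."""
    return 1.0 / (1.0 + np.exp(-lam))

def bregman_A(lam2, lam1):
    """D_A(lam2, lam1) = A(lam2) - A(lam1) - A'(lam1) * (lam2 - lam1)."""
    return A(lam2) - A(lam1) - grad_A(lam1) * (lam2 - lam1)

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
```

Convexity of $A$ makes $D_A$ nonnegative, which is the geometric source of the convergence guarantees mentioned above.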
In entropy-sum formulations, the negative ELBO at stationary points can be fully characterized in terms of entropies and cross-entropies of the variational distribution, the prior, and the conditional model distribution. These results extend to generalized exponential families and remain valid under broad practical conditions (finite/infinite data, deep networks, saddle or local optima) (Damm et al., 2020, Lücke et al., 2022).
Selected references:
- Semi-Implicit Variational Inference (Yin et al., 2018)
- Quantized Variational Inference (Dib, 2020)
- On the Convergence of the ELBO to Entropy Sums (Lücke et al., 2022)
- The ELBO of Variational Autoencoders Converges to a Sum of Three Entropies (Damm et al., 2020)
- Fixing a Broken ELBO (Alemi et al., 2017)
- Geometric Convergence Analysis of Variational Inference via Bregman Divergences (Bohara et al., 17 Oct 2025)
- Learning with Importance Weighted Variational Inference (Daudel et al., 2024)
- Doubly Semi-Implicit Variational Inference (Molchanov et al., 2018)
- ED-VAE: Entropy Decomposition of ELBO in Variational Autoencoders (Lygerakis et al., 2024)
- Information Theoretic Lower Bounds on Negative Log Likelihood (Lastras, 2019)