KL Divergence Between Gaussians: A Step-by-Step Derivation for the Variational Autoencoder Objective

Published 13 Apr 2026 in cs.LG | (2604.11744v1)

Abstract: Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the discrepancy between two probability distributions. In the context of Variational Autoencoders (VAEs), it serves as a central regularization term, imposing structure on the latent space and thereby enabling the model to exhibit generative capabilities. In this work, we present a detailed derivation of the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we derive the expression for the univariate case and extend it to the multivariate setting under the assumption of diagonal covariance. Finally, we discuss the interpretation of each term in the resulting expression and its impact on the training dynamics of the model.

Authors (2)

Summary

  • The paper demonstrates a precise analytic derivation of the KL divergence between two multivariate Gaussians in the context of Variational Autoencoders.
  • It outlines key matrix properties, including trace operations and log-determinant manipulations, for efficient computation of the ELBO.
  • The derivation provides insights into VAE regularization and suggests future exploration of non-diagonal covariance formulations for richer representations.

Step-by-Step Derivation of KL Divergence Between Gaussians in the VAE Objective

Introduction

This paper provides a rigorous derivation of the Kullback--Leibler (KL) divergence between two multivariate Gaussian distributions, a computation foundational for the objective function of Variational Autoencoders (VAEs) (2604.11744). The KL divergence serves a critical role as a regularization term in VAEs, penalizing discrepancies between the learned approximate posterior and a chosen prior, typically the standard normal. This detailed exposition moves from the general information-theoretic definition of KL divergence to a closed-form solution suited for high-dimensional, diagonal covariance settings, directly applicable to scalable deep generative models.

KL Divergence: Definition and Properties

KL divergence quantifies how one probability distribution diverges from a second, reference distribution. For distributions P and Q over a space \mathcal{X} with densities p and q, it is defined as

D_{KL}(P \| Q) = \int_{\mathcal{X}} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx,

or, equivalently, as an expected log-density ratio under P. The divergence is non-negative, vanishes if and only if P = Q almost everywhere, and is inherently asymmetric (so it is not a metric). In practical contexts, such as the ELBO for VAEs, this term pushes the learned distribution over latent variables toward a prescribed analytic prior.
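The definition and its properties can be checked numerically for the univariate case. The sketch below (function names are illustrative, not from the paper) approximates the integral by a midpoint Riemann sum and compares it with the well-known univariate closed form, also exhibiting the asymmetry:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, n=200_000):
    """Midpoint Riemann-sum approximation of D_KL(P || Q) for univariate Gaussians."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = gauss_pdf(x, mu1, s1)
        if p > 0.0:
            total += p * math.log(p / gauss_pdf(x, mu2, s2)) * dx
    return total

def kl_closed(mu1, s1, mu2, s2):
    """Closed form: log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# The numeric and closed-form values agree, and swapping arguments changes the result:
num = kl_numeric(1.0, 1.5, 0.0, 1.0)
exact = kl_closed(1.0, 1.5, 0.0, 1.0)
rev = kl_closed(0.0, 1.0, 1.0, 1.5)
```

The truncation to [-20, 20] is harmless here because the Gaussian tails beyond that interval contribute negligibly to the integral.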

Explicit Derivation for Multivariate Gaussians

For P \sim \mathcal{N}(\mu_1, \Sigma_1) and Q \sim \mathcal{N}(\mu_2, \Sigma_2), the paper decomposes the KL divergence as

D_{KL}(P \| Q) = \frac{1}{2} \left[ \log\frac{\det \Sigma_2}{\det \Sigma_1} + \mathrm{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - d \right],

where d is the dimensionality and \Sigma_1, \Sigma_2 are positive definite covariance matrices. The derivation proceeds stepwise:

  • Term \log\frac{\det \Sigma_2}{\det \Sigma_1}: The log-determinant ratio quantifies scaling differences between covariances.
  • Terms \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) and (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1): The trace and quadratic form blend variance and mean displacement contributions.
  • Term -d: Subtracts an identity-normalization offset proportional to dimension. Linearity of expectation and trace properties are critical in simplifying matrix-expectation terms.
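The general closed form can be checked numerically. The following sketch (numpy; function names are illustrative) implements the term-by-term decomposition and verifies that identical distributions give zero divergence:

```python
import numpy as np

def kl_mvn(mu1, S1, mu2, S2):
    """D_KL(N(mu1, S1) || N(mu2, S2)) for positive-definite covariances S1, S2."""
    d = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    log_det = np.log(np.linalg.det(S2) / np.linalg.det(S1))  # log-determinant ratio
    trace = np.trace(S2_inv @ S1)                            # covariance interaction
    quad = diff @ S2_inv @ diff                              # Mahalanobis-style mean shift
    return 0.5 * (log_det + trace + quad - d)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
S = A @ A.T + 3 * np.eye(3)  # a random positive-definite covariance

# Identical distributions give (numerically) zero divergence:
zero = kl_mvn(np.zeros(3), S, np.zeros(3), S)
```

A second sanity check: for unit covariances and a unit mean shift in each of d = 3 dimensions, the formula reduces to half the squared Euclidean distance, i.e. 1.5.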

Specialization to VAE Regularization

In VAEs, the prior is routinely chosen as p(z) = \mathcal{N}(0, I), and the approximate posterior as q(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2)). Substitution yields a markedly simplified expression:

D_{KL}\left(q \,\|\, p\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right).

This form is fully analytic with respect to the encoder's outputs, guaranteeing computational efficiency for both evaluation and gradient-based optimization. The per-dimension variance and log-variance terms, inherited from the trace and log-determinant of the general case, make transparent the contributions of uncertainty and information compression enforced by the KL regularizer.
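In practice the encoder usually emits \mu and \log\sigma^2 per latent dimension. A minimal numpy sketch of the standard diagonal-Gaussian KL term as it appears in VAE losses (function name illustrative):

```python
import numpy as np

def vae_kl(mu, logvar):
    """D_KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Example encoder outputs for a 3-dimensional latent space:
mu = np.array([0.5, -1.0, 0.0])
logvar = np.array([0.0, -0.5, 0.2])
kl = vae_kl(mu, logvar)

# A posterior exactly matching the prior (mu = 0, logvar = 0) incurs zero penalty:
at_prior = vae_kl(np.zeros(3), np.zeros(3))
```

Parameterizing the encoder with \log\sigma^2 rather than \sigma keeps the variance positive by construction and makes the expression above numerically stable.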

Theoretical and Practical Implications

The closed-form KL divergence underpins the VAE objective's differentiability, ensuring that stochastic gradient descent remains feasible even in high dimensions and permitting rapid experimentation with architecture or distributional assumptions. The separation of mean and scale components in the loss function incentivizes both the centering of approximate posteriors and variance regularization, supporting disentanglement and robust generation.
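As a concrete illustration of this differentiability, the diagonal-Gaussian KL term has simple analytic gradients, \partial KL/\partial \mu_j = \mu_j and \partial KL/\partial \log\sigma_j^2 = (\exp(\log\sigma_j^2) - 1)/2, which a finite-difference check confirms. A minimal numpy sketch (function names illustrative):

```python
import numpy as np

def vae_kl(mu, logvar):
    """D_KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def vae_kl_grads(mu, logvar):
    """Analytic gradients of the diagonal-Gaussian KL term w.r.t. mu and logvar."""
    return mu, 0.5 * (np.exp(logvar) - 1.0)

mu = np.array([0.3, -0.7])
logvar = np.array([0.1, -0.4])
g_mu, g_lv = vae_kl_grads(mu, logvar)

# Forward finite difference on the first mean coordinate:
eps = 1e-6
mu_p = mu.copy()
mu_p[0] += eps
fd_mu = (vae_kl(mu_p, logvar) - vae_kl(mu, logvar)) / eps
```

Note the gradient with respect to \log\sigma_j^2 vanishes exactly at \sigma_j^2 = 1, i.e. at the prior variance, matching the regularizing pull described above.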

On the theoretical side, this derivation highlights the precise roles of covariance interactions and expected Mahalanobis distances in shaping the information-theoretic penalty imposed on the encoder. It clarifies that, beyond informal intuition, every parameterization of the latent posterior has concrete, quantitatively specified effects on model learning dynamics and generalization.

Future Directions

Extending this derivation to non-diagonal or structured covariance models could enable richer approximate posteriors, enhancing expressiveness while preserving analytic tractability of the KL term where possible. There are also opportunities to determine the effects of more flexible priors on regularization dynamics, analytic gradients, and convergence behavior. Understanding these subtleties is crucial for robustly advancing generative modeling, especially in domains requiring structured latent representations or hierarchical modeling.
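One such extension, not developed in the paper, can be sketched with a Cholesky-parameterized full covariance \Sigma = L L^\top: the KL to a standard normal prior stays analytic, since \mathrm{tr}(\Sigma) is the squared Frobenius norm of L and \log\det\Sigma = 2 \sum_i \log L_{ii}. A hypothetical numpy sketch (names illustrative):

```python
import numpy as np

def kl_full_cov_to_std_normal(mu, L):
    """D_KL(N(mu, L L^T) || N(0, I)) for lower-triangular L with positive diagonal."""
    d = mu.shape[0]
    trace = np.sum(L**2)                         # tr(L L^T) = squared Frobenius norm of L
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log det(L L^T)
    return 0.5 * (trace + mu @ mu - d - log_det)

# With L = I and mu = 0 the divergence vanishes, as expected:
zero = kl_full_cov_to_std_normal(np.zeros(2), np.eye(2))
```

When L is diagonal this reduces exactly to the diagonal-covariance expression from the VAE section, so the parameterization is a strict generalization.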

Conclusion

This paper delivers a detailed, stepwise account of the KL divergence between general multivariate Gaussians, including the special case foundational for VAE training objectives (2604.11744). By elucidating the derivation, it provides both theoretical clarity and practical tools for neural generative modeling, supporting principled latent space regularization and effective optimization.
