
Mean–KL Parameterization

Updated 25 February 2026
  • Mean–KL Parameterization is a framework that redefines probabilistic modeling by using mean parameters and explicit KL divergence to enhance model interpretability and robustness.
  • It reformulates conjugate priors and variational inference, eliminating costly annealing and streamlining hyperparameter selection in high-dimensional settings.
  • Natural gradient methods under this parameterization achieve invariant convergence rates, leading to faster optimization and improved stability in applications like neural network compression.

The Mean–KL parameterization is a framework for specifying probabilistic models, variational approximations, and optimization flows in terms of mean parameters and Kullback–Leibler (KL) divergence, rather than traditional variance- or precision-based coordinates. By parameterizing in terms of means and explicitly controlling KL divergence, this approach yields models and optimization routines with improved interpretability, tighter control over informational budgets, and often superior convergence or robustness properties. It has been formulated for constructing conjugate priors for multivariate normal models (Brümmer, 2021), for variational Bayesian neural network compression (Lin et al., 2023), and for understanding gradient flows in information geometry (Datar et al., 27 Apr 2025).

1. Core Formulation: Mean–KL in Exponential and Gaussian Families

In the context of exponential families, the Mean–KL parameterization exploits the duality between natural (canonical) parameters $\theta$ and mean (mixture) parameters $\eta$. For a distribution $p_\theta(x)$ of the form

$$p_\theta(x) = \exp\bigl(\langle \theta, T(x) \rangle - A(\theta)\bigr),$$

the dual mean parameter is $\eta = \mathbb{E}_{p_\theta}[T(x)]$. The KL divergence between $p_\theta$ and a reference $q = p_{\theta_q}$ admits two equivalent Bregman-divergence expressions, one in $\theta$ ("exponential" coordinates) and one in $\eta$ ("mean" coordinates):
$$D_{KL}(p_\theta \Vert q) = A(\theta_q) - A(\theta) - \langle \eta, \theta_q - \theta \rangle = A^*(\eta) - A^*(\eta_q) - \langle \theta_q, \eta - \eta_q \rangle,$$
where $A^*$ is the Legendre dual of the log-partition function $A$.
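The equivalence of the two Bregman forms can be checked numerically for the Bernoulli family, where $A(\theta) = \log(1 + e^\theta)$ and $A^*(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta)$. The sketch below is illustrative and not drawn from the cited papers:

```python
import numpy as np

def kl_bernoulli_three_ways(p, q):
    """KL(Bern(p) || Bern(q)) computed directly and via both Bregman forms."""
    # Natural parameters: theta = logit; log-partition A(theta) = log(1 + e^theta)
    theta_p, theta_q = np.log(p / (1 - p)), np.log(q / (1 - q))
    A = lambda th: np.log1p(np.exp(th))
    # Mean parameter: eta = p; Legendre dual A*(eta) = negative entropy
    A_star = lambda eta: eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

    direct = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    # Bregman divergence in exponential coordinates (generator A)
    breg_theta = A(theta_q) - A(theta_p) - p * (theta_q - theta_p)
    # Bregman divergence in mean coordinates (generator A*)
    breg_eta = A_star(p) - A_star(q) - theta_q * (p - q)
    return direct, breg_theta, breg_eta

d, bt, be = kl_bernoulli_three_ways(0.3, 0.7)
print(d, bt, be)  # all three agree
```

The same three-way agreement holds for any regular exponential family, since both Bregman expressions are coordinate representations of the same divergence.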

For multivariate Gaussians, the Mean–KL parameterization directly uses the KL divergence
$$D_{KL}\left(\mathcal{N}(\mu_0,\Sigma_0) \,\Vert\, \mathcal{N}(\mu,\Sigma)\right) = \frac{1}{2}\left[ \ln\frac{\det\Sigma}{\det\Sigma_0} - d + \mathrm{tr}(\Sigma^{-1}\Sigma_0) + (\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0) \right]$$
as an "energy function" for priors or variational distributions (Brümmer, 2021).
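This divergence is straightforward to compute directly from the formula above; a minimal numpy sketch (function name is ours, not from the cited work):

```python
import numpy as np

def gauss_kl(mu0, Sigma0, mu, Sigma):
    """KL( N(mu0, Sigma0) || N(mu, Sigma) ) for d-dimensional Gaussians."""
    d = mu0.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    diff = mu - mu0
    logdet = np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma0))
    return 0.5 * (logdet - d + np.trace(Sigma_inv @ Sigma0)
                  + diff @ Sigma_inv @ diff)

# Sanity check: KL of a distribution against itself is zero.
mu, S = np.zeros(3), np.eye(3)
print(gauss_kl(mu, S, mu, S))  # 0.0
```

In practice one would use Cholesky factors instead of `det`/`inv` for numerical stability in high dimensions; the direct form above mirrors the equation term by term.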

2. Construction of Conjugate Priors with Mean–KL Parameterization

In Bayesian models for the multivariate Gaussian, the traditional conjugate priors (Wishart for the precision; Normal–Wishart for unknown mean and covariance) suffer from difficulty in selecting hyperparameters and from pathological behavior in the non-informative limit ($\nu \downarrow d-1$). The Mean–KL parameterization instead defines priors as

$$\log P(\mu,\Sigma) = -\alpha\, D_{KL}\left(\mathcal{N}(\mu_0,\Sigma_0) \,\Vert\, \mathcal{N}(\mu,\Sigma)\right) + \mathrm{const}$$

with KL scale parameter $\alpha > 0$ interpreted as a "pseudocount." When $\mu$ is known, the induced prior on the precision $P = \Sigma^{-1}$ is Wishart,
$$P(P) = W\left(P \mid V = (\alpha\Sigma_0)^{-1},\ \nu = \alpha + d + 1\right),$$
with prior mode at $\Sigma_0^{-1}$. When both $\mu$ and $P$ are unknown, the resulting prior is Normal–Wishart, and $\alpha$ can be reduced to $0$ without violating the constraints on the degrees of freedom, guaranteeing a proper non-informative limit with an intuitive "mode + pseudocount" interpretation for all hyperparameters (Brümmer, 2021).
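The correspondence between the $-\alpha\,D_{KL}$ energy and the Wishart density can be verified numerically: the two log-densities should differ only by a $P$-independent normalizing constant. A sketch, assuming scipy's Wishart convention (density $\propto |X|^{(\nu-d-1)/2}\exp(-\mathrm{tr}(V^{-1}X)/2)$):

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
d, alpha = 3, 5.0
Sigma0 = np.eye(d)  # prior center; mode of the induced Wishart is Sigma0^{-1}

def neg_alpha_kl(P):
    """-alpha * KL( N(0, Sigma0) || N(0, P^{-1}) ); mu is known, so it drops out."""
    Sigma = np.linalg.inv(P)
    kl = 0.5 * (np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma0)) - d
                + np.trace(P @ Sigma0))
    return -alpha * kl

# Wishart prior from the text: V = (alpha * Sigma0)^{-1}, nu = alpha + d + 1
W = wishart(df=alpha + d + 1, scale=np.linalg.inv(alpha * Sigma0))

def rand_spd():
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

# The offset between the two log-densities is the same for any precision P.
P1, P2 = rand_spd(), rand_spd()
c1 = W.logpdf(P1) - neg_alpha_kl(P1)
c2 = W.logpdf(P2) - neg_alpha_kl(P2)
print(c1 - c2)  # ~0
```

That the offset is constant confirms the energy $-\alpha\,D_{KL}$ and the stated Wishart are the same distribution up to normalization.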

3. Variational Inference and Neural Compression: Mean–KL Parameterization

In the context of variational Bayesian neural networks and minimal random code learning (MIRACLE), the Mean–KL parameterization defines each Gaussian variational posterior $Q_w = \mathcal{N}(\mu_w, \sigma_w^2)$ in terms of

  • a mean-shift parameter $\tau_w$,
  • an information-quota parameter $\gamma_w$ with $\sum_w \gamma_w = 1$ within each weight block,
  • and a total per-block information budget $\kappa$, yielding the per-weight KL budget $\kappa_w = \gamma_w \kappa$.

The mean is $\mu_w = \nu + \rho\sqrt{2\kappa_w}\tanh(\tau_w)$, and the unique $\sigma_w^2$ satisfying $D_{KL}\left(\mathcal{N}(\mu_w, \sigma_w^2) \,\Vert\, \mathcal{N}(\nu, \rho^2)\right) = \kappa_w$ is given in closed form via the Lambert W function. This parameterization enforces the desired per-block KL constraint exactly, eliminating the costly penalty annealing required by mean–variance parameterized schemes (Lin et al., 2023).
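The closed-form variance can be sketched as follows. Writing $t = \sigma^2/\rho^2$, the budget constraint reduces to $t - \ln t = c$ with $c = 1 + 2\kappa - (\mu-\nu)^2/\rho^2$, which the Lambert W function solves. The choice of the principal branch here (giving $\sigma^2 \le \rho^2$) is an illustrative assumption; consult Lin et al. (2023) for the exact construction:

```python
import numpy as np
from scipy.special import lambertw

def sigma2_from_kl_budget(mu, nu, rho2, kappa):
    """Solve KL( N(mu, s2) || N(nu, rho2) ) = kappa for s2 in closed form.

    With t = s2 / rho2 the constraint reads t - ln t = c, where
    c = 1 + 2*kappa - (mu - nu)**2 / rho2, solved by t = -W(-exp(-c)).
    The principal branch W0 yields the root with s2 <= rho2
    (an illustrative branch choice).
    """
    c = 1.0 + 2.0 * kappa - (mu - nu) ** 2 / rho2
    t = -np.real(lambertw(-np.exp(-c), k=0))
    return t * rho2

# Check: the recovered variance meets the KL budget exactly.
mu, nu, rho2, kappa = 0.4, 0.0, 1.0, 0.5
s2 = sigma2_from_kl_budget(mu, nu, rho2, kappa)
kl = 0.5 * (np.log(rho2 / s2) - 1 + s2 / rho2 + (mu - nu) ** 2 / rho2)
print(abs(kl - kappa))  # ~0
```

Note that the tanh mean parameterization above guarantees $(\mu-\nu)^2 \le 2\kappa\rho^2$, hence $c \ge 1$, so the Lambert W argument always lies in the valid range $[-1/e, 0)$.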

Training optimizes the expected loss over $Q_w$, subject only to the constraints $\sum_w \gamma_w = 1$ and $\gamma_w \geq 0$ (enforced via a softmax). No explicit KL penalty appears in the loss; the KL budget is maintained by construction.

4. Optimization and Information Geometry: Gradient Dynamics under Mean–KL

In probabilistic machine learning, the choice of coordinate system (mean vs. exponential) has significant impact on the curvature of the loss landscape and, consequently, on the convergence of gradient-based methods. For minimizing $D_{KL}$:

  • In mean ($\eta$) coordinates, Euclidean gradient-descent flows $\eta_{k+1} = \eta_k - \alpha \nabla_\eta D_{KL}$ can achieve arbitrarily fast convergence under affine rescaling (the local Hessian can be made arbitrarily large).
  • In exponential ($\theta$) coordinates, the landscape can be made arbitrarily flat, resulting in slow gradient descent.
  • Natural gradient descent (NGD), which adapts the update to the Fisher–Rao metric (the Hessian of $A$), has update $\widetilde{\nabla}_\theta D_{KL} = \theta - \theta_q$; this sets the condition number to 1 and fixes the continuous-time convergence rate to 2, robust to affine reparameterizations (Datar et al., 27 Apr 2025).

The table summarizes dynamics under various parameterizations:

| Parameterization | Discrete Update | Continuous-time Convergence Rate |
| --- | --- | --- |
| $\theta$–GD (exponential) | $\theta_{k+1} = \theta_k - \alpha(\eta_k - \eta_q)$ | $<2$ |
| $\eta$–GD (mean) | $\eta_{k+1} = \eta_k - \alpha \nabla^2 A^*(\eta_k)(\eta_k - \eta_q)$ | $>2$ locally |
| NGD (in $\eta$) | $\eta_{k+1} = \eta_k + \alpha(\eta_q - \eta_k)$ | $2$ (invariant) |

Natural gradient admits the largest stable step sizes and is most robust to gradient noise, especially in discrete-time optimization (Datar et al., 27 Apr 2025).
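The scale dependence of plain gradient descent versus the invariance of NGD can be illustrated on a one-dimensional Gaussian family with known variance, where $A(\theta) = \sigma^2\theta^2/2$ and $\eta = \sigma^2\theta$. This is a toy sketch under those assumptions, not an experiment from the cited paper:

```python
import numpy as np

def run_updates(sigma2, alpha=0.1, steps=50):
    """Compare theta-GD with NGD for a 1D Gaussian family where
    A(theta) = sigma2 * theta**2 / 2, hence eta = sigma2 * theta."""
    theta_q = 1.0                      # target natural parameter
    eta_q = sigma2 * theta_q           # target mean parameter

    theta = 0.0
    for _ in range(steps):             # theta-GD: gradient is eta - eta_q
        theta -= alpha * (sigma2 * theta - eta_q)

    eta = 0.0
    for _ in range(steps):             # NGD in eta: pure contraction toward eta_q
        eta += alpha * (eta_q - eta)

    return abs(theta - theta_q), abs(eta - eta_q) / abs(eta_q)

# theta-GD convergence depends on the (arbitrary) scale sigma2; NGD does not.
for s2 in (0.1, 1.0, 5.0):
    gd_err, ngd_rel_err = run_updates(s2)
    print(s2, gd_err, ngd_rel_err)
```

The NGD relative error is $(1-\alpha)^{k}$ regardless of $\sigma^2$, while the $\theta$-GD error $(1-\alpha\sigma^2)^{k}$ changes with the coordinate scale and diverges once $\alpha\sigma^2 > 2$.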

5. Empirical Outcomes and Practical Implications

In Bayesian neural network compression (“Minimal Random Code Learning”):

  • Training with Mean–KL halves convergence time compared to the mean–variance parameterization while achieving comparable or superior final test accuracy, especially at high compression (test error $\approx 0.8\%$ at $500\times$–$1000\times$ compression) (Lin et al., 2023).
  • KL is enforced exactly, with no need for annealing.
  • Variational posteriors under Mean–KL are characterized by heavier tails and more spread means (resembling Student-t or Laplace), contrasted with tight mean clumping and uniform high variance in Mean–Var posteriors.
  • Mean–KL compressed models demonstrate higher robustness to random and informed (KL-based) weight pruning, retaining up to 90% accuracy even after pruning 90% of weights, whereas baseline methods' performance collapses under aggressive pruning.

In Bayesian estimation for Gaussian models, Mean–KL parameterization allows for a transparent interpretation of hyperparameters and a well-defined non-informative prior limit in both the Wishart and Normal–Wishart cases. This avoids the conventional challenge of Wishart shape-parameter selection and ensures that maximum a posteriori estimates recover maximum likelihood in the limit $\alpha \to 0$ (Brümmer, 2021).

6. Theoretical and Operational Advantages

Key advantages of the Mean–KL parameterization include:

  • Direct, interpretable hyperparameters (“mode + pseudocount”) for priors and variational distributions (Brümmer, 2021).
  • Exact enforcement of global or per-block KL budgets by construction in variational inference (Lin et al., 2023).
  • Elimination of unstable or resource-intensive annealing procedures.
  • Robustness under change of parameter scaling, with natural gradient methods maintaining invariant convergence rates across affine reparameterizations (Datar et al., 27 Apr 2025).
  • MAP solutions coincide with MLEs in the non-informative prior limit, aligning with objective Bayesian principles (Brümmer, 2021).

A plausible implication is that Mean–KL parameterization substantially simplifies both the analytical and practical aspects of model specification and optimization in high-dimensional probabilistic modeling.

7. Summary and Outlook

The Mean–KL parameterization unifies information-theoretic, geometric, and Bayesian perspectives by specifying models and variational families in terms of means and KL divergence to a reference distribution. It yields priors and inference schemes with more transparent hyperparameter semantics, well-posed non-informative limits, and, when combined with natural gradient methods, provably robust and fast optimization. Empirical evidence in neural coding and Bayesian estimation supports these claims (Brümmer, 2021, Lin et al., 2023, Datar et al., 27 Apr 2025). The approach is broadly applicable to exponential family models and suggests a direction for further research at the intersection of information geometry, variational inference, and scalable Bayesian computation.
