
Mean–KL Parameterization

Updated 25 February 2026
  • Mean–KL Parameterization is a framework that redefines probabilistic modeling by using mean parameters and explicit KL divergence to enhance model interpretability and robustness.
  • It reformulates conjugate priors and variational inference, eliminating costly annealing and streamlining hyperparameter selection in high-dimensional settings.
  • Natural gradient methods under this parameterization achieve invariant convergence rates, leading to faster optimization and improved stability in applications like neural network compression.

The Mean–KL parameterization is a framework for specifying probabilistic models, variational approximations, and optimization flows in terms of mean parameters and Kullback–Leibler (KL) divergence, rather than traditional variance- or precision-based coordinates. By parameterizing in terms of means and explicitly controlling KL divergence, this approach yields models and optimization routines with improved interpretability, tighter control over informational budgets, and often superior convergence or robustness properties. It has been formulated for constructing conjugate priors for multivariate normal models (Brümmer, 2021), for variational Bayesian neural network compression (Lin et al., 2023), and for understanding gradient flows in information geometry (Datar et al., 27 Apr 2025).

1. Core Formulation: Mean–KL in Exponential and Gaussian Families

In the context of exponential families, the Mean–KL parameterization exploits the duality between natural (canonical) parameters $\theta$ and mean (mixture) parameters $\eta$. For a distribution $p_\theta(x)$ of the form

$$p_\theta(x) = \exp\bigl(\langle \theta, T(x) \rangle - A(\theta)\bigr),$$

the dual mean parameter is $\eta = \mathbb{E}_{p_\theta}[T(x)]$. The KL divergence between $p_\theta$ and a reference $q = p_{\theta_q}$ admits two equivalent Bregman-divergence expressions, one in $\theta$ ("exponential" coordinates) and one in $\eta$ ("mean" coordinates):
$$D_{KL}(p_\theta \Vert q) = A(\theta_q) - A(\theta) - \langle \eta, \theta_q - \theta \rangle = A^*(\eta) - A^*(\eta_q) - \langle \theta_q, \eta - \eta_q \rangle,$$
where $A^*$ is the Legendre dual of the log-partition function $A$.
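The equivalence of the two Bregman forms can be checked numerically for the Bernoulli family, where $A(\theta) = \log(1 + e^\theta)$ and $A^*(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta)$. The sketch below is illustrative and not drawn from the cited papers:

```python
import numpy as np

def kl_bernoulli_three_ways(p, q):
    """KL(Bern(p) || Bern(q)) computed directly and via both Bregman forms."""
    # Natural parameters: theta = logit; log-partition A(theta) = log(1 + e^theta)
    theta_p, theta_q = np.log(p / (1 - p)), np.log(q / (1 - q))
    A = lambda th: np.log1p(np.exp(th))
    # Mean parameter: eta = p; Legendre dual A*(eta) = negative entropy
    A_star = lambda eta: eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

    direct = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    # Bregman divergence in exponential coordinates (generator A)
    breg_theta = A(theta_q) - A(theta_p) - p * (theta_q - theta_p)
    # Bregman divergence in mean coordinates (generator A*)
    breg_eta = A_star(p) - A_star(q) - theta_q * (p - q)
    return direct, breg_theta, breg_eta

d, bt, be = kl_bernoulli_three_ways(0.3, 0.7)
print(d, bt, be)  # all three agree
```

The same three-way agreement holds for any regular exponential family, since both Bregman expressions are coordinate representations of the same divergence.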

For multivariate Gaussians, the Mean–KL parameterization directly uses the KL divergence
$$D_{KL}\left(\mathcal{N}(\mu_0,\Sigma_0) \,\Vert\, \mathcal{N}(\mu,\Sigma)\right) = \frac{1}{2}\left[ \ln\frac{\det\Sigma}{\det\Sigma_0} - d + \mathrm{tr}(\Sigma^{-1}\Sigma_0) + (\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0) \right]$$
as an "energy function" for priors or variational distributions (Brümmer, 2021).
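This divergence is straightforward to compute directly from the formula above; a minimal numpy sketch (function name is ours, not from the cited work):

```python
import numpy as np

def gauss_kl(mu0, Sigma0, mu, Sigma):
    """KL( N(mu0, Sigma0) || N(mu, Sigma) ) for d-dimensional Gaussians."""
    d = mu0.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    diff = mu - mu0
    logdet = np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma0))
    return 0.5 * (logdet - d + np.trace(Sigma_inv @ Sigma0)
                  + diff @ Sigma_inv @ diff)

# Sanity check: KL of a distribution against itself is zero.
mu, S = np.zeros(3), np.eye(3)
print(gauss_kl(mu, S, mu, S))  # 0.0
```

In practice one would use Cholesky factors instead of `det`/`inv` for numerical stability in high dimensions; the direct form above mirrors the equation term by term.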

2. Construction of Conjugate Priors with Mean–KL Parameterization

In Bayesian models for the multivariate Gaussian, the traditional conjugate priors (Wishart for the precision; Normal–Wishart for unknown mean and covariance) suffer from difficulty in selecting hyperparameters and from pathological behavior in the non-informative limit ($\nu \downarrow d-1$). The Mean–KL parameterization instead defines priors as

$$\log P(\mu,\Sigma) = -\alpha\, D_{KL}\left(\mathcal{N}(\mu_0,\Sigma_0) \,\Vert\, \mathcal{N}(\mu,\Sigma)\right) + \mathrm{const}$$

with KL scale parameter $\alpha > 0$ interpreted as a "pseudocount." When $\mu$ is known, the induced prior on the precision $P = \Sigma^{-1}$ is Wishart,
$$P(P) = W\left(P \mid V = (\alpha\Sigma_0)^{-1},\ \nu = \alpha + d + 1\right),$$
with prior mode at $\Sigma_0^{-1}$. When both $\mu$ and $P$ are unknown, the resulting prior is Normal–Wishart, and $\alpha$ can be reduced to $0$ without violating the constraints on the degrees of freedom, guaranteeing a proper non-informative limit with an intuitive "mode + pseudocount" interpretation for all hyperparameters (Brümmer, 2021).
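The correspondence between the $-\alpha\,D_{KL}$ energy and the Wishart density can be verified numerically: the two log-densities should differ only by a $P$-independent normalizing constant. A sketch, assuming scipy's Wishart convention (density $\propto |X|^{(\nu-d-1)/2}\exp(-\mathrm{tr}(V^{-1}X)/2)$):

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
d, alpha = 3, 5.0
Sigma0 = np.eye(d)  # prior center; mode of the induced Wishart is Sigma0^{-1}

def neg_alpha_kl(P):
    """-alpha * KL( N(0, Sigma0) || N(0, P^{-1}) ); mu is known, so it drops out."""
    Sigma = np.linalg.inv(P)
    kl = 0.5 * (np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma0)) - d
                + np.trace(P @ Sigma0))
    return -alpha * kl

# Wishart prior from the text: V = (alpha * Sigma0)^{-1}, nu = alpha + d + 1
W = wishart(df=alpha + d + 1, scale=np.linalg.inv(alpha * Sigma0))

def rand_spd():
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

# The offset between the two log-densities is the same for any precision P.
P1, P2 = rand_spd(), rand_spd()
c1 = W.logpdf(P1) - neg_alpha_kl(P1)
c2 = W.logpdf(P2) - neg_alpha_kl(P2)
print(c1 - c2)  # ~0
```

That the offset is constant confirms the energy $-\alpha\,D_{KL}$ and the stated Wishart are the same distribution up to normalization.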

3. Variational Inference and Neural Compression: Mean–KL Parameterization

In the context of variational Bayesian neural networks and minimal random code learning (MIRACLE), the Mean–KL parameterization defines each Gaussian variational posterior $Q_w = \mathcal{N}(\mu_w, \sigma_w^2)$ in terms of

  • a mean-shift parameter $\tau_w$,
  • an information-quota parameter $\gamma_w$ with $\sum_w \gamma_w = 1$ within each weight block,
  • and a total per-block information budget $\kappa$, yielding the per-weight KL budget $\kappa_w = \gamma_w \kappa$.

The mean is $\mu_w = \nu + \rho\sqrt{2\kappa_w}\tanh(\tau_w)$, and the unique $\sigma_w^2$ satisfying $D_{KL}\left(\mathcal{N}(\mu_w, \sigma_w^2) \,\Vert\, \mathcal{N}(\nu, \rho^2)\right) = \kappa_w$ is given in closed form via the Lambert W function. This parameterization enforces the desired per-block KL constraint exactly, eliminating the costly penalty annealing required by mean–variance parameterized schemes (Lin et al., 2023).
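The closed-form variance can be sketched as follows. Writing $t = \sigma^2/\rho^2$, the budget constraint reduces to $t - \ln t = c$ with $c = 1 + 2\kappa - (\mu-\nu)^2/\rho^2$, which the Lambert W function solves. The choice of the principal branch here (giving $\sigma^2 \le \rho^2$) is an illustrative assumption; consult Lin et al. (2023) for the exact construction:

```python
import numpy as np
from scipy.special import lambertw

def sigma2_from_kl_budget(mu, nu, rho2, kappa):
    """Solve KL( N(mu, s2) || N(nu, rho2) ) = kappa for s2 in closed form.

    With t = s2 / rho2 the constraint reads t - ln t = c, where
    c = 1 + 2*kappa - (mu - nu)**2 / rho2, solved by t = -W(-exp(-c)).
    The principal branch W0 yields the root with s2 <= rho2
    (an illustrative branch choice).
    """
    c = 1.0 + 2.0 * kappa - (mu - nu) ** 2 / rho2
    t = -np.real(lambertw(-np.exp(-c), k=0))
    return t * rho2

# Check: the recovered variance meets the KL budget exactly.
mu, nu, rho2, kappa = 0.4, 0.0, 1.0, 0.5
s2 = sigma2_from_kl_budget(mu, nu, rho2, kappa)
kl = 0.5 * (np.log(rho2 / s2) - 1 + s2 / rho2 + (mu - nu) ** 2 / rho2)
print(abs(kl - kappa))  # ~0
```

Note that the tanh mean parameterization above guarantees $(\mu-\nu)^2 \le 2\kappa\rho^2$, hence $c \ge 1$, so the Lambert W argument always lies in the valid range $[-1/e, 0)$.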

Training optimizes the expected loss over $Q_w$, subject only to the constraints $\sum_w \gamma_w = 1$ and $\gamma_w \geq 0$ (enforced via a softmax). No explicit KL penalty appears in the loss; the KL budget is maintained by construction.

4. Optimization and Information Geometry: Gradient Dynamics under Mean–KL

In probabilistic machine learning, the choice of coordinate system (mean vs. exponential) has significant impact on the curvature of the loss landscape and, consequently, on the convergence of gradient-based methods. For minimizing $D_{KL}$:

  • In mean ($\eta$) coordinates, Euclidean gradient-descent flows $\eta_{k+1} = \eta_k - \alpha \nabla_\eta D_{KL}$ can achieve arbitrarily fast convergence under affine rescaling (the local Hessian can be made arbitrarily large).
  • In exponential ($\theta$) coordinates, the landscape can be made arbitrarily flat, resulting in slow gradient descent.
  • Natural gradient descent (NGD), which adapts the update to the Fisher–Rao metric (the Hessian of $A$), has update $\widetilde{\nabla}_\theta D_{KL} = \theta - \theta_q$; this sets the condition number to 1 and fixes the continuous-time convergence rate to 2, robust to affine reparameterizations (Datar et al., 27 Apr 2025).

The table summarizes dynamics under various parameterizations:

| Parameterization | Discrete Update | Continuous-time Convergence Rate |
| --- | --- | --- |
| $\theta$–GD (exponential) | $\theta_{k+1} = \theta_k - \alpha(\eta_k - \eta_q)$ | $<2$ |
| $\eta$–GD (mean) | $\eta_{k+1} = \eta_k - \alpha \nabla^2 A^*(\eta_k)(\eta_k - \eta_q)$ | $>2$ locally |
| NGD (in $\eta$) | $\eta_{k+1} = \eta_k + \alpha(\eta_q - \eta_k)$ | $2$ (invariant) |

Natural gradient admits the largest stable step sizes and is most robust to gradient noise, especially in discrete-time optimization (Datar et al., 27 Apr 2025).
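The scale dependence of plain gradient descent versus the invariance of NGD can be illustrated on a one-dimensional Gaussian family with known variance, where $A(\theta) = \sigma^2\theta^2/2$ and $\eta = \sigma^2\theta$. This is a toy sketch under those assumptions, not an experiment from the cited paper:

```python
import numpy as np

def run_updates(sigma2, alpha=0.1, steps=50):
    """Compare theta-GD with NGD for a 1D Gaussian family where
    A(theta) = sigma2 * theta**2 / 2, hence eta = sigma2 * theta."""
    theta_q = 1.0                      # target natural parameter
    eta_q = sigma2 * theta_q           # target mean parameter

    theta = 0.0
    for _ in range(steps):             # theta-GD: gradient is eta - eta_q
        theta -= alpha * (sigma2 * theta - eta_q)

    eta = 0.0
    for _ in range(steps):             # NGD in eta: pure contraction toward eta_q
        eta += alpha * (eta_q - eta)

    return abs(theta - theta_q), abs(eta - eta_q) / abs(eta_q)

# theta-GD convergence depends on the (arbitrary) scale sigma2; NGD does not.
for s2 in (0.1, 1.0, 5.0):
    gd_err, ngd_rel_err = run_updates(s2)
    print(s2, gd_err, ngd_rel_err)
```

The NGD relative error is $(1-\alpha)^{k}$ regardless of $\sigma^2$, while the $\theta$-GD error $(1-\alpha\sigma^2)^{k}$ changes with the coordinate scale and diverges once $\alpha\sigma^2 > 2$.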

5. Empirical Outcomes and Practical Implications

In Bayesian neural network compression (“Minimal Random Code Learning”):

  • Training with Mean–KL halves convergence time compared to the mean–variance parameterization while achieving comparable or superior final test accuracy, especially at high compression (test error $\approx 0.8\%$ at $500\times$–$1000\times$ compression) (Lin et al., 2023).
  • KL is enforced exactly, with no need for annealing.
  • Variational posteriors under Mean–KL are characterized by heavier tails and more spread means (resembling Student-t or Laplace), contrasted with tight mean clumping and uniform high variance in Mean–Var posteriors.
  • Mean–KL compressed models demonstrate higher robustness to random and informed (KL-based) weight pruning, retaining up to 90% accuracy even after pruning 90% of weights, whereas baseline methods' performance collapses under aggressive pruning.

In Bayesian estimation for Gaussian models, Mean–KL parameterization allows for a transparent interpretation of hyperparameters and a well-defined non-informative prior limit in both the Wishart and Normal–Wishart cases. This avoids the conventional challenge of Wishart shape-parameter selection and ensures that maximum a posteriori estimates recover maximum likelihood in the limit $\alpha \to 0$ (Brümmer, 2021).

6. Theoretical and Operational Advantages

Key advantages of the Mean–KL parameterization include:

  • Direct, interpretable hyperparameters (“mode + pseudocount”) for priors and variational distributions (Brümmer, 2021).
  • Exact enforcement of global or per-block KL budgets by construction in variational inference (Lin et al., 2023).
  • Elimination of unstable or resource-intensive annealing procedures.
  • Robustness under change of parameter scaling, with natural gradient methods maintaining invariant convergence rates across affine reparameterizations (Datar et al., 27 Apr 2025).
  • MAP solutions coincide with MLEs in the non-informative prior limit, aligning with objective Bayesian principles (Brümmer, 2021).

A plausible implication is that Mean–KL parameterization substantially simplifies both the analytical and practical aspects of model specification and optimization in high-dimensional probabilistic modeling.

7. Summary and Outlook

The Mean–KL parameterization unifies information-theoretic, geometric, and Bayesian perspectives by specifying models and variational families in terms of means and KL divergence to a reference distribution. It yields priors and inference schemes with more transparent hyperparameter semantics, well-posed non-informative limits, and, when combined with natural gradient methods, provably robust and fast optimization. Empirical evidence in neural coding and Bayesian estimation supports these claims (Brümmer, 2021, Lin et al., 2023, Datar et al., 27 Apr 2025). The approach is broadly applicable to exponential family models and suggests a direction for further research at the intersection of information geometry, variational inference, and scalable Bayesian computation.
