
Modified Symmetric KL Divergence (MSKL)

Updated 20 December 2025
  • Modified Symmetric KL (MSKL) divergence is a tractable measure that combines forward KL with a learned proxy for reverse KL to address sample-based estimation challenges.
  • It employs a normalizing flow as the main model and an energy-based proxy to facilitate joint optimization in a constrained, adaptive framework.
  • Empirical results demonstrate robust training with stable convergence, effective mode recovery, and competitive performance in density estimation and image generation.

The Modified Symmetric Kullback-Leibler (MSKL) divergence is a divergence measure designed to enable tractable, symmetric training of probabilistic models from samples. It combines the forward Kullback-Leibler (KL) divergence with a learned proxy for the reverse KL, which is intractable when the data-generating distribution is available only through samples. By introducing a flexible yet tractable proxy model, MSKL allows both forward and reverse KL terms to be optimized jointly in a constrained, adaptive fashion. The framework provides a learned, data-driven alternative to fixed-weight symmetric divergences and to adversarial min-max formulations in generative modeling, and has demonstrated broad empirical utility across density estimation, image generation, and simulation-based inference (Ben-Dov et al., 14 Nov 2025).

1. Mathematical Definition and Motivation

Let $\pi$ denote the target data distribution, $p_\theta$ the main generative model (parameterized, for example, as a normalizing flow), and $q_\psi$ a proxy (auxiliary) model. The classic symmetric divergence to minimize is the Jeffreys divergence:

$$D_{J}(\pi, p_\theta) = D_{\mathrm{KL}}(\pi \Vert p_\theta) + D_{\mathrm{KL}}(p_\theta \Vert \pi)$$

In many applications, particularly in high dimensions, only samples from $\pi$ are accessible, so estimating the reverse KL $D_{\mathrm{KL}}(p_\theta \Vert \pi)$ is generally infeasible.

The MSKL divergence addresses this by introducing a proxy model $q_\psi$, resulting in the divergence

$$D_{\mathrm{MSKL}}(p_\theta, q_\psi) = D_{\mathrm{KL}}(\pi \Vert p_\theta) + D_{\mathrm{KL}}(p_\theta \Vert q_\psi)$$

In this construction, $q_\psi$ is trained both to fit $\pi$ and to serve as a tractable target for the reverse KL term from $p_\theta$. As $q_\psi$ is adaptively aligned to $\pi$, the second term becomes a faithful surrogate for the intractable $D_{\mathrm{KL}}(p_\theta \Vert \pi)$.
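
Both terms of the MSKL objective admit straightforward sample-based estimators. The following is a minimal sketch of the standard Monte Carlo forms, assuming minibatches $\{x_i\}_{i=1}^N \sim \pi$ and $\{y_j\}_{j=1}^M \sim p_\theta$ (the notation here is illustrative, not quoted from the paper):

$$
\begin{aligned}
D_{\mathrm{KL}}(\pi \Vert p_\theta) &= \mathbb{E}_{x \sim \pi}\!\left[\log \pi(x) - \log p_\theta(x)\right]
\;\approx\; -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i) + \mathrm{const.}, & x_i &\sim \pi, \\
D_{\mathrm{KL}}(p_\theta \Vert q_\psi) &= \mathbb{E}_{y \sim p_\theta}\!\left[\log p_\theta(y) - \log q_\psi(y)\right]
\;\approx\; \frac{1}{M}\sum_{j=1}^{M} \left[\log p_\theta(y_j) - \log q_\psi(y_j)\right], & y_j &\sim p_\theta.
\end{aligned}
$$

The first estimator requires only samples from $\pi$ (the entropy of $\pi$ is a constant in $\theta$, so minimizing it reduces to negative log-likelihood), while the second requires only samples and log-densities from the models themselves, which is exactly what the NF/EBM parameterization below provides.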

2. Model Architecture: Main Model and Proxy

The utility of MSKL derives from the complementary parameterizations of the main and proxy models:

  • Main model $p_\theta$: implemented as a normalizing flow, ensuring that $\log p_\theta(x)$ and exact i.i.d. sampling are tractable. Its optimization targets are the forward KL divergence to $\pi$ and the reverse KL to $q_\psi$.
  • Proxy model $q_\psi$: implemented as an energy-based model with $q_\psi(x) \propto \exp(f_\psi(x))$, gaining representational expressivity at the cost of having to estimate a normalization constant. The proxy is optimized both to fit $\pi$ through forward KL and to act as the reverse KL anchor for $p_\theta$.

By constraining $q_\psi$ to remain close to $\pi$, $D_{\mathrm{KL}}(p_\theta \Vert q_\psi)$ effectively acts as a practical approximation to $D_{\mathrm{KL}}(p_\theta \Vert \pi)$.

| Model | Parameterization | Role |
| --- | --- | --- |
| $p_\theta$ | Normalizing flow (NF) | Approximates $\pi$ and samples; minimizes $D_{\mathrm{KL}}(\pi \Vert p_\theta)$ and $D_{\mathrm{KL}}(p_\theta \Vert q_\psi)$ |
| $q_\psi$ | Energy-based model (EBM) | Fits $\pi$ and provides the $D_{\mathrm{KL}}(p_\theta \Vert q_\psi)$ target |
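
To make the division of labor concrete, here is a minimal, self-contained NumPy sketch of the interfaces the framework relies on: a main model with tractable log-density and sampling (a diagonal Gaussian stands in for a normalizing flow) and an unnormalized energy-based proxy whose $\log Z$ is estimated by importance sampling with the main model as proposal. All class and function names, and the toy data, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianFlow:
    """Stand-in for a normalizing flow: exact log-density and i.i.d. sampling."""
    def __init__(self, dim):
        self.mu = np.zeros(dim)
        self.log_sigma = np.zeros(dim)

    def sample(self, n):
        sigma = np.exp(self.log_sigma)
        return self.mu + sigma * rng.standard_normal((n, len(self.mu)))

    def log_prob(self, x):
        sigma = np.exp(self.log_sigma)
        z = (x - self.mu) / sigma
        return -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1) - np.sum(self.log_sigma)

class QuadraticEBM:
    """Toy energy-based proxy q_psi(x) ∝ exp(f_psi(x)) with a quadratic energy."""
    def __init__(self, dim):
        self.a = np.zeros(dim)         # linear term
        self.b = -0.5 * np.ones(dim)   # negative quadratic term keeps f bounded above

    def f(self, x):
        return x @ self.a + (x**2) @ self.b

    def log_Z(self, flow, n=1024):
        # Importance sampling with the flow as proposal:
        # log Z ≈ LogSumExp(f(y_j) - log p_theta(y_j)) - log n,  y_j ~ p_theta
        y = flow.sample(n)
        logw = self.f(y) - flow.log_prob(y)
        m = logw.max()
        return m + np.log(np.mean(np.exp(logw - m)))

    def log_prob(self, x, flow):
        return self.f(x) - self.log_Z(flow)

# Monte Carlo MSKL terms, following the estimators in Section 1
flow, ebm = GaussianFlow(2), QuadraticEBM(2)
x = rng.standard_normal((256, 2)) * 1.5 + 1.0       # pretend samples from pi
y = flow.sample(256)                                # samples from p_theta
forward_kl_surrogate = -np.mean(flow.log_prob(x))   # NLL = D_KL(pi||p_theta) + const
reverse_kl_to_proxy = np.mean(flow.log_prob(y) - ebm.log_prob(y, flow))
print(forward_kl_surrogate, reverse_kl_to_proxy)
```

The key asymmetry is visible here: the flow exposes exact `log_prob` and `sample`, while the EBM exposes only an unnormalized energy and must estimate its partition function, which the flow's samples make cheap.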

3. Constrained Optimization Formulation

The MSKL training objective is formalized using constrained optimization to flexibly balance the contributions of each divergence term. Two principal formulations are introduced:

Proxy-Only Constrained Problem (P-Proxy):

$$
\begin{aligned}
\min_{\theta, \psi}\quad & D_{\mathrm{KL}}(\pi \Vert p_\theta) + D_{\mathrm{KL}}(p_\theta \Vert q_\psi) \\
\text{subject to}\quad & D_{\mathrm{KL}}(\pi \Vert q_\psi) \leq \varepsilon, \qquad h(p_\theta, q_\psi) \leq c
\end{aligned}
$$

Adaptive (Resilient) Symmetrization (P-DYN):

Introducing slack variables $u_f, u_r, u_p \geq 0$,

$$
\begin{aligned}
\min_{\theta, \psi,\, u_f, u_r, u_p \geq 0}\quad & u_f^2 + u_r^2 + u_p^2 \\
\text{subject to}\quad & D_{\mathrm{KL}}(\pi \Vert p_\theta) \leq u_f, \\
& D_{\mathrm{KL}}(p_\theta \Vert q_\psi) \leq u_r, \\
& D_{\mathrm{KL}}(\pi \Vert q_\psi) \leq u_p, \\
& h(p_\theta, q_\psi) \leq c
\end{aligned}
$$

This adaptive approach replaces fixed trade-off weights with learnable slack variables, dynamically adjusting the relative strictness of each divergence constraint throughout optimization.

Dual variables $(\lambda_f, \lambda_r, \lambda_p, \lambda_h)$ are introduced to form the Lagrangian, resulting in an empirical dual problem

$$\max_{\lambda \geq 0} \min_{\theta, \psi, u \geq 0} L(\theta, \psi, u, \lambda)$$

where $L$ aggregates the main objective and the constraint violations.
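
For concreteness, a plausible explicit form of this Lagrangian for the adaptive problem (P-DYN), written out here from the constraints above rather than quoted from the paper, is

$$
\begin{aligned}
L(\theta, \psi, u, \lambda) \;=\; & u_f^2 + u_r^2 + u_p^2 \\
& + \lambda_f \big( D_{\mathrm{KL}}(\pi \Vert p_\theta) - u_f \big)
  + \lambda_r \big( D_{\mathrm{KL}}(p_\theta \Vert q_\psi) - u_r \big) \\
& + \lambda_p \big( D_{\mathrm{KL}}(\pi \Vert q_\psi) - u_p \big)
  + \lambda_h \big( h(p_\theta, q_\psi) - c \big),
\end{aligned}
$$

with each multiplier $\lambda_i \geq 0$ penalizing violation of its constraint; in practice the KL terms are replaced by their minibatch estimators.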

4. Training Algorithm

Optimization is performed by primal–dual gradient descent–ascent (GDA), alternating between gradient-descent updates of the primal variables $(\theta, \psi, u)$ and gradient-ascent updates of the multipliers $\lambda$. The procedure enforces non-negativity of the slack and dual variables through projection:

Input: learning rates α_θ, α_ψ, α_u, α_λ
Initialize θ, ψ, u = {u_f, u_r, u_p} ≥ 0, λ = {λ_f, λ_r, λ_p, λ_h} ≥ 0
repeat
    # 1) Sample minibatch {x_i} from π and {y_j} from p_θ
    #    For EBMs, estimate log Z of q_ψ by importance sampling:
    #    y_j ∼ p_θ; log Z ≈ LogSumExp(f_ψ(y_j) − log p_θ(y_j)) − log M
    # 2) Compute L(θ, ψ, u, λ) and gradients
    θ ← θ − α_θ · ∇_θ L
    ψ ← ψ − α_ψ · ∇_ψ L
    u ← max(0, u − α_u · ∇_u L)
    λ ← max(0, λ + α_λ · ∇_λ L)
until convergence

A closed-form update yields $u_i = \tfrac{1}{2}\lambda_i$ at stationary points. In practice, forward KL terms are estimated via empirical averages (e.g., negative log-likelihood) over minibatches.
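
Below is a minimal runnable sketch of this primal–dual GDA loop on a one-dimensional toy problem. To keep it self-contained and short it swaps the EBM proxy for a Gaussian proxy so every KL term is available in closed form, uses finite-difference gradients, and omits the $h \leq c$ constraint; all hyperparameters, the toy target, and the variable names are illustrative assumptions, not the paper's settings. What it does show faithfully is the update structure: descent on $(\theta, \psi)$, projected descent on the slacks, projected ascent on the multipliers, with $u_i \to \tfrac{1}{2}\lambda_i$ at stationarity.

```python
import numpy as np

def kl_gauss(m1, ls1, m2, ls2):
    # KL( N(m1, exp(ls1)^2) || N(m2, exp(ls2)^2) ) in closed form
    s1, s2 = np.exp(ls1), np.exp(ls2)
    return ls2 - ls1 + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Toy target pi = N(1.0, 0.5^2). In the real setting pi is available only
# through samples and D_KL(pi || p_theta) is replaced by a minibatch NLL.
PI = (1.0, np.log(0.5))

def lagrangian(theta, psi, u, lam):
    # L = u_f^2 + u_r^2 + u_p^2 + sum_i lam_i * (D_i - u_i);  h <= c omitted here
    d = np.array([
        kl_gauss(*PI, *theta),     # D_KL(pi      || p_theta)
        kl_gauss(*theta, *psi),    # D_KL(p_theta || q_psi)
        kl_gauss(*PI, *psi),       # D_KL(pi      || q_psi)
    ])
    return np.sum(u**2) + np.dot(lam, d - u), d

def num_grad(f, x, eps=1e-5):
    # central finite-difference gradient (keeps the sketch dependency-free)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

theta = np.array([-1.0, 0.0])   # (mean, log-std) of the main model p_theta
psi   = np.array([0.5, 0.5])    # (mean, log-std) of the proxy q_psi
u     = np.zeros(3)             # slacks (u_f, u_r, u_p) >= 0
lam   = np.zeros(3)             # multipliers (lam_f, lam_r, lam_p) >= 0
a_th, a_ps, a_u, a_lam = 0.05, 0.05, 0.05, 0.01

for step in range(2001):
    _, d = lagrangian(theta, psi, u, lam)
    theta = theta - a_th * num_grad(lambda t: lagrangian(t, psi, u, lam)[0], theta)
    psi   = psi   - a_ps * num_grad(lambda s: lagrangian(theta, s, u, lam)[0], psi)
    u     = np.maximum(0.0, u - a_u * (2 * u - lam))   # grad_u   L = 2u - lam
    lam   = np.maximum(0.0, lam + a_lam * (d - u))     # grad_lam L = D - u
    if step % 500 == 0:
        print(step, "KLs:", d.round(3), "u:", u.round(3), "lam:", lam.round(3))
```

Because the multipliers start at zero and grow only while their constraints are violated, the trade-off between the forward, reverse, and proxy KL terms is set adaptively by the optimization itself rather than by fixed penalty weights.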

5. Theoretical Guarantees and Properties

  • Duality Gap Bound: Under mild conditions (e.g., Lipschitz continuity and universal approximation properties of the model classes), the gap between primal and dual optima is bounded:

$$0 \leq P^\star - D^\star \leq B\,\nu$$

where $\nu$ measures the total-variation approximation error and $B$ is a finite constant.

  • Gradient Forms:
    • $\nabla_\theta L$ employs normalizing-flow score gradients and the reverse KL to $q_\psi$.
    • $\nabla_\psi L$ incorporates samples from $p_\theta$ and estimates of $\nabla_\psi \log Z(\psi)$ via importance sampling (see the sketch after this list).
    • $\nabla_\lambda L$ directly corresponds to the constraint violations.
  • Optimization Dynamics: Training stabilizes via joint cooperation between $p_\theta$ and $q_\psi$, avoiding the adversarial instability typical of min-max setups such as GANs.
  • Convergence: Although the objectives are nonconvex and global convergence guarantees are absent, empirical results indicate robust and stable convergence in diverse settings.
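
As a concrete instance of the $\nabla_\psi L$ term, the gradient of the EBM's log-partition function has the standard form $\nabla_\psi \log Z(\psi) = \mathbb{E}_{q_\psi}\!\left[\nabla_\psi f_\psi(x)\right]$, which can be estimated by self-normalized importance sampling with $p_\theta$ as the proposal. This is a textbook estimator sketched here under the paper's stated use of importance sampling, not a quoted formula:

$$
\nabla_\psi \log Z(\psi) \;\approx\; \sum_{j=1}^{M} w_j\, \nabla_\psi f_\psi(y_j),
\qquad
w_j = \frac{\exp\!\big(f_\psi(y_j) - \log p_\theta(y_j)\big)}{\sum_{k=1}^{M} \exp\!\big(f_\psi(y_k) - \log p_\theta(y_k)\big)},
\qquad y_j \sim p_\theta .
$$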

6. Empirical Evaluation and Applications

The MSKL framework has been empirically benchmarked across the following regimes (Ben-Dov et al., 14 Nov 2025):

  • Synthetic 2D GMM (40 components): Adaptive MSKL achieves lower and more stable test negative log-likelihood (NLL) than fixed-weight baselines, maintaining partition function normalization ($Z \approx 1$) for the EBM proxy.
  • Structured 2D Data (rings, moons, grid, spiral): MSKL consistently recovers all modes and attains lower held-out NLL compared to normalizing flow only or fixed-weight symmetric penalties, demonstrating robust mode coverage.
  • High-dimensional Image Latents (CelebA, 100-d CAE latent): MSKL achieves Fréchet Inception Distance (FID) comparable to baseline NFs (∼48), indicating preservation of generative sample quality through the NF-EBM collaboration.
  • Simulation-Based Inference (two-moons, GMM SBI): Posterior samples become statistically indistinguishable from ground truth (as per C2ST ≈ 0.5), with fewer simulator calls than forward-KL-only baselines.

| Task | Baseline | MSKL Outcome |
| --- | --- | --- |
| 2D GMM (40 comp.) | NF, fixed weights | Lower NLL, stable normalization |
| Manifold 2D (rings, moons, grid, spiral) | NF, fixed weights | All modes, lowest held-out NLL |
| CelebA 100-d latent (FID) | NF | Matches baseline FID (∼48) |
| Simulation-based inference (SBI) | NF | Statistically indistinguishable posteriors |

Across scenarios, the adaptive MSKL mechanism consistently yields more robust and stable results than fixed penalty weights or adversarial min-max methods, supporting its applicability to density estimation, image generation, and likelihood-free inference.

7. Significance and Practical Considerations

MSKL provides a systematic and tractable methodology for symmetrizing statistical divergences via learned adaptation, circumventing the intractability of reverse KL terms when the data distribution is accessible only through samples. The introduction of a proxy model, parameterized as a flexible EBM and trained in concert with a tractable generative model (NF), enables approximate Jeffreys divergence minimization without adversarial instability or brittle penalty weighting. The primal–dual GDA training procedure further grounds the approach in constrained optimization, offering theoretical guarantees on the duality gap under reasonable assumptions and demonstrating practical stability across diverse experimental regimes (Ben-Dov et al., 14 Nov 2025).
