Proxy KL-Divergence Loss

Updated 16 March 2026
  • Proxy KL-divergence Loss is a surrogate function that replaces direct KL computations with tractable alternatives to stabilize training.
  • It is employed across flow matching, reinforcement learning, and generative modeling to provide unbiased or bounded gradient estimates.
  • Using proxy losses improves empirical stability, sample efficiency, and optimization control while mitigating issues inherent in direct KL minimization.

A proxy KL-divergence loss is any loss function that serves as a computational or statistical surrogate for the intractable or unstable direct measurement of Kullback–Leibler (KL) divergence between probability distributions in learning algorithms. Such losses provide gradients or optimization targets that—in expectation or under suitable conditions—approximately or provably control the true KL divergence of interest. Proxy KL-divergence losses are fundamental in modern machine learning, with wide-ranging applications in generative modeling, reinforcement learning (RL), signal and image processing, and representation learning. Their design, analysis, and empirical properties have been the subject of extensive research, both for their statistical convergence guarantees and as a core mechanism to mitigate pathologies of direct KL minimization.

1. Theoretical Principles of Proxy KL-Divergence Losses

The essential principle underlying a proxy KL-divergence loss is to replace the direct but intractable or statistically unreliable KL term $\mathrm{KL}(P\|Q)$ with an alternative, efficiently computable objective whose minimization ensures bounded deviation from the true KL. The need for such a replacement may arise due to:

  • Inaccessibility of one density (e.g., only samples from $P$, an explicit density only for $Q$)
  • Unstable gradient estimation for certain models (e.g., vanishing/exploding gradients)
  • Need for unbiased, tractable gradients for high-dimensional parameter spaces

A key result formalizing this paradigm is found in flow matching. Given a true path of distributions $p_t$ generated by a velocity field $v(x,t)$ and a learned velocity $\hat v(x,t)$ yielding model marginals $q_t$, the $L_2$ flow-matching loss,

$\varepsilon^2 = \mathbb{E}_{t\sim U[0,1],\, x\sim p_t}\|\hat v(x,t)-v(x,t)\|^2_2,$

satisfies the bound:

$\mathrm{KL}(p_1\|\hat{Q}) \leq A_1 \varepsilon + A_2 \varepsilon^2$

where $A_1, A_2$ encode regularity of the data and velocity (Su et al., 7 Nov 2025). Thus, minimizing the proxy loss $\varepsilon^2$ achieves statistical control of $\mathrm{KL}(p_1\|\hat Q)$ as training progresses. Analogous surrogate losses appear in diffusion models (score matching), density ratio estimation ($\alpha$-divergence vs. KL), contrastive representation learning, and others.
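
To make the surrogate concrete, the following is a minimal PyTorch sketch of the empirical $L_2$ flow-matching objective. It assumes a linear interpolation path with target velocity $x_1 - x_0$ (rectified-flow-style conditional flow matching); the architecture and names are illustrative, not taken from the cited work.

```python
# Minimal sketch of the empirical L2 flow-matching proxy loss.
# Assumes a linear interpolation path x_t = (1 - t) x0 + t x1 with target
# velocity x1 - x0 (rectified-flow-style); illustrative, not the cited setup.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Small MLP predicting the velocity v_hat(x, t)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t[:, None]], dim=-1))

def flow_matching_loss(v_hat: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of eps^2 = E_{t, x ~ p_t} ||v_hat(x, t) - v(x, t)||^2."""
    x0 = torch.randn_like(x1)                       # base (noise) samples
    t = torch.rand(x1.shape[0])                     # t ~ U[0, 1]
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1    # point on the interpolation path
    target_v = x1 - x0                              # velocity of the linear path
    return ((v_hat(xt, t) - target_v) ** 2).sum(dim=-1).mean()

# Usage: one optimization step on a batch of "data" samples x1.
model = VelocityNet(dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = flow_matching_loss(model, torch.randn(256, 2))
loss.backward()
opt.step()
```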

2. Classes and Construction of Proxy KL-Divergence Losses

Proxy KL-divergence losses arise in multiple modeling paradigms:

Model Class | Proxy Loss | Guarantees or Function
--- | --- | ---
Neural ODE / flow matching | $L_2$ flow-matching error | Bounds the true KL divergence; deterministic control
Distributional RL | Surrogate cross-entropy, REINFORCE-style losses with baselines | Unbiased forward-KL gradient estimation for policies
Neural density ratio estimation | $\alpha$-Div loss (bounded $f$-divergence, $0 < \alpha < 1$) | Bounded variance, non-vanishing gradients, precise RMSE
Image restoration / representation | Closed-form KL between Gaussianized features; empirical histogram KL on deep feature maps | Global distribution matching in frequency/perceptual domains
Symmetric / $f$-divergence learning | Proxy model $r_\varphi$ for intractable reverse KL in Jeffreys divergence | Symmetric divergence minimization with constraints

For instance, in flow matching, the $L_2$ velocity field error is a deterministic surrogate whose training directly controls the final KL divergence (Su et al., 7 Nov 2025). In signal processing, KL between empirical Gaussians of amplitude/phase in Fourier space provides a global, distributional proxy for frequency alignment (Xingyang et al., 16 Sep 2025).
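
As a simplified illustration of the frequency-domain idea (not the exact loss of the cited method, which may use separate amplitude and phase statistics), one can fit univariate Gaussians to the log-amplitude spectra of prediction and target and penalize their closed-form KL:

```python
# Simplified sketch of a Fourier-domain Gaussian KL proxy (illustrative only;
# the cited method may use different statistics, e.g. separate amplitude/phase terms).
import torch

def gaussian_kl(mu1, var1, mu2, var2, eps: float = 1e-8):
    """Closed-form KL( N(mu1, var1) || N(mu2, var2) ) for scalar Gaussians."""
    return 0.5 * (torch.log((var2 + eps) / (var1 + eps))
                  + (var1 + (mu1 - mu2) ** 2) / (var2 + eps) - 1.0)

def fourier_kl_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Match the distribution of log-amplitudes of the 2-D FFT (global, not pixel-wise)."""
    amp_p = torch.log1p(torch.fft.rfft2(pred).abs()).flatten(1)
    amp_t = torch.log1p(torch.fft.rfft2(target).abs()).flatten(1)
    mu_p, var_p = amp_p.mean(dim=1), amp_p.var(dim=1)
    mu_t, var_t = amp_t.mean(dim=1), amp_t.var(dim=1)
    return gaussian_kl(mu_p, var_p, mu_t, var_t).mean()

# Usage: combine with a pixel-wise loss on a batch of restored images.
pred = torch.rand(4, 64, 64, requires_grad=True)
target = torch.rand(4, 64, 64)
loss = torch.nn.functional.l1_loss(pred, target) + 0.1 * fourier_kl_loss(pred, target)
loss.backward()
```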

3. Proxy KL-Divergence Losses in Practice: Implementation, Stability, and Limitations

The empirical success and stability of proxy KL-divergence losses depend on bias, sample complexity, tractability, and their expressiveness relative to data and model regularity. Examples:

  • In RLHF for LLMs, surrogate losses for KL regularization must include the score function term:

$\mathbb{E}_{y\sim\pi_\theta}\Big[ \big(\log\pi_\theta(y)-\log\pi_\mathrm{ref}(y)\big)\,\nabla_\theta \log\pi_\theta(y) \Big]$

Implementing only the naïve path-wise gradient (e.g., autodiff through the sampled log-ratio) omits this term, yielding gradients that are zero in expectation or biased. Correct estimators use cumulative or sequence-wise surrogates for unbiased forward-KL gradients (Tang et al., 11 Jun 2025); a minimal sketch appears after this list.

  • In representation learning, cross-entropy (KL) losses can display instability due to the unboundedness and asymmetry of KL. Proxy losses based on bounded $f$-divergences (e.g., total variation, Jensen–Shannon) or carefully constructed kernel-based losses yield empirically tighter clustering, more stable training, and improved downstream metrics (Shone et al., 5 Sep 2025).
  • In neural density ratio estimation, variational objectives based on the $\alpha$-divergence for $0<\alpha<1$ provide unbiased, stable, bounded proxy losses compared to direct KL, especially in the presence of extreme density ratios or model mismatches (Kitazawa, 2024).
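
As referenced in the first bullet above, the following toy PyTorch sketch contrasts a surrogate whose gradient carries the score-function term with the naïve path-wise estimator, using a single categorical policy as a stand-in for a sequence model; the setup and names are illustrative, not the exact estimator of the cited work.

```python
# Illustrative sketch: correct vs. naive surrogate for the sampled KL regularizer
# between pi_theta and a frozen pi_ref, over a single categorical "action" (a toy
# stand-in for a sequence-level policy; not the cited estimator).
import torch

logits = torch.randn(8, requires_grad=True)          # pi_theta parameters
ref_logits = torch.randn(8)                          # frozen reference policy

def score_function_surrogate(logits, ref_logits, n_samples=4096):
    logp = torch.log_softmax(logits, dim=-1)
    logp_ref = torch.log_softmax(ref_logits, dim=-1)
    y = torch.multinomial(logp.exp(), n_samples, replacement=True)  # y ~ pi_theta
    log_ratio = (logp[y] - logp_ref[y]).detach()      # stop-gradient on the weight
    # Gradient of this surrogate is E_y[ log_ratio * grad log pi_theta(y) ],
    # i.e. it carries the score-function term; its *value* is not the KL itself.
    return (log_ratio * logp[y]).mean()

def naive_pathwise(logits, ref_logits, n_samples=4096):
    logp = torch.log_softmax(logits, dim=-1)
    logp_ref = torch.log_softmax(ref_logits, dim=-1)
    y = torch.multinomial(logp.exp(), n_samples, replacement=True)
    # Autodiff through the sampled log-ratio only: the gradient has zero expectation.
    return (logp[y] - logp_ref[y]).mean()

score_function_surrogate(logits, ref_logits).backward()
print("surrogate grad norm:", logits.grad.norm())    # nonzero, tracks the KL gradient
logits.grad = None
naive_pathwise(logits, ref_logits).backward()
print("naive grad norm:", logits.grad.norm())        # expectation-zero; shrinks with n_samples
```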

Implementation recipes are context-dependent but follow a common principle: replace the classical sample-based or plug-in KL estimator by a surrogate loss whose empirical minimization provably or empirically ensures proximity to the desired statistical divergence.

4. Statistical and Optimization Guarantees

The principal advantage of proxy KL-divergence losses is the provable control of the true divergence in the limit of vanishing proxy loss and under assumptions on model/data regularity.

  • Flow Matching KL Bound: For flow-matching-based generative models, statistical convergence in total variation (via Pinsker’s inequality) and minimax-optimal efficiency are guaranteed by deterministic bounds connecting the $L_2$ flow-matching loss to the KL divergence (Su et al., 7 Nov 2025).
  • Density Ratio Estimation: The $\alpha$-divergence proxy reduces sample complexity, ensures stable, non-vanishing gradients, and retains asymptotic unbiasedness, avoiding the exponential variance scaling of sample-based KL estimators at large divergence (Kitazawa, 2024); a generic sketch of this style of objective follows this list.
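
The sketch below illustrates this style of objective with a generic $\alpha$-divergence density-ratio loss obtained from the standard $f$-divergence (Bregman) density-ratio fitting recipe; it is an assumption-laden illustration, not the exact estimator of the cited work.

```python
# Generic alpha-divergence density-ratio objective (illustrative; derived from the
# standard f-divergence / Bregman density-ratio fitting recipe, and not claimed to
# be the exact loss of the cited work).
import torch
import torch.nn as nn

class LogRatioNet(nn.Module):
    """Parameterizes log r_theta(x); exponentiating keeps the ratio positive."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def alpha_div_loss(log_r_p, log_r_q, alpha: float = 0.5):
    """
    Empirical objective (up to additive constants)
        (1/alpha) E_q[ r^alpha ] - (1/(alpha-1)) E_p[ r^(alpha-1) ],
    minimized at r = p/q. For 0 < alpha < 1 both powers are sub-linear in r,
    which keeps per-sample terms and gradients bounded, unlike plug-in KL objectives.
    """
    term_q = torch.exp(alpha * log_r_q).mean() / alpha
    term_p = -torch.exp((alpha - 1.0) * log_r_p).mean() / (alpha - 1.0)
    return term_q + term_p

# Usage: x_p ~ p (numerator samples), x_q ~ q (denominator samples).
model = LogRatioNet(dim=2)
x_p, x_q = torch.randn(512, 2) + 1.0, torch.randn(512, 2)
loss = alpha_div_loss(model(x_p), model(x_q), alpha=0.5)
loss.backward()
```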

Proxy losses are particularly effective when the exact KL is inaccessible, ill-behaved, or fails to provide adequate optimization geometry. However, the empirical performance can depend on careful hyperparameter choices (e.g., $\alpha$, temperature in kernels), and task alignment must be monitored, as minimizing a proxy divergence does not guarantee optimality for all downstream metrics (Shone et al., 5 Sep 2025).

5. Extensions to Symmetrized and Generalized Divergences

Direct minimization of symmetric divergences such as the Jeffreys (symmetrized KL) is often intractable. Proxy KL-divergence constructions introduce a learnable reference or "proxy" model, converting an intractable reverse KL into a tractable forward KL against the proxy:

  • The model $q_\theta$ learns via penalized or constrained minimization:

$\mathrm{KL}(p_\mathrm{data}\|q_\theta) + \mathrm{KL}(q_\theta\|r_\varphi) \quad \text{subject to } \mathrm{KL}(p_\mathrm{data}\|r_\varphi) \leq \varepsilon$

with $r_\varphi$ trained to fit $p_\mathrm{data}$, and $q_\theta$ regularized via $r_\varphi$ (Ben-Dov et al., 14 Nov 2025).

This framework automatically balances mode-seeking (reverse KL) and mode-covering (forward KL), leading to stable optimization without adversarial tricks, and provably closes duality gaps under rich model classes.
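
A toy categorical sketch of the penalized form of this construction follows; it is illustrative only (real instantiations use neural density models and the constrained formulation above), with $r_\varphi$ pre-fit to the data by forward KL and then frozen.

```python
# Toy categorical sketch of the proxy-based symmetrized-KL objective:
# minimize over theta  KL(p_data || q_theta) + KL(q_theta || r_phi),
# with r_phi pre-fit to p_data (simple penalized form; illustrative only).
import torch

def kl(p, q, eps=1e-12):
    """Categorical KL(p || q)."""
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum()

p_data = torch.tensor([0.5, 0.3, 0.15, 0.05])   # target distribution
phi = torch.zeros(4, requires_grad=True)        # proxy model logits
theta = torch.zeros(4, requires_grad=True)      # main model logits

# Step 1: fit the proxy r_phi to the data by forward KL (maximum likelihood).
opt_phi = torch.optim.Adam([phi], lr=0.1)
for _ in range(200):
    opt_phi.zero_grad()
    kl(p_data, torch.softmax(phi, -1)).backward()
    opt_phi.step()

# Step 2: train q_theta with the tractable forward KL to the data plus a
# KL to the frozen proxy, standing in for the intractable reverse KL to p_data.
opt_theta = torch.optim.Adam([theta], lr=0.1)
r_phi = torch.softmax(phi, -1).detach()
for _ in range(200):
    opt_theta.zero_grad()
    q = torch.softmax(theta, -1)
    (kl(p_data, q) + kl(q, r_phi)).backward()
    opt_theta.step()

print(torch.softmax(theta, -1))   # close to p_data in this toy setting
```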

6. Proxy KL-Divergence Losses in Discriminative and Generative Learning

Generalizing the cross-entropy loss to arbitrary $f$-divergences yields a family of Fenchel–Young losses:

$\ell_f(\theta, y; q) = \mathrm{softmax}_f(\theta; q) - \langle y, \theta \rangle + D_f(y\|q)$

with an associated $f$-softargmax operator. For $f(u) = u\log u$ one recovers the classical logistic/cross-entropy loss (KL divergence), while Tsallis or other choices of $f$ lead to alternative proxy losses with improved empirical performance in certain tasks, e.g., next-token prediction and distillation with $\alpha=1.5$ (Roulet et al., 30 Jan 2025).
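
As a sanity check of the construction, the sketch below evaluates the Fenchel–Young loss for $f(u)=u\log u$ with a uniform reference $q$ and a one-hot label, in which case $D_f(y\|q)$ is the constant $\log n$ and the loss coincides with softmax cross-entropy; this is an illustrative reduction, since the general $f$-softmax requires solving a maximization over the simplex.

```python
# Illustrative check: the Fenchel-Young loss for f(u) = u log u with a uniform
# reference q reduces to the usual softmax cross-entropy (one-hot labels assumed).
import torch
import torch.nn.functional as F

def fy_loss_kl(theta: torch.Tensor, y_index: int) -> torch.Tensor:
    n = theta.numel()
    q = torch.full((n,), 1.0 / n)
    # f-softmax value: max_p <p, theta> - KL(p || q) = log sum_i q_i exp(theta_i)
    softmax_f = torch.logsumexp(theta + torch.log(q), dim=-1)
    # D_f(y || q) for a one-hot y and uniform q is log n
    d_f = torch.log(torch.tensor(float(n)))
    return softmax_f - theta[y_index] + d_f

theta = torch.randn(10)
y = 3
print(fy_loss_kl(theta, y))                             # Fenchel-Young loss value
print(F.cross_entropy(theta[None], torch.tensor([y])))  # matches
```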

In representation learning, replacing KL with TV, JSD, or Hellinger divergence in the loss grants both theoretical stability and substantial empirical gains, especially with suitable similarity kernels (Shone et al., 5 Sep 2025).

7. Empirical Outcomes and Comparative Evaluations

Empirical studies consistently show that proxy KL-divergence losses can outperform direct KL-based objectives when the latter are unstable, intractable, or misaligned with the desired outcome. Major findings include:

  • Image frequency/structural alignment: Fourier-domain or perceptual-vector KL losses (proxying global distributional shifts) surpass pixel-wise MSE or $L_1$ in perceptual and quantitative quality (Xingyang et al., 16 Sep 2025).
  • Density estimation/Generative models: Proxy-based symmetrized KL losses yield lower NLL, recover multimodal distributions more consistently, and support more aggressive learning rates without collapse (Ben-Dov et al., 14 Nov 2025).
  • Contrastive/Clustering/Dimensionality reduction: TV and JSD proxies avoid "crowding" and sensitivity to missing neighbor mass, improving cluster purity and downstream accuracy beyond KL (Shone et al., 5 Sep 2025).
  • Language modeling and SFT/distillation: Fenchel–Young losses from the Tsallis $\alpha$-divergence with $\alpha=1.5$ marginally but consistently outperform cross-entropy (Roulet et al., 30 Jan 2025).

In conclusion, proxy KL-divergence losses constitute a rigorously understood, widely adopted, and technically versatile class of objectives enabling stable, tractable, and effective approximation of KL divergence in diverse machine learning contexts. Their construction and selection depend on statistical principles, computational tractability, and alignment with downstream metrics and optimization criteria.
