
Regularized KKL: Skewed/Smoothed Variant

Updated 8 March 2026
  • Regularized KKL is a framework that extends classic KL divergence by incorporating convex mixtures to handle non-overlapping supports in kernel spaces.
  • It enables closed-form computations for empirical point clouds and offers finite sample consistency by balancing bias and variance through the regularization parameter.
  • In nonlinear control, the method informs observer design using nonlinear contraction dynamics that improve convergence speed and reduce noise sensitivity.

Regularized KKL (Skewed/Smoothed Variant) refers to regularization mechanisms applied to generalizations of the Kullback-Leibler (KL) divergence, as developed in the context of kernel methods and quantum operator theory, and more broadly, to the design of nonlinear observers for nonlinear dynamical systems where regularization is induced via contraction and nonlinearities in observer dynamics. These regularized KKL methods appear prominently in two domains: (1) statistical learning, where they extend KL-type divergences to work robustly for non-overlapping supports and discrete measures; and (2) nonlinear control, where they yield observer architectures balancing speed and robustness to noise. The term "regularized" (as well as "skewed" or "smoothed") highlights the introduction of additional convex combinations or nonlinear dynamical terms to ensure desirable mathematical and statistical properties for both learning and filtering scenarios.

1. Kernel Kullback-Leibler Divergence and Its Regularization

The kernel Kullback-Leibler (KKL) divergence generalizes the classic KL divergence by replacing density ratios with operator-level comparisons of covariance embeddings in a reproducing kernel Hilbert space (RKHS). For probability distributions $p, q$ on $\mathbb{R}^d$ and a kernel $k$ with feature map $\phi$, the respective kernel covariance operators are $\Sigma_p = \int \phi(x) \otimes \phi(x)\,dp(x)$ and $\Sigma_q = \int \phi(x) \otimes \phi(x)\,dq(x)$. The original KKL divergence is defined as

$$\mathrm{KKL}(p\|q) = \mathrm{Tr}\!\left[ \Sigma_p \left( \log \Sigma_p - \log \Sigma_q \right) \right].$$

This operator-level quantity, also known as the "quantum" KL divergence, is finite only when $p$ is absolutely continuous with respect to $q$. When $\operatorname{supp}(p) \not\subset \operatorname{supp}(q)$, $\mathrm{KKL}(p\|q)$ is infinite (Chazal et al., 2024).

Regularization is introduced through a "skewed" or "smoothed" version to remedy this shortcoming. For $\alpha \in (0,1)$, one considers a convex mixture of $q$ and $p$:

$$q_{\alpha} = (1-\alpha)\,q + \alpha\, p, \qquad \Sigma_{q_\alpha} = (1-\alpha)\,\Sigma_q + \alpha\,\Sigma_p,$$

which yields the regularized KKL divergence:

$$\mathrm{KKL}_{\alpha}(p\|q) := \mathrm{KKL}(p\|q_{\alpha}) = \mathrm{Tr}[\Sigma_p \log \Sigma_p] - \mathrm{Tr}\!\left[\Sigma_p \log\big((1-\alpha)\Sigma_q + \alpha \Sigma_p\big)\right].$$

This ensures that $\mathrm{KKL}_\alpha$ is defined for all distributions, even with disjoint supports, since $\Sigma_{q_\alpha} \succeq \alpha\,\Sigma_p$ guarantees that the mixture covariance covers the support of $\Sigma_p$.
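The following minimal sketch illustrates this construction under simplifying assumptions: instead of the exact RKHS formulation, it uses an explicit finite-dimensional feature map (random Fourier features approximating a Gaussian kernel), so the covariance operators become ordinary matrices. The feature dimension, bandwidth, and value of $\alpha$ are illustrative choices, not values from Chazal et al. (2024).

```python
import jax.numpy as jnp
from jax import random

def rff_features(x, omega, b):
    """Random Fourier features approximating a Gaussian-kernel feature map."""
    d_feat = omega.shape[1]
    return jnp.sqrt(2.0 / d_feat) * jnp.cos(x @ omega + b)

def cov_operator(phi):
    """Empirical covariance operator (1/n) * sum_i phi(x_i) phi(x_i)^T as a matrix."""
    return phi.T @ phi / phi.shape[0]

def trace_A_logB(A, B, tol=1e-10):
    """Tr[A log B], evaluated on the eigen-support of B (0 log 0 := 0)."""
    w, V = jnp.linalg.eigh(B)
    w = jnp.maximum(w, 0.0)
    log_w = jnp.where(w > tol, jnp.log(jnp.where(w > tol, w, 1.0)), 0.0)
    quad = jnp.einsum("ij,jk,ki->i", V.T, A, V)   # v_i^T A v_i for each eigenvector
    return jnp.sum(log_w * quad)

def kkl_alpha(x, y, omega, b, alpha=0.1):
    """KKL_alpha = Tr[S_p log S_p] - Tr[S_p log((1-alpha) S_q + alpha S_p)]."""
    S_p = cov_operator(rff_features(x, omega, b))
    S_q = cov_operator(rff_features(y, omega, b))
    S_mix = (1.0 - alpha) * S_q + alpha * S_p
    return trace_A_logB(S_p, S_p) - trace_A_logB(S_p, S_mix)

# Toy usage: two point clouds with disjoint (empirical) supports still give a finite value.
key = random.PRNGKey(0)
k1, k2, k3, k4 = random.split(key, 4)
x = random.normal(k1, (200, 2))              # samples from p
y = random.normal(k2, (150, 2)) + 3.0        # samples from q, shifted away from p
omega = random.normal(k3, (2, 64))           # RFF frequencies (unit bandwidth assumed)
b = random.uniform(k4, (64,), maxval=2.0 * jnp.pi)
print(float(kkl_alpha(x, y, omega, b, alpha=0.1)))
```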

2. Closed-Form Expressions for Point Clouds

For empirical measures composed of finite point sets, $\hat{p} = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\hat{q} = \frac{1}{m}\sum_{j=1}^m \delta_{y_j}$, one forms the Gram matrices:

  • $K_{xx} \in \mathbb{R}^{n\times n}$, $K_{yy} \in \mathbb{R}^{m\times m}$, $K_{xy} \in \mathbb{R}^{n\times m}$, with $(K_{xx})_{ij} = k(x_i, x_j)$, $(K_{yy})_{ij} = k(y_i, y_j)$, $(K_{xy})_{ij} = k(x_i, y_j)$.
  • The mixed Gram matrix $K \in \mathbb{R}^{(n+m) \times (n+m)}$ assembles the covariance structure across the combined sample.

The regularized divergence for empirical measures admits a matrix trace formula:

$$\mathrm{KKL}_\alpha(\hat{p} \| \hat{q}) = \mathrm{Tr}\!\left[\frac{1}{n} \log\!\left(\frac{1}{n}\right)\right] - \mathrm{Tr}\!\left[ I_\alpha K \log K \right],$$

where $I_\alpha$ is block-diagonal with $1/\alpha$ on the $n \times n$ block and zeros elsewhere. This enables $\mathrm{KKL}_\alpha$ to be computed in closed form in $O((n+m)^3)$ time via diagonalization, making practical implementation viable for moderate sample sizes (Chazal et al., 2024).
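The hypothetical sketch below, using a Gaussian kernel and toy data, only assembles these Gram-matrix building blocks and shows where the cubic cost arises; it does not reproduce the exact trace formula above.

```python
import jax.numpy as jnp
from jax import random

def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = jnp.sum(a**2, axis=1)[:, None] + jnp.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return jnp.exp(-sq / (2.0 * bandwidth**2))

key = random.PRNGKey(1)
kx, ky = random.split(key)
x_pts = random.normal(kx, (40, 2))   # n = 40 samples of p
y_pts = random.normal(ky, (30, 2))   # m = 30 samples of q

K_xx = gaussian_gram(x_pts, x_pts)   # n x n
K_yy = gaussian_gram(y_pts, y_pts)   # m x m
K_xy = gaussian_gram(x_pts, y_pts)   # n x m
z = jnp.concatenate([x_pts, y_pts], axis=0)
K = gaussian_gram(z, z)              # mixed (n+m) x (n+m) Gram matrix

# The dominant cost of the closed-form evaluation is this diagonalization step.
eigvals, eigvecs = jnp.linalg.eigh(K)   # O((n+m)^3)
```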

3. Theoretical Properties: Consistency and Geometry

The regularization parameter $\alpha$ directly controls the interpolation between strict KKL and degeneracy:

  • As $\alpha \to 0$, $\mathrm{KKL}_\alpha \to \mathrm{KKL}$ at rate $O(\alpha)$. Sharp upper bounds quantify this deviation when $p$ is absolutely continuous with respect to $q$ and the density ratio is bounded.
  • Finite-sample bounds scale as $1/(\alpha\sqrt{n})$ for empirical means and as $O((\log n)^2/(\alpha n))$ for higher moments, making the choice of $\alpha$ critical for the bias-variance trade-off in finite-sample settings.

Geometrically, $\mathrm{KKL}_\alpha$ can be interpreted as a kernel-smoothed quantum/standard KL functional, lying between the standard KL divergence and the natural kernel smoothing of $p$ and $q$. In Wasserstein geometry, this regularization ensures $\mathrm{KKL}_\alpha$ is smooth for discrete distributions (finite-rank operators), allowing well-posed gradient flows and optimization (Chazal et al., 2024).
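Reusing `kkl_alpha` and the toy clouds `x`, `y`, `omega`, `b` from the sketch in Section 1, a short sweep over $\alpha$ illustrates this interpolation on the finite-dimensional approximation; the printed numbers are illustrative, not results from the paper.

```python
# Sweep the regularization parameter on the toy example from Section 1.
# Smaller alpha weights the mixture toward q, moving KKL_alpha toward its
# unregularized limit (infinite in the RKHS for these disjoint empirical
# supports; finite here only because of the finite-dimensional features).
for alpha in (0.5, 0.2, 0.1, 0.05, 0.01):
    val = float(kkl_alpha(x, y, omega, b, alpha=alpha))
    print(f"alpha = {alpha:5.2f}   KKL_alpha = {val:.4f}")
```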

4. Wasserstein Gradient Descent: Optimization via Regularized KKL

Gradient flows of $\mathrm{KKL}_\alpha$ can be computed explicitly in the Wasserstein metric for empirical measures:

$$\frac{\partial}{\partial t} p_t + \nabla \cdot \left( p_t\, \nabla F'(p_t) \right) = 0,$$

where $F(p) = \mathrm{KKL}_\alpha(p\|q)$ and the first variation $F'(p)$ has a closed form involving kernel evaluations and the eigendecomposition of $K$. The time-discretized push-forward update for particles,

$$x_i^{\ell+1} = x_i^\ell - \gamma\, \nabla_x F'(\hat{p}_\ell)(x_i^\ell),$$

defines a method akin to Stein variational gradient descent (SVGD), with the regularized divergence providing the objective landscape.
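A minimal sketch of this particle update follows, reusing the toy clouds and `kkl_alpha` from Section 1. The gradient is obtained by automatic differentiation of that finite-dimensional approximation rather than the closed-form first variation $F'(p)$ derived in the paper; the step size and iteration count are illustrative and would need tuning in practice.

```python
import jax

# Particle update: x_i <- x_i - gamma * grad of the (approximate) objective w.r.t. x_i.
def descend_particles(x0, y, omega, b, alpha=0.1, step=0.1, n_steps=100):
    grad_fn = jax.grad(lambda pts: kkl_alpha(pts, y, omega, b, alpha))
    pts = x0
    for _ in range(n_steps):
        pts = pts - step * grad_fn(pts)
    return pts

x_new = descend_particles(x, y, omega, b)
print("before:", float(kkl_alpha(x, y, omega, b)),
      " after:", float(kkl_alpha(x_new, y, omega, b)))
```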

Efficient implementation recommendations:

  • Precompute Gram matrices and diagonalize $K$ at each iteration.
  • For moderate $n$, the computational cost is dominated by the eigendecomposition.
  • Choosing $\alpha \gg 1/\sqrt{n}$ controls the variance.
  • Quasi-Newton methods (L-BFGS) and automatic differentiation can accelerate convergence and gradient computation (Chazal et al., 2024).

5. Nonlinear Contracting Regularization in Observer Design

In nonlinear observer theory, regularized KKL frameworks refer to the replacement of linear filters with nonlinear, contracting dynamics, built from scalar contraction kernels, to improve robustness and convergence. Here, the system is assumed strongly differentially observable; the observer state $z\in\mathbb{R}^m$ evolves via

$$\dot{z} = k\,\Sigma(z, y) \quad \text{with} \quad \Sigma(z, y) = \big[\lambda_0\, \sigma(z_0, y),\ \ldots,\ \lambda_{m-1}\, \sigma(z_{m-1}, y)\big]^\top$$

for $k>0$ and a contraction kernel $\sigma$ ensuring strong contraction in $z$.

The key regularization mechanism:

  • The nonlinearity of $\sigma$ is designed so that for large residuals the observer exhibits a fast correction (gain $a_\text{fast}$), while for small errors the response is slow (gain $a_\text{slow}$), improving noise rejection.
  • A concrete form,

$$\sigma(u, y) = a_\text{fast}\,(u - y) + (a_\text{slow} - a_\text{fast})\,\tanh(u - y)$$

achieves this interpolating gain (Pachy et al., 2024).
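A quick numerical check of this interpolating gain is sketched below: the local slope of $\sigma$ is close to $a_\text{slow}$ for small residuals and approaches $a_\text{fast}$ for large ones. The gain values are hypothetical placeholders, not those used by Pachy et al. (2024).

```python
import jax
import jax.numpy as jnp

a_fast, a_slow = 10.0, 1.0   # hypothetical fast/slow gains (not values from the paper)

def sigma(u, y):
    """Contraction nonlinearity: a_fast*(u - y) + (a_slow - a_fast)*tanh(u - y)."""
    e = u - y
    return a_fast * e + (a_slow - a_fast) * jnp.tanh(e)

# Local slope d sigma / d u of the correction term:
dsigma = jax.grad(sigma, argnums=0)
print(float(dsigma(0.01, 0.0)))   # ~ a_slow: gentle response to small (noise-level) errors
print(float(dsigma(5.00, 0.0)))   # ~ a_fast: aggressive correction of large residuals
```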

6. Numerical Illustration and Practical Considerations

Demonstrations on the Duffing oscillator confirm that regularized (nonlinear) KKL observers combine (i) convergence speeds superior to slow-linear designs and (ii) noise sensitivity close to slow-linear but much improved over fast-linear filters. For comparison, mean convergence times for error reduction were 0.83s (fast-linear KKL), 6.79s (slow-linear), 2.27s (nonlinear KKL); mean noise gain factors were 7.57 (fast-linear), 1.15 (slow-linear), 1.95 (nonlinear).

Recommendations for practical use include tuning $\alpha$ or the contraction gains to match data or plant characteristics, precomputing required operator quantities, and using offline calculations or gradient approaches appropriate to the available computational resources (Chazal et al., 2024, Pachy et al., 2024).

7. Connections and Distinctive Features

Regularized KKL divergences and observers provide a unified approach to addressing limitations of both classical and kernel-based divergences in statistics and observer/filter designs in control:

  • In statistical learning, they address support mismatch and variance explosion, ensuring well-posedness for discrete, empirical, or non-overlapping distributions.
  • In nonlinear filtering, regularized nonlinear dynamics allow simultaneous acceleration of convergence and attenuation of noise, unattainable with linear gain choices.
  • Both frameworks exploit the smoothing/regularizing effects of convex mixing (in the divergence case) or nonlinear contraction (in the observer case), and can be implemented efficiently for finite samples or state dimensions.

The development of these regularized KKL frameworks has established robust methodologies for both statistical comparison of probability distributions and the design of nonlinear observers, with the dual effect of enhancing both theoretical properties and practical applicability (Chazal et al., 2024, Pachy et al., 2024).
