
Regularized KKL: Skewed/Smoothed Variant

Updated 8 March 2026
  • Regularized KKL is a framework that extends classic KL divergence by incorporating convex mixtures to handle non-overlapping supports in kernel spaces.
  • It enables closed-form computations for empirical point clouds and offers finite sample consistency by balancing bias and variance through the regularization parameter.
  • In nonlinear control, the method informs observer design using nonlinear contraction dynamics that improve convergence speed and reduce noise sensitivity.

Regularized KKL (Skewed/Smoothed Variant) refers to regularization mechanisms applied to generalizations of the Kullback-Leibler (KL) divergence, as developed in the context of kernel methods and quantum operator theory, and more broadly, to the design of nonlinear observers for nonlinear dynamical systems where regularization is induced via contraction and nonlinearities in observer dynamics. These regularized KKL methods appear prominently in two domains: (1) statistical learning, where they extend KL-type divergences to work robustly for non-overlapping supports and discrete measures; and (2) nonlinear control, where they yield observer architectures balancing speed and robustness to noise. The term "regularized" (as well as "skewed" or "smoothed") highlights the introduction of additional convex combinations or nonlinear dynamical terms to ensure desirable mathematical and statistical properties for both learning and filtering scenarios.

1. Kernel Kullback-Leibler Divergence and Its Regularization

The kernel Kullback-Leibler (KKL) divergence generalizes the classic KL divergence by replacing density ratios with operator-level comparisons of covariance embeddings in a reproducing kernel Hilbert space (RKHS). For probability distributions $p, q$ on $\mathbb{R}^d$ and a kernel $k$ with feature map $\phi$, the respective kernel covariance operators are $\Sigma_p = \int \phi(x) \otimes \phi(x)\,dp(x)$ and $\Sigma_q = \int \phi(x) \otimes \phi(x)\,dq(x)$. The original KKL divergence is defined as

$$\mathrm{KKL}(p\|q) = \mathrm{Tr}\!\left[ \Sigma_p \left( \log \Sigma_p - \log \Sigma_q \right) \right].$$

This operator-level quantity, also known as the "quantum" KL divergence, is finite only when $p$ is absolutely continuous with respect to $q$. When $\operatorname{supp}(p) \not\subset \operatorname{supp}(q)$, $\mathrm{KKL}(p\|q)$ is infinite (Chazal et al., 2024).

Regularization is introduced through a "skewed" or "smoothed" version to remedy this shortcoming. For $\alpha \in (0,1)$, one considers a convex mixture of $q$ and $p$:

$$q_{\alpha} = (1-\alpha)\,q + \alpha\, p, \qquad \Sigma_{q_\alpha} = (1-\alpha)\,\Sigma_q + \alpha\,\Sigma_p,$$

which yields the regularized KKL divergence:

$$\mathrm{KKL}_{\alpha}(p\|q) := \mathrm{KKL}(p\|q_{\alpha}) = \mathrm{Tr}[\Sigma_p \log \Sigma_p] - \mathrm{Tr}\!\left[\Sigma_p \log\big((1-\alpha)\Sigma_q + \alpha \Sigma_p\big)\right].$$

This ensures that $\mathrm{KKL}_\alpha$ is defined for all distributions, even with disjoint supports, since $\Sigma_{q_\alpha} \succeq \alpha\,\Sigma_p$ guarantees that the mixture covariance covers the support of $\Sigma_p$.
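The following minimal sketch illustrates this construction under simplifying assumptions: instead of the exact RKHS formulation, it uses an explicit finite-dimensional feature map (random Fourier features approximating a Gaussian kernel), so the covariance operators become ordinary matrices. The feature dimension, bandwidth, and value of $\alpha$ are illustrative choices, not values from Chazal et al. (2024).

```python
import jax.numpy as jnp
from jax import random

def rff_features(x, omega, b):
    """Random Fourier features approximating a Gaussian-kernel feature map."""
    d_feat = omega.shape[1]
    return jnp.sqrt(2.0 / d_feat) * jnp.cos(x @ omega + b)

def cov_operator(phi):
    """Empirical covariance operator (1/n) * sum_i phi(x_i) phi(x_i)^T as a matrix."""
    return phi.T @ phi / phi.shape[0]

def trace_A_logB(A, B, tol=1e-10):
    """Tr[A log B], evaluated on the eigen-support of B (0 log 0 := 0)."""
    w, V = jnp.linalg.eigh(B)
    w = jnp.maximum(w, 0.0)
    log_w = jnp.where(w > tol, jnp.log(jnp.where(w > tol, w, 1.0)), 0.0)
    quad = jnp.einsum("ij,jk,ki->i", V.T, A, V)   # v_i^T A v_i for each eigenvector
    return jnp.sum(log_w * quad)

def kkl_alpha(x, y, omega, b, alpha=0.1):
    """KKL_alpha = Tr[S_p log S_p] - Tr[S_p log((1-alpha) S_q + alpha S_p)]."""
    S_p = cov_operator(rff_features(x, omega, b))
    S_q = cov_operator(rff_features(y, omega, b))
    S_mix = (1.0 - alpha) * S_q + alpha * S_p
    return trace_A_logB(S_p, S_p) - trace_A_logB(S_p, S_mix)

# Toy usage: two point clouds with disjoint (empirical) supports still give a finite value.
key = random.PRNGKey(0)
k1, k2, k3, k4 = random.split(key, 4)
x = random.normal(k1, (200, 2))              # samples from p
y = random.normal(k2, (150, 2)) + 3.0        # samples from q, shifted away from p
omega = random.normal(k3, (2, 64))           # RFF frequencies (unit bandwidth assumed)
b = random.uniform(k4, (64,), maxval=2.0 * jnp.pi)
print(float(kkl_alpha(x, y, omega, b, alpha=0.1)))
```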

2. Closed-Form Expressions for Point Clouds

For empirical measures composed of finite point sets, $\hat{p} = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\hat{q} = \frac{1}{m}\sum_{j=1}^m \delta_{y_j}$, one forms the Gram matrices:

  • $K_{xx} \in \mathbb{R}^{n\times n}$, $K_{yy} \in \mathbb{R}^{m\times m}$, $K_{xy} \in \mathbb{R}^{n\times m}$, with $(K_{xx})_{ij} = k(x_i, x_j)$, $(K_{yy})_{ij} = k(y_i, y_j)$, $(K_{xy})_{ij} = k(x_i, y_j)$.
  • The mixed Gram matrix $K \in \mathbb{R}^{(n+m) \times (n+m)}$ assembles the covariance structure across the combined sample.

The regularized divergence for empirical measures admits a matrix trace formula:

$$\mathrm{KKL}_\alpha(\hat{p} \| \hat{q}) = \mathrm{Tr}\!\left[\frac{1}{n} \log\!\left(\frac{1}{n}\right)\right] - \mathrm{Tr}\!\left[ I_\alpha K \log K \right],$$

where $I_\alpha$ is block-diagonal with $1/\alpha$ on the $n \times n$ block and zeros elsewhere. This enables $\mathrm{KKL}_\alpha$ to be computed in closed form in $O((n+m)^3)$ time via diagonalization, making practical implementation viable for moderate sample sizes (Chazal et al., 2024).
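The hypothetical sketch below, using a Gaussian kernel and toy data, only assembles these Gram-matrix building blocks and shows where the cubic cost arises; it does not reproduce the exact trace formula above.

```python
import jax.numpy as jnp
from jax import random

def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = jnp.sum(a**2, axis=1)[:, None] + jnp.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return jnp.exp(-sq / (2.0 * bandwidth**2))

key = random.PRNGKey(1)
kx, ky = random.split(key)
x_pts = random.normal(kx, (40, 2))   # n = 40 samples of p
y_pts = random.normal(ky, (30, 2))   # m = 30 samples of q

K_xx = gaussian_gram(x_pts, x_pts)   # n x n
K_yy = gaussian_gram(y_pts, y_pts)   # m x m
K_xy = gaussian_gram(x_pts, y_pts)   # n x m
z = jnp.concatenate([x_pts, y_pts], axis=0)
K = gaussian_gram(z, z)              # mixed (n+m) x (n+m) Gram matrix

# The dominant cost of the closed-form evaluation is this diagonalization step.
eigvals, eigvecs = jnp.linalg.eigh(K)   # O((n+m)^3)
```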

3. Theoretical Properties: Consistency and Geometry

The regularization parameter $\alpha$ directly controls the interpolation between strict KKL and degeneracy:

  • As $\alpha \to 0$, $\mathrm{KKL}_\alpha \to \mathrm{KKL}$ at rate $O(\alpha)$. Sharp upper bounds quantify this deviation when $p$ is absolutely continuous with respect to $q$ and the density ratio is bounded.
  • Finite-sample bounds scale as $1/(\alpha\sqrt{n})$ for empirical means and as $O((\log n)^2/(\alpha n))$ for higher moments, making the choice of $\alpha$ critical for the bias-variance trade-off in finite-sample settings.

Geometrically, $\mathrm{KKL}_\alpha$ can be interpreted as a kernel-smoothed quantum/standard KL functional, lying between the standard KL divergence and the natural kernel smoothing of $p$ and $q$. In Wasserstein geometry, this regularization ensures $\mathrm{KKL}_\alpha$ is smooth for discrete distributions (finite-rank operators), allowing well-posed gradient flows and optimization (Chazal et al., 2024).
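Reusing `kkl_alpha` and the toy clouds `x`, `y`, `omega`, `b` from the sketch in Section 1, a short sweep over $\alpha$ illustrates this interpolation on the finite-dimensional approximation; the printed numbers are illustrative, not results from the paper.

```python
# Sweep the regularization parameter on the toy example from Section 1.
# Smaller alpha weights the mixture toward q, moving KKL_alpha toward its
# unregularized limit (infinite in the RKHS for these disjoint empirical
# supports; finite here only because of the finite-dimensional features).
for alpha in (0.5, 0.2, 0.1, 0.05, 0.01):
    val = float(kkl_alpha(x, y, omega, b, alpha=alpha))
    print(f"alpha = {alpha:5.2f}   KKL_alpha = {val:.4f}")
```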

4. Wasserstein Gradient Descent: Optimization via Regularized KKL

Gradient flows of $\mathrm{KKL}_\alpha$ can be computed explicitly in the Wasserstein metric for empirical measures:

$$\frac{\partial}{\partial t} p_t + \nabla \cdot \left( p_t\, \nabla F'(p_t) \right) = 0,$$

where $F(p) = \mathrm{KKL}_\alpha(p\|q)$ and the first variation $F'(p)$ has a closed form involving kernel evaluations and the eigendecomposition of $K$. The time-discretized push-forward update for particles,

$$x_i^{\ell+1} = x_i^\ell - \gamma\, \nabla_x F'(\hat{p}_\ell)(x_i^\ell),$$

defines a method akin to Stein variational gradient descent (SVGD), with the regularized divergence providing the objective landscape.
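A minimal sketch of this particle update follows, reusing the toy clouds and `kkl_alpha` from Section 1. The gradient is obtained by automatic differentiation of that finite-dimensional approximation rather than the closed-form first variation $F'(p)$ derived in the paper; the step size and iteration count are illustrative and would need tuning in practice.

```python
import jax

# Particle update: x_i <- x_i - gamma * grad of the (approximate) objective w.r.t. x_i.
def descend_particles(x0, y, omega, b, alpha=0.1, step=0.1, n_steps=100):
    grad_fn = jax.grad(lambda pts: kkl_alpha(pts, y, omega, b, alpha))
    pts = x0
    for _ in range(n_steps):
        pts = pts - step * grad_fn(pts)
    return pts

x_new = descend_particles(x, y, omega, b)
print("before:", float(kkl_alpha(x, y, omega, b)),
      " after:", float(kkl_alpha(x_new, y, omega, b)))
```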

Efficient implementation recommendations:

  • Precompute Gram matrices and diagonalize $K$ at each iteration.
  • For moderate $n$, the computational cost is dominated by the eigendecomposition.
  • Choosing $\alpha \gg 1/\sqrt{n}$ controls the variance.
  • Quasi-Newton methods (L-BFGS) and automatic differentiation can accelerate convergence and gradient computation (Chazal et al., 2024).

5. Nonlinear Contracting Regularization in Observer Design

In nonlinear observer theory, regularized KKL frameworks refer to the replacement of linear filters with nonlinear, contracting dynamics, built from scalar contraction kernels, to improve robustness and convergence. Here, the system is assumed strongly differentially observable; the observer state $z\in\mathbb{R}^m$ evolves via

$$\dot{z} = k\,\Sigma(z, y) \quad \text{with} \quad \Sigma(z, y) = \big[\lambda_0\, \sigma(z_0, y),\ \ldots,\ \lambda_{m-1}\, \sigma(z_{m-1}, y)\big]^\top$$

for $k>0$ and a contraction kernel $\sigma$ ensuring strong contraction in $z$.

The key regularization mechanism:

  • The nonlinearity of $\sigma$ is designed so that for large residuals the observer exhibits a fast correction (gain $a_\text{fast}$), while for small errors the response is slow (gain $a_\text{slow}$), improving noise rejection.
  • A concrete form,

$$\sigma(u, y) = a_\text{fast}\,(u - y) + (a_\text{slow} - a_\text{fast})\,\tanh(u - y)$$

achieves this interpolating gain (Pachy et al., 2024).
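A quick numerical check of this interpolating gain is sketched below: the local slope of $\sigma$ is close to $a_\text{slow}$ for small residuals and approaches $a_\text{fast}$ for large ones. The gain values are hypothetical placeholders, not those used by Pachy et al. (2024).

```python
import jax
import jax.numpy as jnp

a_fast, a_slow = 10.0, 1.0   # hypothetical fast/slow gains (not values from the paper)

def sigma(u, y):
    """Contraction nonlinearity: a_fast*(u - y) + (a_slow - a_fast)*tanh(u - y)."""
    e = u - y
    return a_fast * e + (a_slow - a_fast) * jnp.tanh(e)

# Local slope d sigma / d u of the correction term:
dsigma = jax.grad(sigma, argnums=0)
print(float(dsigma(0.01, 0.0)))   # ~ a_slow: gentle response to small (noise-level) errors
print(float(dsigma(5.00, 0.0)))   # ~ a_fast: aggressive correction of large residuals
```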

6. Numerical Illustration and Practical Considerations

Demonstrations on the Duffing oscillator confirm that regularized (nonlinear) KKL observers combine (i) convergence speeds superior to slow-linear designs and (ii) noise sensitivity close to slow-linear but much improved over fast-linear filters. For comparison, mean convergence times for error reduction were 0.83s (fast-linear KKL), 6.79s (slow-linear), 2.27s (nonlinear KKL); mean noise gain factors were 7.57 (fast-linear), 1.15 (slow-linear), 1.95 (nonlinear).

Recommendations for practical use include tuning $\alpha$ or the contraction gains to match data or plant characteristics, precomputing required operator quantities, and using offline calculations or gradient approaches appropriate to the available computational resources (Chazal et al., 2024, Pachy et al., 2024).

7. Connections and Distinctive Features

Regularized KKL divergences and observers provide a unified approach to addressing limitations of both classical and kernel-based divergences in statistics and observer/filter designs in control:

  • In statistical learning, they address support mismatch and variance explosion, ensuring well-posedness for discrete, empirical, or non-overlapping distributions.
  • In nonlinear filtering, regularized nonlinear dynamics allow simultaneous acceleration of convergence and attenuation of noise, unattainable with linear gain choices.
  • Both frameworks exploit the smoothing/regularizing effects of convex mixing (in the divergence case) or nonlinear contraction (in the observer case), and can be implemented efficiently for finite samples or state dimensions.

The development of these regularized KKL frameworks has established robust methodologies for both statistical comparison of probability distributions and the design of nonlinear observers, with the dual effect of enhancing both theoretical properties and practical applicability (Chazal et al., 2024, Pachy et al., 2024).
