Sinkhorn Divergence: OT, Geometry & Applications
- Sinkhorn divergence is a debiased entropic-regularized optimal transport cost that accurately compares probability measures while vanishing when a measure is compared with itself.
- It induces a rigorous Riemannian metric on measure spaces, yielding explicit geodesic equations and facilitating gradient flow analysis.
- Efficient computation via Sinkhorn-Knopp iterations underpins its applications in generative modeling, domain adaptation, and statistical registration.
The Sinkhorn divergence is a geometric functional for comparing probability measures, derived by debiasing the entropic-regularized optimal transport (OT) cost. It interpolates between the Wasserstein metric and kernel-based discrepancies such as Maximum Mean Discrepancy (MMD), and is central to modern computational optimal transport, statistical inference, and machine learning. The Sinkhorn divergence admits a rigorous Riemannian differential structure, explicit geometric and metric properties, efficient computational algorithms, and deep connections to RKHS theory, convex analysis, and gradient flows.
1. Mathematical Definition and Debiasing Principle
Given a compact metric space $X$, probability measures $\mu, \nu \in \mathcal{P}(X)$, a continuous cost $c : X \times X \to \mathbb{R}$, and an entropic regularization parameter $\varepsilon > 0$, the classical entropic OT cost is
$$\mathrm{OT}_\varepsilon(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} c \, d\pi \;+\; \varepsilon \, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu),$$
where $\Pi(\mu, \nu)$ is the set of couplings with marginals $\mu$ and $\nu$.
This regularized cost does not vanish on the diagonal: $\mathrm{OT}_\varepsilon(\mu, \mu) \neq 0$ in general, so it is not a distance. The Sinkhorn divergence is obtained via debiasing:
$$S_\varepsilon(\mu, \nu) \;=\; \mathrm{OT}_\varepsilon(\mu, \nu) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\mu, \mu) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\nu, \nu).$$
This correction ensures $S_\varepsilon(\mu, \nu) \ge 0$ and $S_\varepsilon(\mu, \mu) = 0$, with $S_\varepsilon(\mu, \nu) = 0$ if and only if $\mu = \nu$ under universal kernel assumptions (Lavenant et al., 2024).
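The debiasing construction is easy to verify numerically. A minimal NumPy sketch with a squared Euclidean cost on the line, assuming the KL-regularized formulation of the entropic cost (all function names are illustrative, not from any cited implementation):

```python
import numpy as np

def entropic_ot(a, x, b, y, eps=0.1, n_iter=2000):
    """Entropic OT cost between discrete measures (a, x) and (b, y),
    with squared Euclidean cost and KL(pi || a (x) b) regularization."""
    C = (x[:, None] - y[None, :]) ** 2            # cost matrix
    K = np.exp(-C / eps)                          # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                       # Sinkhorn-Knopp scaling
        u = 1.0 / (K @ (b * v))
        v = 1.0 / (K.T @ (a * u))
    f, g = eps * np.log(u), eps * np.log(v)       # dual potentials
    return a @ f + b @ g                          # optimal dual value

def sinkhorn_divergence(a, x, b, y, eps=0.1):
    """Debiased divergence S = OT(mu,nu) - OT(mu,mu)/2 - OT(nu,nu)/2."""
    return (entropic_ot(a, x, b, y, eps)
            - 0.5 * entropic_ot(a, x, a, x, eps)
            - 0.5 * entropic_ot(b, y, b, y, eps))

x = np.array([0.0, 1.0]); a = np.array([0.5, 0.5])
y = np.array([0.2, 1.2]); b = np.array([0.5, 0.5])
print(entropic_ot(a, x, a, x))           # nonzero: OT_eps does not vanish on the diagonal
print(sinkhorn_divergence(a, x, a, x))   # exactly 0: debiasing restores self-similarity
print(sinkhorn_divergence(a, x, b, y))   # positive for distinct measures
```

The entropic cost on the diagonal is strictly positive, while the debiased divergence vanishes there, which is exactly the point of the correction.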
2. Geometric Structure and Riemannian Geometry
The Sinkhorn divergence induces a Riemannian metric structure on the space of measures, obtained by differentiating $S_\varepsilon$ with respect to mass perturbations. For a path $t \mapsto \mu_t$ whose tangent vectors $\dot\mu_t$ have zero total mass, the metric tensor at $\mu$ is a quadratic form in $\dot\mu_t$ built from integral operators defined via the Schrödinger dual potentials and the entropic kernel $k_\varepsilon(x, y) = e^{-c(x, y)/\varepsilon}$. This metric "lifts" the geometry of $X$ to the measure space and canonically identifies the tangent space at $\mu$ with the dual of an RKHS (Lavenant et al., 2024).
By charting the space via a fixed RKHS $\mathcal{H}$ (associated to the kernel $k_\varepsilon$), the geometry becomes tractable and all tangent spaces embed smoothly into $\mathcal{H}$. The geodesic equations can be written explicitly in these coordinates.
3. Metric, Topological, and Convexity Properties
Distance and topology: The geodesic distance $d_\varepsilon$ induced by energy minimization of the Riemannian structure satisfies symmetry, positivity, and the triangle inequality, and metrizes the weak-* topology on $\mathcal{P}(X)$. Explicit two-sided comparison estimates relate $d_\varepsilon$ to the Sinkhorn divergence itself, with quantitative constants and comparison maps (Lavenant et al., 2024).
Non-metricity of $S_\varepsilon$: The square root $\sqrt{S_\varepsilon}$ does not satisfy the triangle inequality. Explicit counterexamples (e.g., mixtures of Diracs or Gaussians) demonstrate failure of the metric property for any power $S_\varepsilon^\alpha$ with $\alpha > 0$ (Lavenant et al., 2024).
Convexity: $S_\varepsilon$ is not jointly convex in $(\mu, \nu)$: its Hessian can be indefinite, as established via two-point examples and asymptotic arguments. This distinguishes it from the classical Wasserstein distance, which is jointly convex (Lavenant et al., 2024).
Smoothness: The divergence is infinitely differentiable in the interior of the probability simplex, which benefits optimization and statistical procedures (Lara et al., 2022, Lavenant et al., 2024).
4. Computation, Algorithmics, and Scaling
Sinkhorn-Knopp Iterations: The functional is computed efficiently using the Sinkhorn-Knopp scaling algorithm. For discrete measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$, $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ with cost matrix $C_{ij} = c(x_i, y_j)$ and Gibbs kernel $K = e^{-C/\varepsilon}$ (entrywise), the dual scaling vectors are obtained via alternating updates
$$u \leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^\top u),$$
and the optimal coupling is $\pi = \mathrm{diag}(u)\, K \,\mathrm{diag}(v)$. The computational complexity per iteration is $O(nm)$, but can be reduced using positive feature decompositions, RKHS approaches, or hierarchical matrix representations for structured problems (Scetbon et al., 2020, Motamed, 2020).
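A compact NumPy sketch of these scaling iterations, returning the coupling and checking its marginals (illustrative code, not a production implementation):

```python
import numpy as np

def sinkhorn_knopp(a, b, C, eps, n_iter=1000):
    """Sinkhorn-Knopp scaling: returns the entropic optimal coupling
    pi = diag(u) K diag(v) with marginals a and b."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                   # enforce row marginals
        v = b / (K.T @ u)                 # enforce column marginals
    return u[:, None] * K * v[None, :]    # optimal coupling

rng = np.random.default_rng(0)
n, m = 5, 7
x, y = rng.normal(size=n), rng.normal(size=m)
a, b = np.full(n, 1 / n), np.full(m, 1 / m)
C = (x[:, None] - y[None, :]) ** 2
pi = sinkhorn_knopp(a, b, C, eps=0.5)
print(np.allclose(pi.sum(1), a), np.allclose(pi.sum(0), b))  # marginals recovered
```

Note that the last update enforces the column marginal exactly; the row marginal is satisfied up to the convergence tolerance of the iteration.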
Automatic Differentiation: The smooth dependence of $S_\varepsilon$ on the input measures permits efficient gradient computation by backpropagating through the fixed-point iterations (Genevay et al., 2017).
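Besides backpropagation, the gradient has a closed form at the fixed point: by the envelope theorem, the derivative of $\mathrm{OT}_\varepsilon$ with respect to the source weights is the dual potential $f$, up to an additive constant. A hedged NumPy sketch checking this against a central finite difference along a mass-preserving direction (names illustrative):

```python
import numpy as np

def sinkhorn_potentials(a, b, C, eps, n_iter=5000):
    """Dual potentials (f, g) of entropic OT with KL(pi || a (x) b) regularization."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = 1.0 / (K @ (b * v))
        v = 1.0 / (K.T @ (a * u))
    return eps * np.log(u), eps * np.log(v)

def ot_eps(a, b, C, eps):
    f, g = sinkhorn_potentials(a, b, C, eps)
    return a @ f + b @ g                  # optimal dual value

rng = np.random.default_rng(1)
n = 6
C = (rng.normal(size=n)[:, None] - rng.normal(size=n)[None, :]) ** 2
a = np.full(n, 1 / n); b = np.full(n, 1 / n)
eps = 0.5

# Envelope theorem: grad_a OT_eps = f up to an additive constant, so the
# directional derivative along the zero-sum direction e_0 - e_1 is f[0] - f[1].
f, _ = sinkhorn_potentials(a, b, C, eps)
d = np.zeros(n); d[0], d[1] = 1.0, -1.0   # mass-preserving perturbation
h = 1e-5
fd = (ot_eps(a + h * d, b, C, eps) - ot_eps(a - h * d, b, C, eps)) / (2 * h)
print(fd, f[0] - f[1])                    # the two should agree closely
```

Differentiating along a zero-sum direction sidesteps both the simplex constraint and the additive-constant ambiguity of the potentials.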
Statistical Efficiency: The sample complexity and convergence rates interpolate between $O(n^{-1/2})$ in the MMD regime (large $\varepsilon$) and the dimension-dependent $O(n^{-1/d})$ rate of classical OT (small $\varepsilon$) (Genevay et al., 2017, Chizat et al., 2020).
Large-scale/linear-time approximations: For large $n$, positive feature maps enable $O(n)$ scaling per iteration without forming explicit full kernel matrices (Scetbon et al., 2020). Hierarchical low-rank methods can further reduce runtimes to near-linear complexity under Kronecker or block-structure cost assumptions (Motamed, 2020).
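A sketch of the positive-feature idea for a squared Euclidean cost, using exponential (Performer-style) random features so that the Gibbs kernel is approximated by $\Phi_x \Phi_y^\top$ and one scaling update costs $O(nr)$; the feature construction and all names here are illustrative assumptions, not the exact construction of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, eps = 40, 2, 8000, 2.0
X = 0.2 * rng.normal(size=(n, d))         # small point cloud keeps feature variance low
Y = 0.2 * rng.normal(size=(n, d)) + 0.1

def positive_features(P, omegas, eps):
    """Positive random features phi with E[phi(x) @ phi(y)] =
    exp(-|x - y|^2 / eps), the Gibbs kernel for squared Euclidean cost."""
    Z = P * np.sqrt(2.0 / eps)            # rescale to a unit-bandwidth Gaussian kernel
    sq = (Z ** 2).sum(axis=1, keepdims=True)
    return np.exp(Z @ omegas.T - sq) / np.sqrt(omegas.shape[0])

omegas = rng.normal(size=(r, d))
Phi_x = positive_features(X, omegas, eps)     # (n, r)
Phi_y = positive_features(Y, omegas, eps)     # (n, r)

# One Sinkhorn update in O(n r) via factorized matrix-vector products,
# never forming the n x n kernel matrix explicitly.
a = b = np.full(n, 1 / n)
v = np.ones(n)
u = a / (Phi_x @ (Phi_y.T @ v))               # u = a / (K v) with K ~ Phi_x Phi_y^T

# Sanity check: the feature expansion approximates the exact Gibbs kernel.
K_exact = np.exp(-((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1) / eps)
K_approx = Phi_x @ Phi_y.T
print(np.abs(K_approx - K_exact).max())
```

Because the features are strictly positive, every intermediate vector in the scaling loop stays positive, which is what makes this factorization usable inside Sinkhorn iterations (unlike signed random Fourier features).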
5. Applications and Practical Implementations
Sinkhorn divergence is used as a loss in generative modeling (GAN training), domain adaptation, deep feature alignment, and statistical registration problems.
- Generative Models: Sinkhorn-based losses are integrated in adversarial training to enable robust optimization over distributions. The Sinkhorn Natural Gradient (SiNG) algorithm uses the explicit Riemannian structure for efficient optimization over parameterized measure families (Shen et al., 2020).
- Statistical Registration: In diffeomorphic image or shape registration, Sinkhorn divergences offer smoother losses with fewer spurious minima than classical OT, and admit automatic differentiation and statistical consistency guarantees (Lara et al., 2022).
- Deep Representation Learning: The divergence allows stack-wise alignment in deep regression architectures, e.g., N-BEATS, and provides thermostat-like regularization in feature space across domains (Lee et al., 2023).
- Coreset Compression: Convexly-weighted coresets for Sinkhorn divergence can be constructed via second-order functional expansions, reducing the problem to MMD minimization in the RKHS induced by the divergence’s Hessian. This enables selection of representative points, critical for scaling up Sinkhorn computations (Kokot et al., 28 Apr 2025).
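As an illustration of the reduction to MMD minimization, the following sketch greedily selects a coreset by kernel herding in an RKHS, here with an ordinary Gaussian kernel standing in for the Hessian-induced kernel of the cited construction (all names and parameters are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Y, bw=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def mmd2(X, S, bw=1.0):
    """Squared MMD between empirical measures on X (full set) and S (coreset)."""
    return (gaussian_kernel(X, X, bw).mean()
            - 2 * gaussian_kernel(X, S, bw).mean()
            + gaussian_kernel(S, S, bw).mean())

def herding_coreset(X, m, bw=1.0):
    """Greedy kernel-herding selection of m points approximating X in MMD."""
    n = len(X)
    Kxx = gaussian_kernel(X, X, bw)
    mean_embed = Kxx.mean(axis=0)           # <mu_P, k(x_j, .)> for every candidate
    chosen, score = [], np.zeros(n)
    for t in range(m):
        obj = mean_embed - score / (t + 1)  # herding criterion
        i = int(np.argmax(obj))
        chosen.append(i)
        score += Kxx[:, i]                  # running sum of k(x_s, .)
    return X[chosen]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
S = herding_coreset(X, m=20)
print(mmd2(X, S), mmd2(X, S[:1]))           # full coreset vs its first point alone
```

The herded 20-point set drives the (biased, hence nonnegative) MMD estimate well below what any single representative point achieves.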
6. Generalizations, Gradient Flows, and Theoretical Extensions
Gradient Flows: Substituting the Wasserstein metric with the Sinkhorn divergence in the JKO (minimizing movement) scheme defines a new gradient flow on measure space. The evolution is best understood after a change of variables into the RKHS associated with the entropic kernel, where it becomes a monotone operator evolution. This flow is well-posed, dissipative, and non-expansive in the RKHS norm, and converges globally to minimizers of the potential energy. Notably, it regularizes mass-splitting singularities and can overcome potential barriers that would trap classical Wasserstein flows (Hardion et al., 18 Nov 2025).
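A schematic form of this minimizing-movement step, written here with an energy functional $\mathcal{F}$ and step size $\tau > 0$ (notation assumed for illustration, not taken verbatim from the cited work):

```latex
\mu_{k+1} \in \operatorname*{arg\,min}_{\mu \in \mathcal{P}(X)}
  \; \mathcal{F}(\mu) + \frac{1}{2\tau}\, S_\varepsilon(\mu, \mu_k)
```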
Drift Field Interpretation: In generative dynamics, the Sinkhorn divergence induces a drift vector field decomposed into cross (model-to-target) and self-repulsion terms, both computed using entropic couplings via two-sided Sinkhorn scaling. This structure robustifies parameter updates, ensures identifiability (zero drift iff measures agree), and improves empirical stability over one-sided "drifting" dynamics (He et al., 12 Mar 2026).
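For an empirical model measure $\mu = \frac{1}{n}\sum_i \delta_{x_i}$ and fixed target $\nu$, the cross/self decomposition follows directly from the debiasing formula: differentiating with respect to a particle position gives (a sketch consistent with the definition of $S_\varepsilon$ in Section 1)

```latex
\nabla_{x_i} S_\varepsilon(\mu, \nu)
  = \underbrace{\nabla_{x_i}\, \mathrm{OT}_\varepsilon(\mu, \nu)}_{\text{cross (model-to-target)}}
  \;-\; \underbrace{\tfrac{1}{2}\, \nabla_{x_i}\, \mathrm{OT}_\varepsilon(\mu, \mu)}_{\text{self-repulsion}}
```

so the induced drift $-\nabla_{x_i} S_\varepsilon(\mu, \nu)$ vanishes exactly when the cross and self terms balance.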
Unbalanced and $\varphi$-Divergence Regularized OT: The Sinkhorn divergence framework extends to unbalanced transport (measures with differing total mass) by relaxing the marginal constraints via Csiszár $\varphi$-divergences (Séjourné et al., 2019). The generalized Sinkhorn divergence incorporates arbitrary smooth convex penalty functions, leading to a family of divergences and dual algorithms tailored to application constraints on sparsity and statistical robustness (Terjék et al., 2021).
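A hedged sketch of the unbalanced objective, with $D_\varphi$ a Csiszár divergence penalizing the marginals $\pi_1, \pi_2$ of $\pi$ and $\rho > 0$ a relaxation strength (notation assumed for illustration):

```latex
\mathrm{OT}_{\varepsilon,\rho}(\mu, \nu)
  = \min_{\pi \ge 0} \int c \, d\pi
  + \varepsilon \, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu)
  + \rho \, D_\varphi(\pi_1 \,\|\, \mu)
  + \rho \, D_\varphi(\pi_2 \,\|\, \nu)
```

Taking $D_\varphi$ to be the convex indicator of equality recovers the balanced formulation of Section 1.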
Asymptotic Distributions and Bootstrap Theory: The Sinkhorn divergence admits Hadamard differentiability with respect to the input measures, enabling a precise central limit theorem for empirical estimators. Limit distributions (under the null and under alternatives), bootstrap consistency, and asymptotic minimax efficiency are derived, providing a full statistical toolkit for hypothesis testing, confidence intervals, and sample complexity controls (Lara et al., 2022, Goldfeld et al., 2022, Bercu et al., 2018).
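In the one-sample case under an alternative ($\mu \neq \nu$), the central limit theorem takes the following schematic form, with $\hat\mu_n$ the empirical measure of $n$ i.i.d. samples and $\sigma^2$ an asymptotic variance determined by the dual potentials (a sketch; see the cited works for precise moment and regularity conditions):

```latex
\sqrt{n}\,\bigl(S_\varepsilon(\hat\mu_n, \nu) - S_\varepsilon(\mu, \nu)\bigr)
  \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2)
```

Under the null $\mu = \nu$ the limit degenerates and a different (non-Gaussian) scaling applies, which is why bootstrap consistency is needed for practical testing.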
7. Limitations and Negative Results
Although the Sinkhorn divergence is strictly positive definite and metrizes weak-* convergence, it is not jointly convex and its square root is not a metric, failing the triangle inequality (Lavenant et al., 2024). These features distinguish it structurally from classical Wasserstein distances and require caution when interpreting it as a metric or deploying it in convex optimization frameworks.
The Sinkhorn divergence serves as a foundational object in computational optimal transport and geometric statistics, combining computational tractability, rich geometric structure, and broad applicability in high-dimensional data science tasks (Lavenant et al., 2024, Feydy et al., 2018, Genevay et al., 2017, Shen et al., 2020).