Sinkhorn-Approximated Losses

Updated 17 November 2025
  • Sinkhorn-approximated losses are a class of optimal transport-based loss functions that use entropic regularization to produce differentiable and statistically robust metrics.
  • They leverage fixed-point Sinkhorn iterations to efficiently approximate the regularized OT cost, ensuring convergence of the dual potentials under marginal constraints.
  • Advanced variants extend this framework to online, partial transport, and generalized settings, enabling applications in generative modeling, robust optimization, and distributionally robust learning.

Sinkhorn-approximated losses are a class of optimal transport-based loss functions incorporating entropic regularization to achieve tractable, differentiable, and statistically robust optimization over probability distributions. These losses interpolate between optimal transport (OT) metrics such as Wasserstein distance and kernel-based alternatives like maximum mean discrepancy (MMD), and form the computational backbone of modern generative modeling, robust optimization, and related machine learning frameworks.

1. Mathematical Foundations of Sinkhorn Losses

Sinkhorn-approximated losses originate from the regularized OT problem. Given two probability measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$, and a cost matrix $C \in \mathbb{R}^{n\times m}$ with entries $C_{ij} = c(x_i, y_j)$, the entropic OT cost is

$$W_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(a, b)} \sum_{i,j} C_{ij} \pi_{ij} + \varepsilon \sum_{i,j} \pi_{ij}(\log \pi_{ij} - 1),$$

where $\Pi(a, b) = \{ \pi \in \mathbb{R}_+^{n \times m} : \pi 1_m = a,\ \pi^T 1_n = b \}$ is the set of admissible couplings.
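
For concreteness, the discrete inputs $(a, b, C)$ can be assembled directly from two point clouds. The NumPy sketch below assumes uniform weights and a squared Euclidean ground cost, both common but optional modelling choices; the helper name is purely illustrative.

```python
import numpy as np

def discrete_ot_inputs(X, Y):
    """Build uniform weights a, b and a squared-Euclidean cost C for two point clouds.

    X: (n, d) support points of mu, Y: (m, d) support points of nu.
    """
    n, m = X.shape[0], Y.shape[0]
    a = np.full(n, 1.0 / n)          # uniform weights a_i
    b = np.full(m, 1.0 / m)          # uniform weights b_j
    # C_ij = ||x_i - y_j||^2, computed without explicit loops
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return a, b, C

rng = np.random.default_rng(0)
a, b, C = discrete_ot_inputs(rng.normal(size=(5, 2)), rng.normal(size=(7, 2)))
```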

The Sinkhorn divergence, used to debias entropic OT and ensure metric properties, is

$$S_\varepsilon(\mu, \nu) = W_\varepsilon(\mu, \nu) - \frac{1}{2} W_\varepsilon(\mu, \mu) - \frac{1}{2} W_\varepsilon(\nu, \nu).$$

Tuning $\varepsilon$ interpolates between the geometric OT regime ($\varepsilon \to 0$) and the MMD regime ($\varepsilon \to \infty$), where the coupling approaches the independent product $\mu \otimes \nu$ and $S_\varepsilon$ degenerates to an energy or kernel-based divergence.

2. Computational Algorithms: Sinkhorn Iterations

The central computational procedure for evaluating Sinkhorn losses is a fixed-point iteration referred to as the Sinkhorn algorithm. Defining the Gibbs kernel $K_{ij} = \exp(-C_{ij}/\varepsilon)$, the optimal plan is sought in the form $\pi = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$, with the marginal constraints enforced by the iterative updates

$$u \leftarrow \frac{a}{K v}, \qquad v \leftarrow \frac{b}{K^T u},$$

where all vector divisions are element-wise. After $L$ iterations, the coupling is $\pi^{(L)} = \mathrm{diag}(u^{(L)})\, K\, \mathrm{diag}(v^{(L)})$, yielding an approximate cost $W_\varepsilon(\mu, \nu) \approx \sum_{i,j} C_{ij} \pi^{(L)}_{ij}$ (Genevay et al., 2017). This scheme admits robust implementation on modern GPUs with batch sizes $n, m$ in the range 128–512 and iteration count $L$ typically 10–50 for $\varepsilon \geq 0.1$.
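
The following NumPy sketch implements these updates directly; the function names are illustrative, the returned cost keeps only the transport term $\sum_{i,j} C_{ij}\pi^{(L)}_{ij}$ as in the approximation above, and the debiasing of Section 1 is obtained by reusing the same solver on the self-cost matrices.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, n_iter=50):
    """Approximate W_eps(mu, nu) with L = n_iter fixed-point Sinkhorn iterations."""
    K = np.exp(-C / eps)                      # Gibbs kernel K_ij = exp(-C_ij / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                   # element-wise divisions, as in the updates above
        u = a / (K @ v)
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]          # pi^(L) = diag(u) K diag(v)
    return float((C * pi).sum()), pi          # transport part of W_eps, plus the coupling

def sinkhorn_divergence(a, b, C_ab, C_aa, C_bb, eps=0.1, n_iter=50):
    """Debiased Sinkhorn divergence S_eps of Section 1, reusing the same solver."""
    w_ab, _ = sinkhorn_cost(a, b, C_ab, eps, n_iter)
    w_aa, _ = sinkhorn_cost(a, a, C_aa, eps, n_iter)
    w_bb, _ = sinkhorn_cost(b, b, C_bb, eps, n_iter)
    return w_ab - 0.5 * w_aa - 0.5 * w_bb
```

For small $\varepsilon$ the kernel $K$ underflows in double precision, which is why log-domain variants (see Section 7) are preferred in practice.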

For generalized OT problems including partial transport or unbalanced couplings, the Sinkhorn framework adapts via proximal/divide updates and clipping/min operations on the dual scalings $u, v$ to enforce penalized marginal constraints (Bai, 9 Jul 2024).
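
As one hedged illustration of such clipping: if destroying a unit of $\mu$-mass costs $\lambda_1$ and creating a unit of $\nu$-mass costs $\lambda_2$, capping the dual potentials at these penalties corresponds, in the scaling domain, to a min against $e^{\lambda_k/\varepsilon}$. The sketch below shows that generic mechanism and is not the exact update rule of Bai (9 Jul 2024); the function name and defaults are illustrative.

```python
import numpy as np

def sinkhorn_with_mass_penalties(a, b, C, eps=0.1, lam1=1.0, lam2=1.0, n_iter=100):
    """Schematic Sinkhorn-type iteration with penalized (relaxed) marginals.

    Capping the scalings u, v at exp(lam/eps) caps the dual potentials at the per-unit
    penalties lam1 (mass destruction) and lam2 (mass creation), so the plan may leave
    some mass untransported. Illustrative sketch only.
    """
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    cap_u, cap_v = np.exp(lam1 / eps), np.exp(lam2 / eps)
    for _ in range(n_iter):
        u = np.minimum(a / (K @ v), cap_u)    # clip instead of enforcing pi 1_m = a exactly
        v = np.minimum(b / (K.T @ u), cap_v)  # clip instead of enforcing pi^T 1_n = b exactly
    pi = u[:, None] * K * v[None, :]
    return pi                                  # marginals of pi are relaxed, not matched exactly
```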

3. Differentiability, Gradient Computation, and Implicit Differentiation

The entropic regularization confers infinite differentiability ($C^\infty$) on both the Sinkhorn cost and the Sinkhorn divergence in the interior of the simplex (Luise et al., 2018). Efficient gradient computation leverages either (i) backpropagation through unrolled Sinkhorn iterations, or (ii) implicit differentiation of the KKT conditions

$$\varepsilon \log P^* + C + u 1_m^T + 1_n v^T = 0, \qquad P^* 1_m = a, \quad (P^*)^T 1_n = b,$$

where $P^*$ is the optimal plan and $(u, v)$ are the dual potentials. Solving the resulting sparse linear system yields vector-Jacobian products for the analytic gradients $\frac{\partial \ell}{\partial C}, \frac{\partial \ell}{\partial a}, \frac{\partial \ell}{\partial b}$ with provable error bounds (Eisenberger et al., 2022). Implicit differentiation is advantageous for large $L$ or $n, m$, providing superior memory efficiency over unrolled automatic differentiation (AD).
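
A lighter-weight alternative to the full implicit scheme uses the envelope theorem: since $P^*$ minimizes the regularized objective, $\partial W_\varepsilon / \partial C = P^*$. The hedged PyTorch sketch below exploits this to run the solver outside the autograd graph and return only the plan in the backward pass; the marginals are treated as constants (unlike Eisenberger et al., 2022) and the class name is illustrative.

```python
import torch

class EntropicOTCost(torch.autograd.Function):
    """Entropic OT value W_eps with an envelope-theorem backward pass.

    Forward: run Sinkhorn outside the autograd graph to obtain the plan pi*.
    Backward: dW_eps/dC = pi* (the marginals a, b are treated as constants here).
    """

    @staticmethod
    def forward(ctx, C, a, b, eps, n_iter):
        with torch.no_grad():
            K = torch.exp(-C / eps)
            u, v = torch.ones_like(a), torch.ones_like(b)
            for _ in range(n_iter):
                u = a / (K @ v)
                v = b / (K.T @ u)
            pi = u[:, None] * K * v[None, :]
            # full regularized value, including the entropic term
            w = (C * pi).sum() + eps * (pi * (torch.log(pi + 1e-30) - 1.0)).sum()
        ctx.save_for_backward(pi)
        return w

    @staticmethod
    def backward(ctx, grad_out):
        (pi,) = ctx.saved_tensors
        # one gradient per forward input: C, a, b, eps, n_iter
        return grad_out * pi, None, None, None, None
```

Calling `EntropicOTCost.apply(C, a, b, 0.1, 100)` on a cost matrix built from learnable features then lets `backward()` push $\pi^*$-weighted gradients into those features without storing any solver iterates.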

4. Statistical and Learning-Theoretic Properties

Entropic regularization produces strictly convex transport losses with improved sample complexity and variance. For small $\varepsilon$, the OT bias persists with the curse of dimensionality ($O(n^{-1/d})$); for large $\varepsilon$, the Sinkhorn loss exhibits MMD-like sample complexity ($O(n^{-1/2})$) (Genevay et al., 2017). In supervised learning, the sharp Sinkhorn loss ($S_\lambda$ without the explicit entropy penalty in the final cost) guarantees universal consistency and fast excess-risk convergence rates under standard RKHS assumptions (Luise et al., 2018).

Recent work verifies second-order Hadamard differentiability of Sinkhorn divergences, facilitating local quadratic approximations and enabling rigorous coreset construction (Kokot et al., 28 Apr 2025). This functional smoothness underpins compressed representations and efficient subsampling schemes.

5. Advanced Variants: Online, Generalized, and Partial Transport

Algorithms such as Online Sinkhorn (Mensch et al., 2020) permit stochastic streaming estimation of Sinkhorn-approximated losses, maintaining non-parametric mixture representations of the scaling potentials $(u_t, v_t)$:

  • New sample batches update the mixture weights via stochastic approximation (Robbins–Monro steps), as in the sketch after this list,
  • Theoretical guarantees yield near-optimal $O(1/\sqrt{N})$ error rates for $N$ samples.
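
The sketch below is a schematic NumPy rendering of this idea: the scalings are stored as mixtures over the samples seen so far and mixed with Robbins–Monro steps $\eta_t = 1/\sqrt{t}$. The helper names, step-size schedule, squared Euclidean cost, and bookkeeping are simplifying assumptions rather than the exact algorithm of Mensch et al. (2020); a log-domain version would be preferred in practice.

```python
import numpy as np

def cost(X, Y):
    """Squared Euclidean cost between two point clouds."""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

def online_sinkhorn(sample_mu, sample_nu, eps=0.5, batch=64, n_steps=200, d=2):
    """Schematic online Sinkhorn: potentials kept as growing mixtures over seen samples.

    sample_mu(B) / sample_nu(B) return (B, d) arrays of fresh samples.
    exp(-g(y)/eps) is represented as sum_k alpha_k exp(-C(x_k, y)/eps), and symmetrically
    exp(-f(x)/eps) as sum_l beta_l exp(-C(x, y_l)/eps). Illustrative sketch only.
    """
    Xs, alpha = np.empty((0, d)), np.empty(0)   # support and weights for exp(-g/eps)
    Ys, beta = np.empty((0, d)), np.empty(0)    # support and weights for exp(-f/eps)

    def f_at(x):  # f(x) = -eps * log sum_l beta_l exp(-C(x, y_l)/eps); f = 0 before any data
        if len(beta) == 0:
            return np.zeros(len(x))
        return -eps * np.log(np.exp(-cost(x, Ys) / eps) @ beta)

    def g_at(y):
        if len(alpha) == 0:
            return np.zeros(len(y))
        return -eps * np.log(np.exp(-cost(Xs, y) / eps).T @ alpha)

    for t in range(1, n_steps + 1):
        eta = 1.0 / np.sqrt(t)                  # Robbins-Monro step size (one common choice)
        xb, yb = sample_mu(batch), sample_nu(batch)
        alpha = np.concatenate([(1.0 - eta) * alpha, eta * np.exp(f_at(xb) / eps) / batch])
        Xs = np.vstack([Xs, xb])
        beta = np.concatenate([(1.0 - eta) * beta, eta * np.exp(g_at(yb) / eps) / batch])
        Ys = np.vstack([Ys, yb])
    return f_at, g_at

rng = np.random.default_rng(0)
f_hat, g_hat = online_sinkhorn(lambda B: rng.normal(size=(B, 2)),
                               lambda B: 1.0 + rng.normal(size=(B, 2)))
```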

Generalized and partial transport models (GOPT) introduce penalty functions $\lambda_1, \lambda_2$ for mass destruction/creation and enable coupled clipping/min operations inside the Sinkhorn iterates. This confers flexibility over the balanced/unbalanced spectrum by adjusting mass constraints in the primal and dual (Bai, 9 Jul 2024).

Convex regularization beyond Shannon entropy is accommodated in generalized Sinkhorn frameworks (Marino et al., 2020), including regularizers $\Phi$ (e.g. Tsallis, quadratic), with each instantiation yielding a corresponding dual, complementary slackness conditions, and IPFP-type iterative scaling.
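
In this generalized setting, the primal problem can be summarized (with the notation $W_\varepsilon^{\Phi}$ chosen here for exposition) as

$$W_\varepsilon^{\Phi}(\mu, \nu) = \min_{\pi \in \Pi(a, b)} \sum_{i,j} C_{ij}\,\pi_{ij} + \varepsilon \sum_{i,j} \Phi(\pi_{ij}),$$

where the Shannon choice $\Phi(t) = t(\log t - 1)$ recovers $W_\varepsilon$ from Section 1; the specific dual, complementary slackness conditions, and scaling updates depend on the chosen $\Phi$ (Marino et al., 2020).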

6. Practical Implementation, Optimization, and Applications

In contemporary deep learning, Sinkhorn layers are integrated end-to-end:

  • Cost networks compute $C_{ij}$ from learned features,
  • Sinkhorn iterations yield the coupling $\pi$ and the regularized OT loss,
  • Backpropagation exploits the differentiability of the matrix/tensor operations (a minimal layer sketch follows this list).
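
A hedged PyTorch sketch of such a layer is shown below; the encoder architecture, uniform marginals, squared-distance cost, and hyperparameters are illustrative choices, and gradients flow by ordinary automatic differentiation through the unrolled loop.

```python
import torch
import torch.nn as nn

class SinkhornLoss(nn.Module):
    """Unrolled Sinkhorn layer: features -> cost matrix -> coupling -> regularized OT loss."""

    def __init__(self, eps=0.5, n_iter=30):
        super().__init__()
        self.eps, self.n_iter = eps, n_iter

    def forward(self, feats_x, feats_y):
        n, m = feats_x.shape[0], feats_y.shape[0]
        a = feats_x.new_full((n,), 1.0 / n)          # uniform marginals (a modelling choice)
        b = feats_y.new_full((m,), 1.0 / m)
        C = torch.cdist(feats_x, feats_y) ** 2       # cost C_ij from learned features
        K = torch.exp(-C / self.eps)
        u, v = torch.ones_like(a), torch.ones_like(b)
        for _ in range(self.n_iter):                 # every op stays in the autograd graph
            u = a / (K @ v + 1e-30)
            v = b / (K.T @ u + 1e-30)
        pi = u[:, None] * K * v[None, :]
        return (C * pi).sum()                        # transport part of the regularized loss

# illustrative usage with a small feature network
encoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
loss_fn = SinkhornLoss(eps=0.5, n_iter=30)
x, y = torch.randn(64, 8), torch.randn(64, 8)
loss = loss_fn(encoder(x), encoder(y))
loss.backward()                                      # gradients flow into the encoder weights
```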

Loss surrogate terms in physics-informed neural networks, generative adversarial nets (GANs), Schrödinger bridges, and robust optimization pipelines use Sinkhorn divergences to enforce distributional constraints or supply differentiable distribution-matching objectives (Genevay et al., 2017, Nodozi et al., 2023, Wang et al., 2021). Distributionally robust optimization (DRO) with Sinkhorn balls replaces the hard Wasserstein supremum by a smooth log-sum-exp, making worst-case training tractable (Wang et al., 2021).
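
As a rough, hedged illustration of that smoothing, the hard supremum over a Sinkhorn ball can be replaced by a log-sum-exp over randomly perturbed copies of each sample. In the sketch below the dual multiplier `lam` is a fixed hyperparameter, the reference perturbations are Gaussian, and the ball-radius term is omitted, so this is not the exact dual formulation of Wang et al. (2021); the function name is illustrative.

```python
import math
import torch

def sinkhorn_dro_objective(loss_fn, x_batch, lam=1.0, eps=0.1, n_pert=16, sigma=0.1):
    """Soft worst-case loss over random perturbations (schematic Sinkhorn-DRO surrogate).

    loss_fn must accept a tensor of shape (n_pert, B, d) and return per-sample losses of
    shape (n_pert, B). Each data point x gets n_pert perturbed copies z; their losses are
    discounted by the transport cost lam * ||z - x||^2 and aggregated with a log-sum-exp
    of temperature lam * eps instead of a hard supremum. Illustrative sketch only.
    """
    z = x_batch.unsqueeze(0) + sigma * torch.randn(n_pert, *x_batch.shape)  # (n_pert, B, d)
    transport = ((z - x_batch.unsqueeze(0)) ** 2).sum(dim=-1)               # c(x, z)
    scores = (loss_fn(z) - lam * transport) / (lam * eps)
    # log-mean-exp over perturbations acts as a smooth surrogate for sup_z
    soft_sup = lam * eps * (torch.logsumexp(scores, dim=0) - math.log(n_pert))
    return soft_sup.mean()                                                  # average over the batch
```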

Specific algorithms, such as CO2-coresets, compress large datasets for regularized Sinkhorn loss minimization via spectral decompositions and MMD-based matching in RKHS, with polylogarithmic coreset size and near-optimal approximation error (Kokot et al., 28 Apr 2025).

7. Computational Complexity, Approximation Strategies, and Scalability

Each Sinkhorn iteration costs $O(nm)$; the total cost is $O(Lnm)$ for $L$ steps. For large $n, m$, strategies include:

  • Screening (Screenkhorn) to selectively freeze negligible dual components (Alaya et al., 2019),
  • Mini-batching to fit cost matrices in memory,
  • Warm-starting dual variables for successive optimization loops (see the sketch after this list),
  • Implicit differentiation for scalable backward passes with fixed memory requirements.
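
A hedged sketch combining two of these points: a log-domain (numerically stabilized) Sinkhorn solver that accepts and returns its dual potentials, so successive solves in an optimization loop can be warm-started from the previous iteration. The function name and interface are illustrative.

```python
import torch

def log_sinkhorn(C, a, b, eps=0.1, n_iter=50, f=None, g=None):
    """Log-domain Sinkhorn with optional warm-started dual potentials f, g.

    Works with potentials f = eps*log(u), g = eps*log(v) to avoid under/overflow of the
    Gibbs kernel for small eps. Returns the coupling and the final potentials, which can
    seed the next solve.
    """
    log_a, log_b = torch.log(a), torch.log(b)
    if f is None:
        f = torch.zeros_like(a)
    if g is None:
        g = torch.zeros_like(b)
    for _ in range(n_iter):
        # log-domain version of u <- a / (K v) and v <- b / (K^T u)
        f = eps * log_a - eps * torch.logsumexp((g[None, :] - C) / eps, dim=1)
        g = eps * log_b - eps * torch.logsumexp((f[:, None] - C) / eps, dim=0)
    log_pi = (f[:, None] + g[None, :] - C) / eps      # pi = diag(u) K diag(v)
    return torch.exp(log_pi), f, g

# warm-starting across an optimization loop (illustrative):
# pi, f, g = log_sinkhorn(C_t, a, b, f=f, g=g)        # reuse potentials from the previous step
```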

Approximation bounds characterize the trade-off between marginal violations and objective error; screened or compressed solves can achieve marginal error below $10^{-3}$ and substantial computational savings given appropriate budgets (Alaya et al., 2019).


Sinkhorn-approximated losses synthesize optimal transport theory, entropic regularization, and iterative matrix scaling into a scalable, robust, and fully differentiable loss framework supporting a wide spectrum of modern statistical, optimization, and learning applications. With explicit control via $\varepsilon$ and extensions to convex regularizers, partial mass regimes, and sample-streaming/distributionally robust pipelines, Sinkhorn divergences constitute a foundational tool for high-dimensional generative modeling, robust learning, and applied optimal transport in both discrete and continuous settings.
