Sinkhorn-Approximated Losses

Updated 17 November 2025
  • Sinkhorn-approximated losses are a class of optimal transport-based loss functions that use entropic regularization to produce differentiable and statistically robust metrics.
  • They leverage fixed-point Sinkhorn iterations to efficiently approximate the regularized OT cost, ensuring convergence of the dual potentials under marginal constraints.
  • Advanced variants extend this framework to online, partial transport, and generalized settings, enabling applications in generative modeling, robust optimization, and distributionally robust learning.

Sinkhorn-approximated losses are a class of optimal transport-based loss functions incorporating entropic regularization to achieve tractable, differentiable, and statistically robust optimization over probability distributions. These losses interpolate between optimal transport (OT) metrics such as Wasserstein distance and kernel-based alternatives like maximum mean discrepancy (MMD), and form the computational backbone of modern generative modeling, robust optimization, and related machine learning frameworks.

1. Mathematical Foundations of Sinkhorn Losses

Sinkhorn-approximated losses originate from the regularized OT problem. Given two probability measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$, and a cost matrix $C \in \mathbb{R}^{n\times m}$ with entries $C_{ij} = c(x_i, y_j)$, the entropic OT cost is

$$W_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(a, b)} \sum_{i,j} C_{ij} \pi_{ij} + \varepsilon \sum_{i,j} \pi_{ij}(\log \pi_{ij} - 1),$$

where $\Pi(a, b) = \{ \pi \in \mathbb{R}_+^{n \times m} : \pi 1_m = a,\ \pi^T 1_n = b \}$ is the set of admissible couplings.
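
For concreteness, the discrete inputs $(a, b, C)$ can be assembled directly from two point clouds. The NumPy sketch below assumes uniform weights and a squared Euclidean ground cost, both common but optional modelling choices; the helper name is purely illustrative.

```python
import numpy as np

def discrete_ot_inputs(X, Y):
    """Build uniform weights a, b and a squared-Euclidean cost C for two point clouds.

    X: (n, d) support points of mu, Y: (m, d) support points of nu.
    """
    n, m = X.shape[0], Y.shape[0]
    a = np.full(n, 1.0 / n)          # uniform weights a_i
    b = np.full(m, 1.0 / m)          # uniform weights b_j
    # C_ij = ||x_i - y_j||^2, computed without explicit loops
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return a, b, C

rng = np.random.default_rng(0)
a, b, C = discrete_ot_inputs(rng.normal(size=(5, 2)), rng.normal(size=(7, 2)))
```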

The Sinkhorn divergence, used to debias entropic OT and ensure metric properties, is

$$S_\varepsilon(\mu, \nu) = W_\varepsilon(\mu, \nu) - \frac{1}{2} W_\varepsilon(\mu, \mu) - \frac{1}{2} W_\varepsilon(\nu, \nu).$$

Tuning $\varepsilon$ interpolates between the geometric OT regime ($\varepsilon \to 0$) and the MMD regime ($\varepsilon \to \infty$), where the coupling approaches the independent product $\mu \otimes \nu$ and $S_\varepsilon$ degenerates to an energy or kernel-based divergence.

2. Computational Algorithms: Sinkhorn Iterations

The central computational procedure for evaluating Sinkhorn losses is a fixed-point iteration referred to as the Sinkhorn algorithm. Defining the Gibbs kernel $K_{ij} = \exp(-C_{ij}/\varepsilon)$, the optimal plan is sought in the form $\pi = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$, with the marginal constraints enforced by the iterative updates

$$u \leftarrow \frac{a}{K v}, \qquad v \leftarrow \frac{b}{K^T u},$$

where all vector divisions are element-wise. After $L$ iterations, the coupling is $\pi^{(L)} = \mathrm{diag}(u^{(L)})\, K\, \mathrm{diag}(v^{(L)})$, yielding an approximate cost $W_\varepsilon(\mu, \nu) \approx \sum_{i,j} C_{ij} \pi^{(L)}_{ij}$ (Genevay et al., 2017). This scheme admits robust implementation on modern GPUs with batch sizes $n, m$ in the range 128–512 and iteration count $L$ typically 10–50 for $\varepsilon \geq 0.1$.
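
The following NumPy sketch implements these updates directly; the function names are illustrative, the returned cost keeps only the transport term $\sum_{i,j} C_{ij}\pi^{(L)}_{ij}$ as in the approximation above, and the debiasing of Section 1 is obtained by reusing the same solver on the self-cost matrices.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, n_iter=50):
    """Approximate W_eps(mu, nu) with L = n_iter fixed-point Sinkhorn iterations."""
    K = np.exp(-C / eps)                      # Gibbs kernel K_ij = exp(-C_ij / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                   # element-wise divisions, as in the updates above
        u = a / (K @ v)
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]          # pi^(L) = diag(u) K diag(v)
    return float((C * pi).sum()), pi          # transport part of W_eps, plus the coupling

def sinkhorn_divergence(a, b, C_ab, C_aa, C_bb, eps=0.1, n_iter=50):
    """Debiased Sinkhorn divergence S_eps of Section 1, reusing the same solver."""
    w_ab, _ = sinkhorn_cost(a, b, C_ab, eps, n_iter)
    w_aa, _ = sinkhorn_cost(a, a, C_aa, eps, n_iter)
    w_bb, _ = sinkhorn_cost(b, b, C_bb, eps, n_iter)
    return w_ab - 0.5 * w_aa - 0.5 * w_bb
```

For small $\varepsilon$ the kernel $K$ underflows in double precision, which is why log-domain variants (see Section 7) are preferred in practice.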

For generalized OT problems including partial transport or unbalanced couplings, the Sinkhorn framework adapts via proximal/divide updates and clipping/min operations on the dual scalings $u, v$ to enforce penalized marginal constraints (Bai, 9 Jul 2024).
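
As one hedged illustration of such clipping: if destroying a unit of $\mu$-mass costs $\lambda_1$ and creating a unit of $\nu$-mass costs $\lambda_2$, capping the dual potentials at these penalties corresponds, in the scaling domain, to a min against $e^{\lambda_k/\varepsilon}$. The sketch below shows that generic mechanism and is not the exact update rule of Bai (9 Jul 2024); the function name and defaults are illustrative.

```python
import numpy as np

def sinkhorn_with_mass_penalties(a, b, C, eps=0.1, lam1=1.0, lam2=1.0, n_iter=100):
    """Schematic Sinkhorn-type iteration with penalized (relaxed) marginals.

    Capping the scalings u, v at exp(lam/eps) caps the dual potentials at the per-unit
    penalties lam1 (mass destruction) and lam2 (mass creation), so the plan may leave
    some mass untransported. Illustrative sketch only.
    """
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    cap_u, cap_v = np.exp(lam1 / eps), np.exp(lam2 / eps)
    for _ in range(n_iter):
        u = np.minimum(a / (K @ v), cap_u)    # clip instead of enforcing pi 1_m = a exactly
        v = np.minimum(b / (K.T @ u), cap_v)  # clip instead of enforcing pi^T 1_n = b exactly
    pi = u[:, None] * K * v[None, :]
    return pi                                  # marginals of pi are relaxed, not matched exactly
```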

3. Differentiability, Gradient Computation, and Implicit Differentiation

The entropic regularization confers infinite differentiability ($C^\infty$) on both the Sinkhorn cost and the Sinkhorn divergence in the interior of the simplex (Luise et al., 2018). Efficient gradient computation leverages either (i) backpropagation through unrolled Sinkhorn iterations, or (ii) implicit differentiation of the KKT conditions

$$\varepsilon \log P^* + C + u 1_m^T + 1_n v^T = 0, \qquad P^* 1_m = a, \quad (P^*)^T 1_n = b,$$

where $P^*$ is the optimal plan and $(u, v)$ are the dual potentials. Solving the resulting sparse linear system yields vector-Jacobian products for the analytic gradients $\frac{\partial \ell}{\partial C}, \frac{\partial \ell}{\partial a}, \frac{\partial \ell}{\partial b}$ with provable error bounds (Eisenberger et al., 2022). Implicit differentiation is advantageous for large $L$ or $n, m$, providing superior memory efficiency over unrolled automatic differentiation (AD).
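
A lighter-weight alternative to the full implicit scheme uses the envelope theorem: since $P^*$ minimizes the regularized objective, $\partial W_\varepsilon / \partial C = P^*$. The hedged PyTorch sketch below exploits this to run the solver outside the autograd graph and return only the plan in the backward pass; the marginals are treated as constants (unlike Eisenberger et al., 2022) and the class name is illustrative.

```python
import torch

class EntropicOTCost(torch.autograd.Function):
    """Entropic OT value W_eps with an envelope-theorem backward pass.

    Forward: run Sinkhorn outside the autograd graph to obtain the plan pi*.
    Backward: dW_eps/dC = pi* (the marginals a, b are treated as constants here).
    """

    @staticmethod
    def forward(ctx, C, a, b, eps, n_iter):
        with torch.no_grad():
            K = torch.exp(-C / eps)
            u, v = torch.ones_like(a), torch.ones_like(b)
            for _ in range(n_iter):
                u = a / (K @ v)
                v = b / (K.T @ u)
            pi = u[:, None] * K * v[None, :]
            # full regularized value, including the entropic term
            w = (C * pi).sum() + eps * (pi * (torch.log(pi + 1e-30) - 1.0)).sum()
        ctx.save_for_backward(pi)
        return w

    @staticmethod
    def backward(ctx, grad_out):
        (pi,) = ctx.saved_tensors
        # one gradient per forward input: C, a, b, eps, n_iter
        return grad_out * pi, None, None, None, None
```

Calling `EntropicOTCost.apply(C, a, b, 0.1, 100)` on a cost matrix built from learnable features then lets `backward()` push $\pi^*$-weighted gradients into those features without storing any solver iterates.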

4. Statistical and Learning-Theoretic Properties

Entropic regularization produces strictly convex transport losses with improved sample complexity and variance. For small $\varepsilon$, the OT bias persists with the curse of dimensionality ($O(n^{-1/d})$); for large $\varepsilon$, the Sinkhorn loss exhibits MMD-like sample complexity ($O(n^{-1/2})$) (Genevay et al., 2017). In supervised learning, the sharp Sinkhorn loss ($S_\lambda$ without the explicit entropy penalty in the final cost) guarantees universal consistency and fast excess-risk convergence rates under standard RKHS assumptions (Luise et al., 2018).

Recent work verifies second-order Hadamard differentiability of Sinkhorn divergences, facilitating local quadratic approximations and enabling rigorous coreset construction (Kokot et al., 28 Apr 2025). This functional smoothness underpins compressed representations and efficient subsampling schemes.

5. Advanced Variants: Online, Generalized, and Partial Transport

Algorithms such as Online Sinkhorn (Mensch et al., 2020) permit stochastic streaming estimation of Sinkhorn-approximated losses, maintaining non-parametric mixture representations of the scaling potentials $(u_t, v_t)$:

  • New sample batches update the mixture weights via stochastic approximation (Robbins–Monro steps), as in the sketch after this list,
  • Theoretical guarantees yield near-optimal $O(1/\sqrt{N})$ error rates for $N$ samples.
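
The sketch below is a schematic NumPy rendering of this idea: the scalings are stored as mixtures over the samples seen so far and mixed with Robbins–Monro steps $\eta_t = 1/\sqrt{t}$. The helper names, step-size schedule, squared Euclidean cost, and bookkeeping are simplifying assumptions rather than the exact algorithm of Mensch et al. (2020); a log-domain version would be preferred in practice.

```python
import numpy as np

def cost(X, Y):
    """Squared Euclidean cost between two point clouds."""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

def online_sinkhorn(sample_mu, sample_nu, eps=0.5, batch=64, n_steps=200, d=2):
    """Schematic online Sinkhorn: potentials kept as growing mixtures over seen samples.

    sample_mu(B) / sample_nu(B) return (B, d) arrays of fresh samples.
    exp(-g(y)/eps) is represented as sum_k alpha_k exp(-C(x_k, y)/eps), and symmetrically
    exp(-f(x)/eps) as sum_l beta_l exp(-C(x, y_l)/eps). Illustrative sketch only.
    """
    Xs, alpha = np.empty((0, d)), np.empty(0)   # support and weights for exp(-g/eps)
    Ys, beta = np.empty((0, d)), np.empty(0)    # support and weights for exp(-f/eps)

    def f_at(x):  # f(x) = -eps * log sum_l beta_l exp(-C(x, y_l)/eps); f = 0 before any data
        if len(beta) == 0:
            return np.zeros(len(x))
        return -eps * np.log(np.exp(-cost(x, Ys) / eps) @ beta)

    def g_at(y):
        if len(alpha) == 0:
            return np.zeros(len(y))
        return -eps * np.log(np.exp(-cost(Xs, y) / eps).T @ alpha)

    for t in range(1, n_steps + 1):
        eta = 1.0 / np.sqrt(t)                  # Robbins-Monro step size (one common choice)
        xb, yb = sample_mu(batch), sample_nu(batch)
        alpha = np.concatenate([(1.0 - eta) * alpha, eta * np.exp(f_at(xb) / eps) / batch])
        Xs = np.vstack([Xs, xb])
        beta = np.concatenate([(1.0 - eta) * beta, eta * np.exp(g_at(yb) / eps) / batch])
        Ys = np.vstack([Ys, yb])
    return f_at, g_at

rng = np.random.default_rng(0)
f_hat, g_hat = online_sinkhorn(lambda B: rng.normal(size=(B, 2)),
                               lambda B: 1.0 + rng.normal(size=(B, 2)))
```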

Generalized and partial transport models (GOPT) introduce penalty functions $\lambda_1, \lambda_2$ for mass destruction/creation and enable coupled clipping/min operations inside the Sinkhorn iterates. This confers flexibility over the balanced/unbalanced spectrum by adjusting mass constraints in the primal and dual (Bai, 9 Jul 2024).

Convex regularization beyond Shannon entropy is accommodated in generalized Sinkhorn frameworks (Marino et al., 2020), including regularizers $\Phi$ (e.g. Tsallis, quadratic), with each instantiation yielding a corresponding dual, complementary slackness conditions, and IPFP-type iterative scaling.
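
In this generalized setting, the primal problem can be summarized (with the notation $W_\varepsilon^{\Phi}$ chosen here for exposition) as

$$W_\varepsilon^{\Phi}(\mu, \nu) = \min_{\pi \in \Pi(a, b)} \sum_{i,j} C_{ij}\,\pi_{ij} + \varepsilon \sum_{i,j} \Phi(\pi_{ij}),$$

where the Shannon choice $\Phi(t) = t(\log t - 1)$ recovers $W_\varepsilon$ from Section 1; the specific dual, complementary slackness conditions, and scaling updates depend on the chosen $\Phi$ (Marino et al., 2020).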

6. Practical Implementation, Optimization, and Applications

In contemporary deep learning, Sinkhorn layers are integrated end-to-end:

  • Cost networks compute $C_{ij}$ from learned features,
  • Sinkhorn iterations yield the coupling $\pi$ and the regularized OT loss,
  • Backpropagation exploits the differentiability of the matrix/tensor operations (a minimal layer sketch follows this list).
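
A hedged PyTorch sketch of such a layer is shown below; the encoder architecture, uniform marginals, squared-distance cost, and hyperparameters are illustrative choices, and gradients flow by ordinary automatic differentiation through the unrolled loop.

```python
import torch
import torch.nn as nn

class SinkhornLoss(nn.Module):
    """Unrolled Sinkhorn layer: features -> cost matrix -> coupling -> regularized OT loss."""

    def __init__(self, eps=0.5, n_iter=30):
        super().__init__()
        self.eps, self.n_iter = eps, n_iter

    def forward(self, feats_x, feats_y):
        n, m = feats_x.shape[0], feats_y.shape[0]
        a = feats_x.new_full((n,), 1.0 / n)          # uniform marginals (a modelling choice)
        b = feats_y.new_full((m,), 1.0 / m)
        C = torch.cdist(feats_x, feats_y) ** 2       # cost C_ij from learned features
        K = torch.exp(-C / self.eps)
        u, v = torch.ones_like(a), torch.ones_like(b)
        for _ in range(self.n_iter):                 # every op stays in the autograd graph
            u = a / (K @ v + 1e-30)
            v = b / (K.T @ u + 1e-30)
        pi = u[:, None] * K * v[None, :]
        return (C * pi).sum()                        # transport part of the regularized loss

# illustrative usage with a small feature network
encoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
loss_fn = SinkhornLoss(eps=0.5, n_iter=30)
x, y = torch.randn(64, 8), torch.randn(64, 8)
loss = loss_fn(encoder(x), encoder(y))
loss.backward()                                      # gradients flow into the encoder weights
```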

Loss surrogate terms in physics-informed neural networks, generative adversarial nets (GANs), Schrödinger bridges, and robust optimization pipelines use Sinkhorn divergences to enforce distributional constraints or supply differentiable distribution-matching objectives (Genevay et al., 2017, Nodozi et al., 2023, Wang et al., 2021). Distributionally robust optimization (DRO) with Sinkhorn balls replaces the hard Wasserstein supremum by a smooth log-sum-exp, making worst-case training tractable (Wang et al., 2021).
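
As a rough, hedged illustration of that smoothing, the hard supremum over a Sinkhorn ball can be replaced by a log-sum-exp over randomly perturbed copies of each sample. In the sketch below the dual multiplier `lam` is a fixed hyperparameter, the reference perturbations are Gaussian, and the ball-radius term is omitted, so this is not the exact dual formulation of Wang et al. (2021); the function name is illustrative.

```python
import math
import torch

def sinkhorn_dro_objective(loss_fn, x_batch, lam=1.0, eps=0.1, n_pert=16, sigma=0.1):
    """Soft worst-case loss over random perturbations (schematic Sinkhorn-DRO surrogate).

    loss_fn must accept a tensor of shape (n_pert, B, d) and return per-sample losses of
    shape (n_pert, B). Each data point x gets n_pert perturbed copies z; their losses are
    discounted by the transport cost lam * ||z - x||^2 and aggregated with a log-sum-exp
    of temperature lam * eps instead of a hard supremum. Illustrative sketch only.
    """
    z = x_batch.unsqueeze(0) + sigma * torch.randn(n_pert, *x_batch.shape)  # (n_pert, B, d)
    transport = ((z - x_batch.unsqueeze(0)) ** 2).sum(dim=-1)               # c(x, z)
    scores = (loss_fn(z) - lam * transport) / (lam * eps)
    # log-mean-exp over perturbations acts as a smooth surrogate for sup_z
    soft_sup = lam * eps * (torch.logsumexp(scores, dim=0) - math.log(n_pert))
    return soft_sup.mean()                                                  # average over the batch
```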

Specific algorithms, such as CO2-coresets, compress large datasets for regularized Sinkhorn loss minimization via spectral decompositions and MMD-based matching in RKHS, with polylogarithmic coreset size and near-optimal approximation error (Kokot et al., 28 Apr 2025).

7. Computational Complexity, Approximation Strategies, and Scalability

Each Sinkhorn iteration costs $O(nm)$; the total cost is $O(Lnm)$ for $L$ steps. For large $n, m$, strategies include:

  • Screening (Screenkhorn) to selectively freeze negligible dual components (Alaya et al., 2019),
  • Mini-batching to fit cost matrices in memory,
  • Warm-starting dual variables for successive optimization loops (see the sketch after this list),
  • Implicit differentiation for scalable backward passes with fixed memory requirements.
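
A hedged sketch combining two of these points: a log-domain (numerically stabilized) Sinkhorn solver that accepts and returns its dual potentials, so successive solves in an optimization loop can be warm-started from the previous iteration. The function name and interface are illustrative.

```python
import torch

def log_sinkhorn(C, a, b, eps=0.1, n_iter=50, f=None, g=None):
    """Log-domain Sinkhorn with optional warm-started dual potentials f, g.

    Works with potentials f = eps*log(u), g = eps*log(v) to avoid under/overflow of the
    Gibbs kernel for small eps. Returns the coupling and the final potentials, which can
    seed the next solve.
    """
    log_a, log_b = torch.log(a), torch.log(b)
    if f is None:
        f = torch.zeros_like(a)
    if g is None:
        g = torch.zeros_like(b)
    for _ in range(n_iter):
        # log-domain version of u <- a / (K v) and v <- b / (K^T u)
        f = eps * log_a - eps * torch.logsumexp((g[None, :] - C) / eps, dim=1)
        g = eps * log_b - eps * torch.logsumexp((f[:, None] - C) / eps, dim=0)
    log_pi = (f[:, None] + g[None, :] - C) / eps      # pi = diag(u) K diag(v)
    return torch.exp(log_pi), f, g

# warm-starting across an optimization loop (illustrative):
# pi, f, g = log_sinkhorn(C_t, a, b, f=f, g=g)        # reuse potentials from the previous step
```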

Approximation bounds characterize the trade-off between marginal violations and objective error; screened or compressed solves can achieve marginal error below $10^{-3}$ and substantial computational savings given appropriate budgets (Alaya et al., 2019).


Sinkhorn-approximated losses synthesize optimal transport theory, entropic regularization, and iterative matrix scaling into a scalable, robust, and fully differentiable loss framework supporting a wide spectrum of modern statistical, optimization, and learning applications. With explicit control via $\varepsilon$ and extensions to convex regularizers, partial mass regimes, and sample-streaming/distributionally robust pipelines, Sinkhorn divergences constitute a foundational tool for high-dimensional generative modeling, robust learning, and applied optimal transport in both discrete and continuous settings.
