Sinkhorn Proxy Drift

Updated 9 May 2026

Sinkhorn Proxy Drift uses the Sinkhorn divergence to approximate Wasserstein gradient flows, yielding proxy drift fields that are efficient to compute.
It offers a practical method for trading off between numerical tractability and theoretical fidelity via entropic regularization.
Applications include stable neural generative modeling and scalable particle flow approximations in high-dimensional optimal transport problems.

Sinkhorn Proxy Drift is an umbrella term for procedures that use the Sinkhorn divergence, a symmetric, entropically regularized optimal transport (OT) functional, as a tractable surrogate ("proxy") in variational inference, gradient flows, and generative model training. The concept originates in the mathematical and algorithmic analysis of Wasserstein gradient flows (WGF) and their entropic approximations, which leads to a spectrum of "proxy" drift fields that interpolate between exact transport-driven dynamics and more tractable, scalable alternatives. The paradigm covers rigorous results for continuous PDEs, practical drift-field approximations for particle flows, and neural generative procedures. Its defining feature is the use of the Sinkhorn divergence, rather than the unregularized Wasserstein distance, to define or estimate the transporting vector field, often trading some theoretical fidelity to the gradient flow for statistical or computational tractability.

1. Mathematical Foundations: Sinkhorn Divergence and Entropic Optimal Transport

For probability measures $\rho$ and $\nu$ on $\mathbb{R}^d$, with quadratic cost $c(x, y) = \|x - y\|^2$, the entropic-regularized optimal transport cost is defined as

$\mathrm{OT}_\varepsilon(\rho, \nu) = \min_{\pi \in \Pi(\rho, \nu)} \int c(x, y)\,d\pi(x, y) + \varepsilon\,\mathrm{KL}(\pi \|\rho \otimes \nu).$

The Sinkhorn divergence is the symmetric, bias-corrected functional

$D_\varepsilon(\rho \| \nu) = \mathrm{OT}_\varepsilon(\rho, \nu) - \tfrac{1}{2} \mathrm{OT}_\varepsilon(\rho, \rho) - \tfrac{1}{2} \mathrm{OT}_\varepsilon(\nu, \nu).$

The entropic parameter $\varepsilon > 0$ controls the trade-off between transport accuracy and entropy regularization. As $\varepsilon \to 0$, $D_\varepsilon$ recovers the squared Wasserstein-2 distance.

The Sinkhorn divergence is computationally accessible via the Sinkhorn algorithm and admits dual characterizations via potentials $\phi^*, \psi^*$, which yield efficient minibatch and GPU-based estimators, enabling its integration as a "proxy" in high-dimensional inference and learning systems (Zhu et al., 2024).
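As a minimal sketch of how these quantities can be evaluated on point clouds, the NumPy code below runs log-domain Sinkhorn iterations for uniform empirical measures and assembles the divergence from three $\mathrm{OT}_\varepsilon$ values. The function names (`sq_cost`, `ot_eps`, `sinkhorn_divergence`), the uniform weights, and the fixed iteration count are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))

def sq_cost(x, y):
    """Pairwise quadratic cost c(x, y) = ||x - y||^2."""
    return np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

def ot_eps(x, y, eps, n_iter=200):
    """Dual estimate of OT_eps between uniform empirical measures on x and y.

    Assumes KL regularization relative to the product of the marginals,
    as in the definition of OT_eps used in this section.
    """
    n, m = len(x), len(y)
    C = sq_cost(x, y)
    f, g = np.zeros(n), np.zeros(m)  # dual potentials
    for _ in range(n_iter):
        # Alternating soft-min updates of the two potentials.
        f = -eps * logsumexp((g[None, :] - C) / eps - np.log(m), axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps - np.log(n), axis=0)
    return f.mean() + g.mean()

def sinkhorn_divergence(x, y, eps, n_iter=200):
    """Bias-corrected divergence: cross term minus half of each self term."""
    return (ot_eps(x, y, eps, n_iter)
            - 0.5 * ot_eps(x, x, eps, n_iter)
            - 0.5 * ot_eps(y, y, eps, n_iter))

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = x + np.array([2.0, 0.0])  # target is the source shifted by (2, 0)
d_xy = sinkhorn_divergence(x, y, eps=1.0)
d_xx = sinkhorn_divergence(x, x, eps=1.0)
```

In this toy case the target is a translate of the source with equal weights, so the self terms cancel the entropic bias and `d_xy` should come out close to the squared shift length, while `d_xx` is zero.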

2. Gradient Flows and the Notion of Drift

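Taking $F(\rho) = D_\varepsilon(\rho \| \nu)$ as a variational objective, the Wasserstein gradient flow (WGF) equation for the evolving law $\rho_t$ is

$\partial_t \rho_t = \nabla \cdot \left( \rho_t \, \nabla \frac{\delta F}{\delta \rho}(\rho_t) \right).$

It is a steepest-descent evolution in the Wasserstein geometry. The first variation of the Sinkhorn divergence is given by

$\frac{\delta F}{\delta \rho}(\rho) = \phi_{\rho,\nu} - \phi_{\rho,\rho},$

where $\phi_{\rho,\nu}$ is the solution to the entropic OT dual problem from $\rho$ to $\nu$, and $\phi_{\rho,\rho}$ is the (self-)potential for $\rho$. Thus, the Sinkhorn drift (editor's term) is

$v_\rho(x) = -\nabla \phi_{\rho,\nu}(x) + \nabla \phi_{\rho,\rho}(x).$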

This explicit cross-minus-self structure can be realized empirically via barycentric projections using the OT plans induced by Sinkhorn scaling (Zhu et al., 2024, He et al., 12 Mar 2026, Gretton et al., 6 May 2026).
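One way to make the barycentric-projection remark concrete: for the quadratic cost $c(x, y) = \|x - y\|^2$, the gradient of an entropic potential at $x$ equals $2(x - T(x))$, where $T(x)$ is the plan-weighted mean of the targets seen from $x$. The cross-minus-self drift then reads $2\,(T_{\rho\nu}(x) - T_{\rho\rho}(x))$. The sketch below assumes uniform weights and a fixed iteration count; all function names are illustrative.

```python
import numpy as np

def logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))

def sq_cost(x, y):
    """Pairwise quadratic cost c(x, y) = ||x - y||^2."""
    return np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

def sinkhorn_plan(C, eps, n_iter=200):
    """Entropic OT plan between uniform marginals via log-domain Sinkhorn."""
    n, m = C.shape
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        f = -eps * logsumexp((g[None, :] - C) / eps - np.log(m), axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps - np.log(n), axis=0)
    return np.exp((f[:, None] + g[None, :] - C) / eps - np.log(n) - np.log(m))

def barycentric_map(P, targets):
    """Plan-weighted mean of the targets for each source point."""
    return (P @ targets) / P.sum(axis=1, keepdims=True)

def sinkhorn_drift(x, y, eps, n_iter=200):
    """Cross-minus-self drift 2 * (T_xy - T_xx) at each particle of x."""
    T_xy = barycentric_map(sinkhorn_plan(sq_cost(x, y), eps, n_iter), y)
    T_xx = barycentric_map(sinkhorn_plan(sq_cost(x, x), eps, n_iter), x)
    return 2.0 * (T_xy - T_xx)

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
s = np.array([2.0, 0.0])
v_shift = sinkhorn_drift(x, x + s, eps=1.0)  # target is a translate of the source
v_same = sinkhorn_drift(x, x, eps=1.0)       # target equals the source
```

When the target is a translate of the source by `s`, the converged cross and self plans coincide, so every particle receives the drift `2 * s`; when the two clouds are identical the drift is zero.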

3. Sinkhorn Proxy Drift: Particle Approximation and One-Shot Proxies

Full computation of the exact Sinkhorn drift at each iteration is computationally expensive, especially in large-batch or high-dimensional regimes. "Sinkhorn proxy drift" refers to a practical procedure where the drift is approximated as follows:

Compute softened pairwise costs between model (particle set $\{x_i\}_{i=1}^n$) and data (particle set $\{y_j\}_{j=1}^m$).
Use Sinkhorn scaling truncated to very few iterations, leading to one-sided or geometric-mean normalized "proxy" couplings (i.e., not enforcing full row and column marginals).
Form the drift field for each sample as a cross-minus-self sum:

$\hat v(x_i) = \sum_{j} \tilde\pi^{xy}_{ij}\,(y_j - x_i) - \sum_{k} \tilde\pi^{xx}_{ik}\,(x_k - x_i),$

where $\tilde\pi^{xy}$ and $\tilde\pi^{xx}$ are proxy Gibbs kernels or geometric-mean pseudo-plans (Gretton et al., 6 May 2026).

This proxy drift is consistent (vanishes if and only if model and target match) but is not always conservative; it generically fails to be a gradient field unless additional alignment conditions are met.
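The steps above can be sketched in a few lines of NumPy. The two normalizations shown, a row-wise softmax ("one-sided") and the elementwise geometric mean of row-wise and column-wise softmaxes, are one reading of the proxy couplings described here and are assumptions for illustration, as are the function names.

```python
import numpy as np

def sq_cost(x, y):
    """Pairwise quadratic cost c(x, y) = ||x - y||^2."""
    return np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

def softmax(z, axis):
    """Softmax along one axis, shifted for numerical stability."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def proxy_coupling(C, eps, mode="one_sided"):
    """Pseudo-plan built from the Gibbs kernel exp(-C/eps) without full Sinkhorn scaling."""
    rows = softmax(-C / eps, axis=1)      # each row sums to 1
    if mode == "one_sided":
        return rows
    if mode == "geometric_mean":
        cols = softmax(-C / eps, axis=0)  # each column sums to 1
        return np.sqrt(rows * cols)       # neither marginal is enforced exactly
    raise ValueError(f"unknown mode: {mode}")

def proxy_drift(x, y, eps, mode="one_sided"):
    """Cross-minus-self sum of weighted displacements for each particle of x."""
    W_xy = proxy_coupling(sq_cost(x, y), eps, mode)
    W_xx = proxy_coupling(sq_cost(x, x), eps, mode)
    cross = W_xy @ y - W_xy.sum(axis=1, keepdims=True) * x  # sum_j W_ij (y_j - x_i)
    self_ = W_xx @ x - W_xx.sum(axis=1, keepdims=True) * x  # sum_k W_ik (x_k - x_i)
    return cross - self_

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_far = x + np.array([5.0, 0.0])
v_match_one = proxy_drift(x, x, eps=1.0, mode="one_sided")
v_match_geo = proxy_drift(x, x, eps=1.0, mode="geometric_mean")
v_far = proxy_drift(x, y_far, eps=1.0, mode="one_sided")
```

The example only exercises the easy direction of consistency: when the model and data particles coincide, the cross and self sums are identical and the drift is zero. When the data sit to the right of the model, the one-sided drift points to the right.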

The method is computationally efficient, requires only a forward pass (no backpropagation through the OT solver), and remains stable at low entropic regularization, though it can underperform in resolving mass splits between well-separated modes (Gretton et al., 6 May 2026, He et al., 12 Mar 2026).

4. Neural and Algorithmic Realizations

The Neural Sinkhorn Gradient Flow (NSGF) framework parameterizes the time-dependent velocity field $v_\theta(x, t)$ via a neural network and trains it to regress to the empirical Sinkhorn-based velocity estimate using a velocity-matching loss:

$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x \sim \rho_t} \left\| v_\theta(x, t) - \hat v_t(x) \right\|^2,$

where $\hat v_t$ is an empirical estimate of the true Sinkhorn drift, constructed from mini-batches of samples from the source and target distributions using empirical Sinkhorn plans (Zhu et al., 2024). The NSGF++ scheme introduces a two-phase transport: an initial phase of Sinkhorn-driven flow deemed sufficient to reach the data manifold, followed by a straight-line refinement toward prescribed data points.
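To illustrate the velocity-matching regression in isolation, the sketch below swaps the neural network for a linear model in $(x, t)$ fitted by least squares, and uses a synthetic array `v_hat` as a stand-in for minibatch Sinkhorn drift estimates. Both substitutions, and all names, are assumptions made for this example and are not the NSGF implementation.

```python
import numpy as np

def features(x, t):
    """Design matrix [x, t, 1] for a linear velocity model v(x, t) = features @ W."""
    return np.concatenate([x, t[:, None], np.ones((len(x), 1))], axis=1)

def velocity_matching_loss(W, x, t, v_hat):
    """Mean squared distance between model velocities and target velocities."""
    pred = features(x, t) @ W
    return np.mean(np.sum((pred - v_hat) ** 2, axis=1))

def fit_linear_velocity(x, t, v_hat):
    """Least-squares minimizer of the velocity-matching loss for the linear model."""
    W, *_ = np.linalg.lstsq(features(x, t), v_hat, rcond=None)
    return W

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 2))      # particle positions along the flow
t = rng.uniform(size=64)          # time stamp of each particle
# Synthetic targets standing in for Sinkhorn drift estimates (hypothetical data).
v_hat = -x + t[:, None] * np.array([1.0, -0.5])

W = fit_linear_velocity(x, t, v_hat)
loss_fit = velocity_matching_loss(W, x, t, v_hat)
loss_zero = velocity_matching_loss(np.zeros_like(W), x, t, v_hat)
```

Because the synthetic targets are exactly linear in $(x, t)$, the fitted loss is near zero here; with real drift estimates the same loss would be minimized by gradient steps on the network parameters instead.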

In batch-based settings, the "Sinkhorn drift" operation is commonly written as a short routine parameterized by the number of Sinkhorn iterations. At the smallest iteration count, the result reduces to a naive one-sided "drifting" update as in recent GMD algorithms; for a moderate iteration count, the coupling approaches the true doubly-stochastic plan, yielding higher-fidelity drifts (He et al., 12 Mar 2026).
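A minimal sketch of such a batch routine, assuming uniform weights and the quadratic cost, is given below. The parameter name `n_iter`, the convention that `n_iter = 0` means a single row normalization, and the factor 2 (from the gradient of the quadratic cost) are assumptions made here for illustration and are not taken from He et al.

```python
import numpy as np

def logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))

def sq_cost(x, y):
    """Pairwise quadratic cost c(x, y) = ||x - y||^2."""
    return np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

def truncated_plan(C, eps, n_iter):
    """Gibbs kernel scaled by a row normalization plus n_iter Sinkhorn sweeps."""
    n, m = C.shape
    logK = -C / eps
    g = np.zeros(m)
    f = -np.log(n) - logsumexp(logK + g[None, :], axis=1)      # rows sum to 1/n
    for _ in range(n_iter):
        g = -np.log(m) - logsumexp(logK + f[:, None], axis=0)  # columns sum to 1/m
        f = -np.log(n) - logsumexp(logK + g[None, :], axis=1)  # rows sum to 1/n again
    return np.exp(logK + f[:, None] + g[None, :])

def sinkhorn_drift(x, y, eps, n_iter):
    """Cross-minus-self drift from truncated plans."""
    n = len(x)
    P_xy = truncated_plan(sq_cost(x, y), eps, n_iter)
    P_xx = truncated_plan(sq_cost(x, x), eps, n_iter)
    # Rows of n * P sum to one, so the x_i terms of the cross and self sums cancel.
    return 2.0 * n * (P_xy @ y - P_xx @ x)

def column_error(P):
    """Largest deviation of the column marginal from the uniform weight 1/m."""
    return np.abs(P.sum(axis=0) - 1.0 / P.shape[1]).max()

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 1.0], [0.5, 0.2], [0.9, 0.1], [0.2, 0.8]])
C = sq_cost(x, y)
err_one_sided = column_error(truncated_plan(C, 1.0, 0))
err_converged = column_error(truncated_plan(C, 1.0, 200))
v_one_sided = sinkhorn_drift(x, y, 1.0, 0)
v_converged = sinkhorn_drift(x, y, 1.0, 200)
```

With `n_iter = 0` the coupling is a row-wise softmax of the negative scaled costs, so the column marginal is generally violated; as `n_iter` grows the column error shrinks and the coupling approaches the balanced entropic plan.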

5. Theoretical Properties and Identifiability

The Sinkhorn divergence is strictly positive definite: $D_\varepsilon(\rho \| \nu) = 0$ if and only if $\rho = \nu$ (He et al., 12 Mar 2026, Gretton et al., 6 May 2026). Consequently, its gradient flow admits a unique equilibrium at the target law. In the context of proxy drift, the proxy vector field vanishes only if empirical or population-level distributions match (the so-called identifiability property), which resolves a well-documented gap in previous "drifting" frameworks based only on kernel means or one-sided normalization. This property holds in both continuous and empirical particle settings under mild nondegeneracy (e.g., distinct support points) (He et al., 12 Mar 2026, Gretton et al., 6 May 2026). For the approximated (proxy) field, the consistency property is preserved, though higher-order mass transportation properties may be compromised.

6. Sinkhorn Proxy Drift in the JKO Scheme and PDEs

In the time-discrete Jordan–Kinderlehrer–Otto (JKO) minimization scheme for constructing Wasserstein gradient flows, replacing $W_2^2$ with the entropic (Sinkhorn) cost yields the entropic JKO step:

$\rho^{k+1} \in \arg\min_{\rho} \left\{ F(\rho) + \frac{1}{2\tau}\,\mathrm{OT}_\varepsilon(\rho, \rho^{k}) \right\}.$

It is shown that in the diffusive regime $\tau, \varepsilon \to 0$ with $\varepsilon / \tau \to \alpha$, the limiting PDE acquires an extra linear diffusion term (the constant in front of the Laplacian depends on how the cost and the entropy term are normalized):

$\partial_t \rho = \nabla \cdot \left( \rho \, \nabla \frac{\delta F}{\delta \rho} \right) + \frac{\alpha}{2} \Delta \rho.$

This additional $\frac{\alpha}{2} \Delta \rho$ term is called the "Sinkhorn proxy drift" in the PDE context. For $\alpha = 0$ (i.e., $\varepsilon = o(\tau)$), the classical Wasserstein gradient flow is recovered. The scaling $\varepsilon \sim \alpha \tau$ allows practitioners to stabilize Sinkhorn computations at the expense of diffusive bias, facilitating a trade-off between numerical tractability and fidelity to pure Wasserstein-driven evolution (Baradat et al., 18 Feb 2025).
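As a worked illustration of the diffusive bias, consider the potential energy $F(\rho) = \int V \, d\rho$ with $V(x) = \tfrac{1}{2}\|x\|^2$, and write $\alpha$ for the limiting ratio $\varepsilon / \tau$. The choice of $F$ and the form $\tfrac{\alpha}{2} \Delta \rho$ of the extra term are assumptions made for this example, following the statement of Baradat et al.; the constant may differ under other normalizations. The limiting equation is then

$\partial_t \rho = \nabla \cdot (\rho \, \nabla V) + \tfrac{\alpha}{2} \Delta \rho = \nabla \cdot (\rho \, x) + \tfrac{\alpha}{2} \Delta \rho.$

A stationary density with zero flux satisfies $\rho \, \nabla V + \tfrac{\alpha}{2} \nabla \rho = 0$, so

$\rho_\infty(x) \propto \exp\left( -\tfrac{2 V(x)}{\alpha} \right) = \exp\left( -\tfrac{\|x\|^2}{\alpha} \right),$

which is a centered Gaussian with covariance $\tfrac{\alpha}{2} I$. For $\alpha = 0$ the flow instead contracts all mass to the minimizer $x = 0$ of $V$, so in this example the ratio $\varepsilon / \tau$ directly sets the width of the spreading introduced by the entropic step.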

7. Empirical and Practical Significance

In generative modeling tasks—particularly those prone to low-temperature pathologies such as mode collapse—Sinkhorn proxy drift methods consistently outperform one-sided and kernel-density-induced drifts. For example, Sinkhorn drifting reduces mean FID score on FFHQ-ALAE from 187.7 to 37.1 and mean EMD from 453.3 to 144.4 at the lowest temperature, while maintaining full class coverage on MNIST across temperature sweeps. The overhead of using Sinkhorn drift (i.e., a handful of forward Sinkhorn iterations per batch) is modest and does not alter the inference procedure at test time (He et al., 12 Mar 2026).

8. Limitations and Open Problems

While Sinkhorn proxy drift provides a tractable and theoretically justified surrogate for optimal transport-driven flows, the proxy field (when not fully Sinkhorn-scaled) may fail to be conservative (not generally expressible as the gradient of any global functional) and may not transport mass optimally between widely separated modes at finite entropic regularization. Empirical proxy drifts converge to the true WGF drift only asymptotically, as batch sizes and the number of Sinkhorn scaling iterations tend to infinity. Thus, practical deployments must tune entropic parameters and iteration budgets to balance statistical error, numerical stability, and approximation quality (Gretton et al., 6 May 2026, Baradat et al., 18 Feb 2025).

References

Neural Sinkhorn Gradient Flow (Zhu et al., 2024)
Sinkhorn-Drifting Generative Models (He et al., 12 Mar 2026)
Using Sinkhorn in the JKO scheme adds linear diffusion (Baradat et al., 18 Feb 2025)
On the Wasserstein Gradient Flow Interpretation of Drifting Models (Gretton et al., 6 May 2026)
