
Entropic Regularization in Optimal Transport

Updated 22 February 2026
  • Entropic regularization of optimal transport is a method that integrates a Shannon entropy penalty to transform a linear program into a strictly convex and smooth optimization problem.
  • It leverages algorithms like Sinkhorn and Sinkhorn–Newton to achieve efficient convergence, reducing computational complexity and ensuring scalability.
  • The approach also enhances statistical robustness and selects unique optimal transport plans, balancing approximation bias with estimation variance.

Entropic regularization of optimal transportation is a mathematical framework that modifies classical optimal transport (OT) by adding an entropy term to the cost functional, thereby transforming the original linear program into a strictly convex and smooth optimization problem. This regularization has profound consequences for existence, uniqueness, regularity, computational tractability, and statistical properties in discrete and continuous OT problems.

1. Problem Formulation and Variational Structure

Consider two discrete probability vectors $a \in \Sigma_n$ and $b \in \Sigma_m$ (i.e., nonnegative entries summing to one), and a nonnegative cost matrix $C \in \mathbb{R}_+^{n \times m}$. The set of admissible couplings is

$$U(a, b) = \{P \in \mathbb{R}_+^{n \times m} : P 1_m = a,\; P^T 1_n = b\}.$$

The classical OT problem minimizes the transport cost $\langle C, P \rangle$ over couplings $P \in U(a, b)$. Entropic regularization adds an entropy penalty:

$$\min_{P \in U(a, b)} \langle C, P \rangle + \varepsilon \sum_{i=1}^n \sum_{j=1}^m P_{ij} (\log P_{ij} - 1).$$

The term $\sum_{ij} P_{ij} (\log P_{ij} - 1)$ is the negative Shannon entropy (up to an additive constant); it makes the objective strictly convex, and the regularization parameter $\varepsilon > 0$ balances transport cost against regularization strength.
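As a concrete check of this formulation, the regularized objective can be evaluated for any feasible coupling. The numpy sketch below uses the always-feasible independent coupling $a b^T$; the function name and the specific numbers are illustrative, not from the cited papers.

```python
import numpy as np

def entropic_cost(P, C, eps):
    """<C, P> + eps * sum_ij P_ij (log P_ij - 1) for a strictly positive coupling."""
    return np.sum(C * P) + eps * np.sum(P * (np.log(P) - 1.0))

a = np.array([0.5, 0.5])
b = np.array([0.25, 0.75])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# The independent coupling a b^T always satisfies both marginal constraints.
P = np.outer(a, b)
assert np.allclose(P.sum(axis=1), a) and np.allclose(P.sum(axis=0), b)

# All P_ij < 1 here, so the entropy term is negative and lowers the objective
# below the plain transport cost <C, P> = 0.5.
print(entropic_cost(P, C, eps=0.1))
```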

The dual problem, derived via convex duality, reads

$$\max_{f \in \mathbb{R}^n,\, g \in \mathbb{R}^m} -\langle a, f \rangle - \langle b, g \rangle - \varepsilon \sum_{i,j} e^{-(f_i + g_j) / \varepsilon} K_{ij},$$

where the "Gibbs kernel" is $K_{ij} = \exp(-C_{ij} / \varepsilon)$.

Optimality (KKT) conditions yield the "Gibbs form" of the optimal plan:

$$P_{ij} = \exp\!\left(-\frac{f_i + g_j}{\varepsilon}\right) K_{ij}.$$

The marginal constraints produce a nonlinear residual system $F(f, g) = 0$, which is smooth under entropy regularization, unlike in the nonregularized case (Brauer et al., 2017).

2. Solution Methods: Sinkhorn and Sinkhorn–Newton Algorithms

The foundational approach for solving entropic OT is the Sinkhorn–Knopp algorithm, an alternating matrix-scaling procedure derived from the dual optimality conditions:

$$u^{k+1} = \operatorname{diag}(K v^k)^{-1} a, \quad v^{k+1} = \operatorname{diag}(K^T u^{k+1})^{-1} b,$$

with the plan given by $P = \operatorname{diag}(u) K \operatorname{diag}(v)$. This method enjoys linear convergence and low per-iteration cost ($O(nm)$), as matrix–vector products are the computational bottleneck (Cuturi, 2013).
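A minimal dense implementation of this scaling loop might look as follows; this is a sketch under illustrative tolerances and problem sizes, not the reference implementation from the cited papers.

```python
import numpy as np

def sinkhorn(a, b, C, eps, tol=1e-10, max_iter=5000):
    """Sinkhorn-Knopp matrix scaling for entropic OT (dense sketch)."""
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(max_iter):
        u = a / (K @ v)               # enforce row marginals
        v = b / (K.T @ u)             # enforce column marginals
        P = u[:, None] * K * v[None, :]
        # After the v-update the column marginals are exact, so it suffices
        # to monitor the row-marginal violation.
        if np.abs(P.sum(axis=1) - a).max() < tol:
            break
    return P

rng = np.random.default_rng(0)
a = np.full(4, 0.25)
b = np.full(5, 0.2)
C = rng.random((4, 5))
P = sinkhorn(a, b, C, eps=0.25)
```

With moderate $\varepsilon$ relative to the cost scale, the loop terminates in a few hundred iterations on small problems; as $\varepsilon$ shrinks, the kernel $K$ becomes nearly degenerate and the iteration count grows.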

The Sinkhorn–Newton method generalizes this by applying a (damped) Newton step directly to $F(f, g) = 0$. The Jacobian is a structured, symmetric, positive semidefinite matrix,

$$J_F(f, g) = \frac{1}{\varepsilon} \begin{bmatrix} \operatorname{diag}(P 1_m) & P \\ P^T & \operatorname{diag}(P^T 1_n) \end{bmatrix},$$

and each iteration requires solving a linear system, commonly with conjugate gradients (CG), followed by an update of the potentials. Once in a neighborhood of the solution, the method achieves local quadratic convergence:

$$\|(f^{k+1}, g^{k+1}) - (f^*, g^*)\|_\infty \leq \omega \|(f^k, g^k) - (f^*, g^*)\|_\infty^2$$

for some explicit $\omega > 0$ (Brauer et al., 2017).

Numerical results demonstrate that for small regularization ($\varepsilon \approx 10^{-3}$), Sinkhorn–Newton dramatically outpaces classical Sinkhorn in both iteration count and wall time, especially when high-accuracy requirements dominate (Brauer et al., 2017).
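A dense Newton iteration on the residual $F(f, g)$ can be sketched as below. Several simplifications are assumptions of this sketch rather than the method of Brauer et al.: a small ridge term handles the rank-one null space of the Jacobian (spanned by $(1_n, -1_m)$), a direct `np.linalg.solve` stands in for the preconditioned CG solver used in practice, and the simple halving line search is an ad hoc damping rule.

```python
import numpy as np

def sinkhorn_newton(a, b, C, eps, max_iter=50, tol=1e-10, ridge=1e-12):
    """Damped Newton iteration on the marginal residual F(f, g) = 0 (sketch)."""
    n, m = C.shape
    f, g = np.zeros(n), np.zeros(m)

    def residual(f, g):
        P = np.exp(-(f[:, None] + g[None, :] + C) / eps)
        return P, np.concatenate([P.sum(axis=1) - a, P.sum(axis=0) - b])

    P, F = residual(f, g)
    for _ in range(max_iter):
        if np.abs(F).max() < tol:
            break
        r, c = P.sum(axis=1), P.sum(axis=0)
        # dF/d(f,g) = -(1/eps) * M, with M symmetric PSD and null vector (1, -1),
        # so the Newton step is delta = eps * M^{-1} F (ridge-regularized solve).
        M = np.block([[np.diag(r), P], [P.T, np.diag(c)]])
        delta = eps * np.linalg.solve(M + ridge * np.eye(n + m), F)
        step = 1.0
        while step > 1e-6:  # halving line search on the residual norm
            P_new, F_new = residual(f + step * delta[:n], g + step * delta[n:])
            if np.abs(F_new).max() < np.abs(F).max():
                break
            step /= 2
        f, g = f + step * delta[:n], g + step * delta[n:]
        P, F = P_new, F_new
    return f, g, P

rng = np.random.default_rng(0)
a, b = np.full(5, 0.2), np.full(4, 0.25)
C = rng.random((5, 4))
f, g, P = sinkhorn_newton(a, b, C, eps=0.2)
```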

3. Theoretical Guarantees: Existence, Uniqueness, and Convergence

The entropically regularized OT problem is strictly convex and always admits a unique smooth solution for any strictly positive marginals. The dual potentials are unique up to an additive constant, and the Gibbs form ensures positivity of the plan entries. Under mild assumptions (full support, a reasonable cost matrix), both the primal and the dual attain their optima.

As $\varepsilon \to 0$, the entropic optimal plan converges (in the narrow topology) to a classical Kantorovich optimizer. When the unregularized problem is non-unique (e.g., with non-strictly convex costs such as $\|x - y\|$), the entropic approximation selects a unique plan characterized by an additional entropy-minimization criterion on each transport ray, as established in the entropic selection principle (Aryan et al., 22 Feb 2025).

For the quadratic cost, the control of approximation bias and regularization error is rigorous: one observes

$$0 \leq W_\varepsilon(P,Q) - W_0(P,Q) \leq 2d\,\varepsilon\,\log\left(\frac{8 e^2 R^2}{\sqrt{d}\,\varepsilon}\right)$$

and the error decays as $O(\varepsilon \log(1/\varepsilon))$ for compact supports (Bigot et al., 2022).
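This vanishing-bias behavior is easy to observe numerically. The toy example below (symmetric two-point marginals, where the unregularized OT cost is exactly 0) runs a plain Sinkhorn loop; note that in such a discrete, nondegenerate case the gap in fact decays even faster than the generic $O(\varepsilon \log(1/\varepsilon))$ bound quoted above, which is for the continuous quadratic-cost setting.

```python
import numpy as np

def entropic_ot_cost(a, b, C, eps, iters=2000):
    """Transport cost <C, P_eps> of the Sinkhorn solution (dense sketch)."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return np.sum(C * P)

a = b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])      # unregularized optimum: identity plan, cost 0

gaps = [entropic_ot_cost(a, b, C, eps) for eps in (0.5, 0.2, 0.1)]
# The entropic cost is biased upward, and the bias shrinks as eps decreases.
print(gaps)
```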

4. Statistical and Computational Implications

Entropic regularization enables profound gains in computational efficiency by transforming an LP with $O(n^3)$ worst-case complexity into a strictly convex problem solvable via matrix scaling with overall complexity $O(n^2 / \varepsilon^2)$, or even less in practice for moderate $\varepsilon$ (Cuturi, 2013).

From a statistical viewpoint, entropic OT acts as a smoothing operator for Wasserstein estimators. As established in (Bigot et al., 2022):

  • The variance of plug-in estimators decreases substantially compared to classical OT.
  • The balance of approximation and estimation errors yields minimax-type rates $n^{-2/d}$ for appropriate choices $\varepsilon = n^{-1/(d+2)}$.
  • Empirical studies show similar estimation accuracy can be achieved at significantly lower computational cost, e.g., 5–10× speedups at moderate $n$ and $d$.
  • Small $\varepsilon$ reduces bias but slows convergence, while large $\varepsilon$ improves speed and variance at the expense of higher approximation bias.

Recent analysis for Gaussian marginals provides closed-form bias and scaling guidelines, directly quantifying the statistical–computational tradeoff and justifying adaptive selection of $\varepsilon$ (Barrio et al., 2020).

5. Numerical Implementation and Practical Guidelines

The standard implementation initializes the potentials to zero (so $P^0 = K$), iteratively applies the scaling updates, and stops when the violation of the marginal constraints or the improvement in the objective drops below a tolerance.

  • The regularization parameter $\varepsilon$ should be chosen as a small fraction of a typical cost scale (e.g., the median or mean of $C_{ij}$).
  • For large-scale problems, exploit structure in $K$ (e.g., convolutional structure on grids) to reduce storage and computational cost (Brauer et al., 2017).
  • In the Sinkhorn–Newton method, use preconditioned CG linear solves with diagonal preconditioners (Brauer et al., 2017).
  • For massive $n$, dual-only CG and memory-efficient variants can be used to avoid explicit storage of coupling matrices (Brauer et al., 2017).
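As a tiny illustration of the first guideline above, one might set $\varepsilon$ from the cost matrix itself; the 5% fraction and the helper name here are assumptions for illustration, not values prescribed by the cited papers.

```python
import numpy as np

def default_epsilon(C, frac=0.05):
    """Heuristic: set eps to a small fraction of the median cost entry."""
    return frac * np.median(C)

rng = np.random.default_rng(1)
C = rng.random((100, 120))
eps = default_epsilon(C)
K = np.exp(-C / eps)   # Gibbs kernel; entries stay strictly positive
```

Tying $\varepsilon$ to the cost scale keeps the exponents $C_{ij}/\varepsilon$ in a numerically safe range, so the kernel neither underflows to zero nor degenerates to the all-ones matrix.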

The following table summarizes iteration complexity and convergence behavior of the two principal methods:

Algorithm | Per-iteration cost | Convergence
Sinkhorn–Knopp | $O(nm)$ | linear; $O(1/\varepsilon^2)$ iterations
Sinkhorn–Newton | $O(n^2)$ (CG linear solve) | locally quadratic

Quadratic convergence is observed for the Sinkhorn–Newton method once iterates are close to the solution and under nondegeneracy conditions on the optimal coupling (Brauer et al., 2017).

Practical implementations should terminate when $\max\{\|P^k 1_m - a\|_\infty,\, \|(P^k)^T 1_n - b\|_\infty\}$ or the objective increment falls below a pre-specified tolerance (Brauer et al., 2017).

6. Robustness, Scaling, and Mesh-Independence

Empirical analysis demonstrates that the Sinkhorn–Newton method is robust with respect to mesh size in discretized OT problems:

  • On 1D grids with up to $n = 8000$ points, the number of CG steps per Newton iteration remains almost constant.
  • The overall computation time grows as $O(n^2)$, determined mainly by dense matrix–vector multiplies, showing mesh-independence of the convergence once the system size is large enough (Brauer et al., 2017).
  • For very small $\varepsilon$, both Sinkhorn–Knopp and Sinkhorn–Newton require more iterations, but Sinkhorn–Newton's quadratic convergence allows it to outperform classical scaling in high-accuracy or low-regularization settings.

A plausible implication is that the Sinkhorn–Newton method is particularly advantageous in applications where $\varepsilon$ must be small for fidelity reasons, or where high-precision solutions are demanded.

7. Broader Impacts and Selection Principle

Entropic regularization does more than approximate classical OT: it canonically selects among multiple optimal plans in settings with nonunique solutions, picking the one maximizing entropy relative to a carefully constructed reference measure on each transport ray (Aryan et al., 22 Feb 2025). The resulting plan is unique, admits a precise variational characterization, and sharpens the understanding of the zero-noise limit and selection phenomena in OT.

Furthermore, this regularization is foundational for modern scalable OT computations, forming the basis of neural and stochastic estimator pipelines, statistical learning applications, and high-dimensional inference tasks (Wang et al., 2024, Bigot et al., 2022, Barrio et al., 2020).


References

  • "A Sinkhorn-Newton method for entropic optimal transport" (Brauer et al., 2017)
  • "Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances" (Cuturi, 2013)
  • "On the potential benefits of entropic regularization for smoothing Wasserstein estimators" (Bigot et al., 2022)
  • "Entropic Selection Principle for Monge’s Optimal Transport" (Aryan et al., 22 Feb 2025)
  • "The statistical effect of entropic regularization in optimal transportation" (Barrio et al., 2020)
