Entropic Regularisation in Optimization
- Entropic regularisation is a technique that adds a Shannon entropy term to an optimization objective, encouraging smooth and diffuse solutions.
- It transforms large, often degenerate problems into strictly convex ones, enabling efficient algorithms such as the Sinkhorn algorithm in optimal transport.
- The method improves solution regularity and robustness in applications ranging from statistical estimation and inverse problems to deep learning.
Entropic regularisation refers to the addition of a (Shannon) entropy term to the objective of an optimization problem. Fundamentally, this term biases solutions toward higher entropy—that is, more “diffuse” or “spread out”—by penalizing highly localized, sharply peaked, or low-entropy solutions. Over the past decade, entropic regularisation has become central in computational optimal transport, statistical machine learning, inverse problems, numerical PDEs, and robust optimization, due to its smoothing, convexifying, and algorithmically beneficial properties.
1. Mathematical Definition and General Principles
Given a constrained optimisation problem, such as linear programming, optimal transport, or statistical estimation, entropic regularisation typically transforms the objective by adding a negative entropy contribution. For a measure or density $\rho$ with respect to a base measure $\mu$, the negative (relative) entropy is

$$H(\rho \mid \mu) = \int \frac{d\rho}{d\mu} \log\frac{d\rho}{d\mu} \, d\mu,$$

where $d\rho/d\mu$ is the Radon–Nikodym derivative (with the convention $0 \log 0 = 0$). For couplings in optimal transport or distributions in statistical learning, the regularised problem reads

$$\min_{\rho \in \mathcal{C}} \; \langle c, \rho \rangle + \varepsilon \, H(\rho \mid \mu),$$

where $c$ is typically a cost, $\mathcal{C}$ the constraint set (e.g., coupling constraints or marginal constraints), and $\varepsilon > 0$ the regularisation parameter. The entropy makes the objective strictly convex in $\rho$, admits unique solutions, and often enables efficient fixed-point or matrix-scaling algorithms.
The key effect is to bias toward absolutely continuous, high-entropy (less deterministic) solutions, distinguishing it from, e.g., $\ell^1$ or $\ell^2$ regularisers.
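In the simplest finite-dimensional case, a linear objective over the probability simplex, the entropically regularised problem has the Gibbs (softmax) distribution as its unique closed-form minimiser. A minimal NumPy sketch; the function name and example values are illustrative:

```python
import numpy as np

def entropic_argmin(c, eps):
    """Minimise <c, p> + eps * sum(p * log p) over the probability simplex.

    The unique minimiser is the Gibbs/softmax distribution p_i ∝ exp(-c_i / eps).
    Subtracting the maximum before exponentiating keeps the computation stable.
    """
    z = -np.asarray(c, dtype=float) / eps
    z -= z.max()                       # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

c = np.array([1.0, 2.0, 4.0])
for eps in (10.0, 1.0, 0.1):
    print(eps, np.round(entropic_argmin(c, eps), 3))
    # smaller eps -> mass concentrates on the argmin of c
```

Large $\varepsilon$ pushes the minimiser toward the uniform (maximum-entropy) distribution, while small $\varepsilon$ concentrates mass on the cheapest index, which is exactly the entropy bias described above.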
2. Computational Optimal Transport and Sinkhorn Algorithm
Entropic regularisation fundamentally alters the structure of the optimal transport (OT) problem. Whereas classical OT is computationally expensive as a large-scale (and typically degenerate) linear program, the entropic-regularised version is strictly convex and admits efficient block coordinate ascent solvers:
Let $\mu, \nu$ be probability distributions on spaces $X$ and $Y$, and $c : X \times Y \to \mathbb{R}$ a cost. The entropic-regularised optimal transport problem reads

$$\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c \, d\pi + \varepsilon \, H(\pi \mid \mu \otimes \nu),$$

where $\Pi(\mu, \nu)$ denotes the set of couplings with marginals $\mu$ and $\nu$. This can be dualized to a log-sum-exp form and solved efficiently by the Sinkhorn algorithm (also known as iterative proportional fitting or matrix scaling). For discrete problems with dense $n \times n$ cost matrices, the computational complexity per iteration is $O(n^2)$, but each step can be highly parallelized and achieves linear (geometric) convergence in the Hilbert projective metric under mild positivity assumptions (Reshetova et al., 2021; Sturmfels et al., 2022).
The significance is profound in large-scale optimal transport, computer vision, generative modeling, and statistics, making previously intractable problems practically solvable.
3. Theoretical Analysis and Regularisation Effects
Entropic regularisation induces smoothness, strict convexity, and improved regularity of minimizers. In optimal transport, the entropy regularizer ensures unique minimizers of Schrödinger type:

$$\pi_\varepsilon(dx, dy) = \exp\!\left(\frac{f(x) + g(y) - c(x, y)}{\varepsilon}\right) \mu(dx)\, \nu(dy),$$

where $f, g$ are dual potentials (Schrödinger potentials). This structure supports both theoretical and computational advances:
- Regularity: The minimizer is analytic whenever the cost and marginals are smooth, as shown in the large-scale regularity theory (Gvalani et al., 13 Jan 2025), and in global regularity results for quadratic costs (Gozlan et al., 20 Jan 2025).
- $\Gamma$-convergence: As $\varepsilon \to 0$, the entropic problem $\Gamma$-converges to the unregularised Kantorovich problem, and the minimizers converge accordingly (Clason et al., 2019).
- Selection Principle: In cases where the unregularised problem has multiple solutions, entropic regularisation selects the most “diffuse” among all minimizers, as in the 1D Monge problem (Marino et al., 2017).
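The selection principle has a simple finite-dimensional analogue: when a linear objective over the simplex has tied minimal costs, the entropic minimiser $p_i \propto \exp(-c_i/\varepsilon)$ tends, as $\varepsilon \to 0$, to the uniform distribution over the tied set, the most diffuse of all minimisers. A small illustration with arbitrary values:

```python
import numpy as np

# The objective <c, p> over the simplex has many minimisers when costs tie;
# the entropic minimiser converges to uniform weight on the tied argmin set.
c = np.array([1.0, 1.0, 2.0])          # two tied minimal costs
eps = 1e-3
z = -c / eps
z -= z.max()                            # stabilise the exponentials
p = np.exp(z)
p /= p.sum()
print(np.round(p, 6))                   # -> approximately [0.5, 0.5, 0.0]
```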
In statistical learning (e.g., GANs), entropic regularisation induces low-rank structure (soft-thresholding of eigenvalues), and can bias learned generators toward sparse solutions (Reshetova et al., 2021).
4. Algorithmic and Numerical Implications
The main computational benefit is tractability: entropic regularisation transforms hard combinatorial problems into smooth convex optimizations solvable by iterative scaling. The Sinkhorn algorithm and its block-coordinate or accelerated variants (e.g., Greenkhorn) dominate modern practice in large-scale OT (Sturmfels et al., 2022).
For more complex problems (e.g., unbalanced transport, robust OT), entropic regularisation can be combined with marginal or joint entropy penalties, leading to strongly convex objectives and increased robustness to noise or outliers (Dahyot et al., 2019; Buze et al., 2023). In quantum settings, the entropic regulariser is extended to the von Neumann entropy, and quantum Sinkhorn-type algorithms converge under operator scaling (Portinale, 2023).
In deep learning, local or weight space entropic regularisation smooths the loss landscape, encouraging exploration of robust or flat minimizers (Musso, 2021). In adversarial training, entropic regularisation over the space of input perturbations replaces the hard inner maximization by a soft Gibbs expectation, biasing the model toward “robust valleys” (Jagatap et al., 2020).
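The replacement of the hard inner maximization by a soft Gibbs expectation can be illustrated with a log-sum-exp smoothing of the worst-case loss over sampled perturbations (a generic sketch, not the specific estimator of the cited work):

```python
import numpy as np

def soft_maximum(losses, eps):
    """Entropic smoothing of max: eps * log mean(exp(loss / eps)).

    As eps -> 0 this recovers max(losses); larger eps weights all
    near-maximal perturbations (a Gibbs expectation) instead of the
    single worst case. Max-shifting avoids overflow.
    """
    z = np.asarray(losses, dtype=float) / eps
    m = z.max()
    return eps * (m + np.log(np.mean(np.exp(z - m))))

losses = np.array([0.2, 0.5, 1.0, 0.95])   # losses at sampled perturbations
print(soft_maximum(losses, 1e-3))          # close to the hard max
print(soft_maximum(losses, 1.0))           # between the mean and the max
```

By Jensen's inequality the soft maximum always lies between the average and the worst-case loss, which is what biases training toward "robust valleys" rather than isolated worst-case spikes.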
5. Extensions across Domains
a. Unbalanced and Generalized OT
Entropic regularisation is extended to unbalanced optimal transport, where the marginal constraints are softened and mass can be created or destroyed. Two paradigms exist—“X-space” and “Y-space”—with the latter preferred due to superior convergence and dynamic properties (Buze et al., 2023; Sturmfels et al., 2022).
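A common discrete instance replaces the hard marginal constraints with KL penalties of strength $\lambda$, for which the Sinkhorn updates acquire a softening exponent $\tau = \lambda/(\lambda + \varepsilon)$. A hedged NumPy sketch (illustrative names; see the cited works for the precise formulations they analyze):

```python
import numpy as np

def unbalanced_sinkhorn(mu, nu, C, eps, lam, n_iter=500):
    """Sinkhorn-type scaling for KL-unbalanced entropic OT.

    Hard marginal constraints are replaced by penalties
    lam * KL(P @ 1 | mu) + lam * KL(P.T @ 1 | nu); each scaling update is
    raised to tau = lam / (lam + eps), and tau -> 1 (lam -> infinity)
    recovers the balanced algorithm.
    """
    K = np.exp(-C / eps)
    tau = lam / (lam + eps)
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):
        u = (mu / (K @ v)) ** tau
        v = (nu / (K.T @ u)) ** tau
    return u[:, None] * K * v[None, :]

mu = np.array([0.6, 0.4])
nu = np.array([0.3, 0.3])                 # total masses differ: 1.0 vs 0.6
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = unbalanced_sinkhorn(mu, nu, C, eps=0.1, lam=1.0)
print(np.round(P, 3), P.sum())            # transported mass lies between 0.6 and 1.0
```

Because neither marginal is enforced exactly, the plan destroys some of $\mu$'s excess mass, and the total transported mass settles between the two input masses.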
b. Regularisation in Variational and PDE Methods
In gradient flow and PDEs, entropic regularisation enables the extension of JKO schemes to non-gradient systems (Adams et al., 2021). It also permits the incorporation of complex nonlinearities and degenerate diffusions, and the solution of variational incremental problems in ill-posed inverse settings (Burger et al., 2019).
c. Machine Learning and Signal Processing
Entropic regularisation is integral to:
- GANs: For OT- or Sinkhorn GANs, entropy penalization both regularizes and accelerates learning, overcoming the curse of dimensionality and stabilizing gradients (Reshetova et al., 2021).
- Feature Diversification: Regularisation of feature-space entropy prevents collapse and improves retention of “fine-grain” information (Baena et al., 2022).
- Policy Optimization in RL: KL-divergence or more general f-divergence based penalties provide stable, trust-region style policy improvements, tightly connected to actor–critic architectures (Belousov et al., 2019; Islam et al., 2019).
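For a single state with a finite action set, the KL-penalised policy improvement step has a closed form that makes the trust-region behaviour explicit. A minimal sketch (names and values are illustrative; practical actor–critic implementations differ):

```python
import numpy as np

def kl_regularised_policy_update(pi_old, q_values, eta):
    """Solve max_pi <pi, Q> - eta * KL(pi || pi_old) over action distributions.

    Closed form: pi_new(a) ∝ pi_old(a) * exp(Q(a) / eta), a trust-region-style
    step whose size is controlled by the temperature eta.
    """
    z = np.asarray(q_values, dtype=float) / eta
    z -= z.max()                         # stabilise the exponentials
    w = pi_old * np.exp(z)
    return w / w.sum()

pi_old = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([1.0, 2.0, 0.5, 1.5])
print(np.round(kl_regularised_policy_update(pi_old, q, eta=1.0), 3))
```

Large eta keeps the new policy close to the old one (small step); small eta approaches the greedy argmax policy, trading stability for improvement speed.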
6. Analytical, Geometric, and Algorithmic Perspectives
On the geometric side, entropic regularisation leads to a toric geometric viewpoint: solutions trace an “entropic path” on the intersection of polytopes and scaled toric varieties, interpolating between the Birch (maximum entropy) point and the LP optimum as $\varepsilon$ decreases (Sturmfels et al., 2022). In noncommutative settings, quantum entropic regularisation admits analogous duality and scaling structures, albeit with operator-theoretic complexities (Portinale, 2023).
Entropic regularisation is also a key tool for smoothing or “thermalising” rate-independent processes, introducing effective, strictly convex dissipation and unique timescales (Sullivan et al., 2012). The interplay between smoothing, convexity, and algorithmic accessibility is a recurrent theme across domains.
7. Parameter Selection, Limit Behavior, and Open Challenges
The regularisation parameter $\varepsilon$ (or the temperature in Gibbs measures) controls the trade-off between fidelity and smoothing:
- $\varepsilon \to 0$ recovers the original (possibly non-unique, non-smooth) optimal solution.
- Finite $\varepsilon > 0$ yields smooth, regularized, and computationally tractable problems at the expense of an entropic bias.
- In applications, $\varepsilon$ is tuned to balance bias, variance, and computational efficiency, and must be chosen with care to avoid over-smoothing or numerical underflow, especially in large-scale or ill-conditioned problems (Clason et al., 2019; Adams et al., 2021).
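The underflow risk is easy to demonstrate: in double precision the Gibbs kernel $\exp(-C/\varepsilon)$ flushes to zero once $C/\varepsilon$ exceeds roughly 745, which is why small-$\varepsilon$ computations are carried out in the log domain. A quick check:

```python
import numpy as np

C = np.array([[0.0, 1.0], [1.0, 0.0]])
for eps in (1.0, 0.01, 1e-4):
    K = np.exp(-C / eps)              # naive Gibbs kernel
    print(eps, K.min())               # off-diagonal entries hit 0.0 for small eps

# Log-domain alternative: work with -C/eps directly and combine terms via a
# max-shifted log-sum-exp, which stays finite for any eps.
a = -C / 1e-4
m = a.max()
stable = m + np.log(np.exp(a - m).sum())
print(stable)                         # log of the kernel's total mass, finite
```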
Open challenges include theoretically optimal parameter selection, extension beyond Shannon entropy (e.g., to Rényi or Tsallis), and understanding entropic regularisation in the infinite-dimensional or nonconvex regime.
Entropic regularisation has proven to be a foundational technique for rendering high-dimensional optimization problems computationally feasible, while also ensuring unique solutions with desirable analytic and statistical properties. Its unifying role across optimal transport, inverse problems, machine learning, and dynamical systems continues to drive both theoretical and practical advances in applied mathematics and computational sciences.