
Entropy-Regularized Linear Programming

Updated 20 November 2025
  • The paper introduces an entropy-regularized LP that adds a negative Shannon entropy penalty, smoothing the polyhedral feasible region and making the objective strictly convex.
  • It establishes exponential convergence to LP optima with explicit non-asymptotic error bounds, proved via a weak-convexity property of the entropy function.
  • The approach underpins scalable algorithms (e.g., Sinkhorn iteration) for optimal transport and machine learning, with sharp trade-offs between accuracy and runtime.

An entropy-regularized linear programming approach augments a standard linear program (LP) with a negative Shannon entropy penalty. This method smooths the polyhedral feasible region, leads to strictly convex objectives, and critically underpins the scalability of algorithms for optimal transport and related large-scale optimization in machine learning. At its core, an entropy penalty enables exponentially fast, quantifiable convergence to LP optima while admitting algorithmic strategies (e.g., Sinkhorn iteration) with favorable computational and parallelization properties. This framework also provides non-asymptotic explicit error bounds, elucidates sharp trade-offs between accuracy and computational effort, and demonstrates fundamental limits on the achievable complexity for certain combinatorial LPs such as the assignment problem (Weed, 2018).

1. Classical Linear Programs and Entropic Penalties

A standard LP in minimization form is

$$\text{(LP)}\qquad \min_{x \geq 0} \; c^\top x \quad \text{subject to} \quad A x = b,$$

where $P := \{x \geq 0 : A x = b\}$ is assumed bounded and nonempty, and $c^\top x$ is not constant on $P$. The entropy-regularized variant introduces a negative Shannon entropy penalty,

$$\mathrm{ent}(x) := \sum_i x_i \log(1/x_i),$$

with regularization parameter $\eta > 0$, transforming the objective to

$$\text{(Pen)}\qquad \min_{x} \; F_\eta(x) := c^\top x - \eta^{-1}\,\mathrm{ent}(x) \quad \text{subject to} \quad A x = b.$$

As $\eta \to \infty$, the penalty vanishes, recovering the original LP. For moderate $\eta$, strong convexity of the entropic term facilitates efficient algorithms, notably the Sinkhorn method.
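When $P$ is the probability simplex, (Pen) has a closed-form solution: the minimizer of $c^\top x - \eta^{-1}\,\mathrm{ent}(x)$ over $\{x \geq 0 : \sum_i x_i = 1\}$ is the Gibbs distribution $x_i \propto e^{-\eta c_i}$. A minimal numerical sketch (illustrative, not from the paper) showing the penalized solution approaching the LP optimum as $\eta$ grows:

```python
import numpy as np

def gibbs_minimizer(c, eta):
    """Closed-form minimizer of F_eta(x) = c.x - (1/eta) ent(x)
    over the probability simplex: x_i proportional to exp(-eta * c_i)."""
    z = np.exp(-eta * (c - c.min()))  # shift by min(c) for numerical stability
    return z / z.sum()

c = np.array([0.0, 0.3, 1.0])        # LP optimum is the vertex e_0, value 0
for eta in (1.0, 5.0, 25.0):
    x = gibbs_minimizer(c, eta)
    gap = c @ x - c.min()            # suboptimality under the original cost
    print(f"eta={eta:5.1f}  gap={gap:.2e}")
```

The printed gap shrinks roughly like $\exp(-\eta\,\Delta/R_1)$; here $\Delta = 0.3$ and $R_1 = 1$.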

2. Quantitative Error Bounds and Exponential Convergence

Let $f^* = \min_{x \in P} c^\top x$, let $x^\eta$ denote the minimizer of $F_\eta$ over $P$, and set $f_\eta^* := c^\top x^\eta$, the original objective evaluated at the penalized solution. To relate the entropic and original optima, define the suboptimality gap $\Delta := \min\{c^\top v - c^\top v^* : v \in \mathrm{Vertices}(P) \text{ suboptimal},\ v^* \text{ an optimal vertex}\}$, the $\ell_1$-radius $R_1 := \max_{x \in P}\|x\|_1$, and the entropic radius $R_H := \max_{x, y \in P}(\mathrm{ent}(x) - \mathrm{ent}(y))$.

A non-asymptotic convergence theorem establishes

$$f_\eta^* - f^* \leq C \exp(-\kappa \eta), \qquad \kappa = \Delta/R_1, \quad C = \Delta \exp\big((R_1 + R_H)/R_1\big),$$

valid for any LP. Explicitly, if $\eta \geq (R_1 + R_H)/\Delta$,

$$f_\eta^* - f^* \leq \Delta \exp\!\left[-\eta\,(\Delta/R_1) + (R_1 + R_H)/R_1\right].$$

The proof decomposes $x^\eta$ as a convex combination of optimal and suboptimal vertices and uses weak convexity properties of entropy. Notably, the exponential rate is optimal, and matching lower bounds exist: for a rescaled simplex with $P = \{x \geq 0 : \sum_i x_i = \beta\}$, $c_0 = 0$, and $c_1 = \cdots = c_{d-1} = \alpha$, the rate $\exp(-\eta \Delta/R_1)$ is tight up to constants. No improvement is possible in the dependencies on $\Delta$, $R_1$, and $R_H$ (Weed, 2018).
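The explicit bound is easy to check numerically on the rescaled-simplex example, where the penalized minimizer is available in closed form ($x^\eta = \beta \cdot \mathrm{softmax}(-\eta c)$) and the constants work out to $\Delta = \alpha\beta$, $R_1 = \beta$, $R_H = \beta \log d$. A small sanity check (illustrative, not from the paper):

```python
import numpy as np

d, beta, alpha = 4, 2.0, 0.5
c = np.full(d, alpha)
c[0] = 0.0                                   # c_0 = 0, all other costs alpha

Delta, R1, RH = alpha * beta, beta, beta * np.log(d)

def gap(eta):
    """Suboptimality c.x^eta - f* on the rescaled simplex,
    where x^eta = beta * softmax(-eta * c) and f* = 0."""
    z = np.exp(-eta * c)
    x = beta * z / z.sum()
    return c @ x

for eta in (5.0, 10.0, 20.0):                # all satisfy eta >= (R1+RH)/Delta
    bound = Delta * np.exp(-eta * Delta / R1 + (R1 + RH) / R1)
    print(f"eta={eta:5.1f}  gap={gap(eta):.3e}  bound={bound:.3e}")
```

In each case the observed gap sits below the theoretical bound and halves its logarithm at the predicted rate $\Delta/R_1$ per unit of $\eta$.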

3. Limitation: Assignment Problem and Complexity Barriers

Consider the $n \times n$ assignment (minimum-cost perfect matching) LP:

$$\min_{X \geq 0} \langle C, X \rangle \quad \text{subject to} \quad X\mathbf{1} = \mathbf{1}, \; X^\top \mathbf{1} = \mathbf{1}.$$

The feasible region is the Birkhoff polytope, for which $R_1 = n$, $R_H = n\log n$, and $\Delta \geq 1$ when the costs are integral. The exponential convergence theorem implies that to reach $\epsilon$-objective accuracy, one must set

$$\eta \gtrsim n \log(1/\epsilon) + n(1 + \log n) = O(n \log(n/\epsilon)).$$

Sinkhorn-type algorithms require $O(n^2 \|C\|_\infty \eta)$ time per run, resulting in a total complexity of $O(n^3 \log(n/\epsilon))$, which precludes a near-linear-time ($O(n^2)$) approximation scheme for the assignment problem by entropy-regularized means alone. Furthermore, if $\eta \ll n$, recovery of even a constant-factor approximate assignment is impossible (Weed, 2018).
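For concreteness, the Sinkhorn iteration referenced above alternately rescales the rows and columns of the Gibbs kernel $K = e^{-\eta C}$ to match the unit marginals. A minimal sketch with illustrative parameter choices (the paper does not prescribe this implementation):

```python
import numpy as np

def sinkhorn_assignment(C, eta, iters=500):
    """Approximately solve min <C, X> s.t. X1 = 1, X^T 1 = 1, X >= 0
    by entropic regularization: alternate row/column rescaling of
    the Gibbs kernel K = exp(-eta * C)."""
    n = C.shape[0]
    K = np.exp(-eta * (C - C.min()))   # shift for numerical stability
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(iters):
        u = 1.0 / (K @ v)              # enforce unit row sums
        v = 1.0 / (K.T @ u)            # enforce unit column sums
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])
X = sinkhorn_assignment(C, eta=20.0)
print(np.round(X, 3))                  # close to the identity permutation
```

With large $\eta$ the returned doubly stochastic matrix concentrates near the optimal permutation; with small $\eta$ it stays diffuse, which is the recovery failure the theory predicts.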

4. Methodological Components and Key Lemmas

The analysis leverages several fundamental properties of the entropy function. Key results include:

  • Weak convexity: For any nonnegative $x, y$ and $\lambda \in [0, 1]$,

$$\mathrm{ent}(\lambda x + (1-\lambda) y) \leq \lambda\,\mathrm{ent}(x) + (1-\lambda)\,\mathrm{ent}(y) + \max\{\|x\|_1, \|y\|_1\}\, h(\lambda),$$

with $h(\lambda) = -\lambda \log \lambda - (1-\lambda)\log(1-\lambda)$ the binary entropy.

  • Monotonicity and scalar bounds on the binary entropy facilitate the derivation of sharp fixed-point bounds for the convex combination weights in the optimality analysis.
  • The analysis holds uniformly for arbitrary LPs, not just for specific instances such as transport polytopes.
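The weak-convexity inequality is straightforward to verify numerically; a quick randomized check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def ent(x):
    """Shannon entropy ent(x) = sum_i x_i log(1/x_i), with 0 log 0 = 0."""
    x = np.asarray(x, dtype=float)
    return -np.sum(np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0))

def h(lam):
    """Binary entropy h(lam)."""
    if lam in (0.0, 1.0):
        return 0.0
    return -lam * np.log(lam) - (1 - lam) * np.log(1 - lam)

for _ in range(1000):
    x = rng.random(5) * 3.0             # arbitrary nonnegative vectors
    y = rng.random(5) * 3.0
    lam = rng.random()
    lhs = ent(lam * x + (1 - lam) * y)
    rhs = (lam * ent(x) + (1 - lam) * ent(y)
           + max(x.sum(), y.sum()) * h(lam))
    assert lhs <= rhs + 1e-12           # the weak-convexity bound
print("weak convexity verified on 1000 random instances")
```

Note that since entropy is concave, the left side always dominates the convex combination of entropies; the lemma states that the excess is at most $\max\{\|x\|_1, \|y\|_1\}\,h(\lambda)$.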

These structural insights underpin both the exponential convergence rate and the necessity for large regularization in combinatorially complex LPs.

5. Practical Implications for Machine Learning and Optimal Transport

In large-scale optimal transport (OT) and machine learning, entropy regularization via the Sinkhorn algorithm is widely adopted for computational expedience on GPUs and other parallel hardware. However, exact objective accuracy requires

$$\eta = (R_1/\Delta)\log(\Delta/\epsilon) + (R_1 + R_H)/\Delta.$$

For OT problems of size $n$, typically $R_1 = 1$, $\Delta \approx 1/n$, and $R_H \approx \log n$, whence $\eta \approx n \log(n/\epsilon)$ is required. Small $\eta$ yields fast but biased solutions; the bias decays only at rate $\exp(-\eta \Delta/R_1) \approx \exp(-\eta/n)$, so meaningful bias reduction requires $\eta \gtrsim n$. There is thus a trade-off: computationally efficient but approximate solutions when $\eta$ is modest, and high-precision solutions only at high computational cost. In coarse ML applications where approximate distances suffice, moderate entropic regularization is typically acceptable. For exact recovery or fine-grained OT, the exponential rate in $\eta$ governs the achievable bias (Weed, 2018).
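The prescription above translates directly into a parameter-selection rule; a small helper (hypothetical, for illustration) computing the $\eta$ required for a target accuracy:

```python
import numpy as np

def required_eta(R1, RH, Delta, eps):
    """eta guaranteeing eps-objective accuracy per the bound:
    eta = (R1/Delta) log(Delta/eps) + (R1 + RH)/Delta."""
    return (R1 / Delta) * np.log(Delta / eps) + (R1 + RH) / Delta

# Typical OT instance of size n: R1 = 1, Delta ~ 1/n, RH ~ log n
n, eps = 1000, 1e-6
eta = required_eta(R1=1.0, RH=np.log(n), Delta=1.0 / n, eps=eps)
print(f"n={n}: required eta ~ {eta:.0f}")   # on the order of n log(n/eps)
```

The two terms make the trade-off explicit: the first scales the accuracy demand by the condition number $R_1/\Delta$, while the second is the fixed entry price $(R_1 + R_H)/\Delta$ set by the polytope's geometry.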

A critical lesson is that the entropic radius $R_H$ and the data-dependent condition number $R_1/\Delta$ jointly determine the optimal choice of $\eta$. Future work aims to precisely estimate the spectrum of near-optimal values (the distribution of $\Delta$ and $R_H$) for adaptive parameter tuning.

6. Summary and Outlook

The entropy-regularized LP approach yields exponentially fast, fully explicit convergence to LP optima across arbitrary problem instances, supported by sharp upper and lower bounds. There are foundational limitations for combinatorial LPs—for instance, the assignment problem cannot be solved in near-linear time by entropic smoothing alone. Nonetheless, the method underpins scalable algorithms for large-scale OT and machine learning, where practical trade-offs between accuracy and runtime must be balanced by tuning the regularization parameter in light of intrinsic geometric characteristics of the LP feasible region (Weed, 2018).
