
Entropy-Regularized Linear Programming

Updated 20 November 2025
  • The paper introduces an entropy-regularized LP that adds a negative Shannon entropy penalty, smoothing the polyhedral feasible region and making the objective strictly convex.
  • It establishes exponential convergence to LP optima with explicit non-asymptotic error bounds, proved via a weak-convexity property of the entropy function.
  • The approach underpins scalable algorithms (e.g., Sinkhorn iteration) for optimal transport and machine learning, with sharp trade-offs between accuracy and runtime.

An entropy-regularized linear programming approach augments a standard linear program (LP) with a negative Shannon entropy penalty. This method smooths the polyhedral feasible region, leads to strictly convex objectives, and critically underpins the scalability of algorithms for optimal transport and related large-scale optimization in machine learning. At its core, an entropy penalty enables exponentially fast, quantifiable convergence to LP optima while admitting algorithmic strategies (e.g., Sinkhorn iteration) with favorable computational and parallelization properties. This framework also provides non-asymptotic explicit error bounds, elucidates sharp trade-offs between accuracy and computational effort, and demonstrates fundamental limits on the achievable complexity for certain combinatorial LPs such as the assignment problem (Weed, 2018).

1. Classical Linear Programs and Entropic Penalties

A standard LP in minimization form is

$$\text{(LP)}\qquad \min_{x \geq 0} \; c^\top x \quad \text{subject to} \quad A x = b,$$

where $P := \{x \geq 0 : A x = b\}$ is assumed bounded and nonempty, and $c^\top x$ is not constant on $P$. The entropy-regularized variant introduces a negative Shannon entropy penalty,

$$\mathrm{ent}(x) := \sum_i x_i \log(1/x_i),$$

with regularization parameter $\eta > 0$, transforming the objective to

$$\text{(Pen)}\qquad \min_{x} \; F_\eta(x) := c^\top x - \eta^{-1}\,\mathrm{ent}(x) \quad \text{subject to} \quad A x = b.$$

As $\eta \to \infty$, the penalty vanishes, recovering the original LP. For moderate $\eta$, strong convexity of the entropic term facilitates efficient algorithms, notably the Sinkhorn method.
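When $P$ is the probability simplex, (Pen) has a closed-form solution: the minimizer of $c^\top x - \eta^{-1}\,\mathrm{ent}(x)$ over $\{x \geq 0 : \sum_i x_i = 1\}$ is the Gibbs distribution $x_i \propto e^{-\eta c_i}$. A minimal numerical sketch (illustrative, not from the paper) showing the penalized solution approaching the LP optimum as $\eta$ grows:

```python
import numpy as np

def gibbs_minimizer(c, eta):
    """Closed-form minimizer of F_eta(x) = c.x - (1/eta) ent(x)
    over the probability simplex: x_i proportional to exp(-eta * c_i)."""
    z = np.exp(-eta * (c - c.min()))  # shift by min(c) for numerical stability
    return z / z.sum()

c = np.array([0.0, 0.3, 1.0])        # LP optimum is the vertex e_0, value 0
for eta in (1.0, 5.0, 25.0):
    x = gibbs_minimizer(c, eta)
    gap = c @ x - c.min()            # suboptimality under the original cost
    print(f"eta={eta:5.1f}  gap={gap:.2e}")
```

The printed gap shrinks roughly like $\exp(-\eta\,\Delta/R_1)$; here $\Delta = 0.3$ and $R_1 = 1$.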

2. Quantitative Error Bounds and Exponential Convergence

Let $f^* = \min_{x \in P} c^\top x$, let $x^\eta$ denote the minimizer of $F_\eta$ over $P$, and set $f_\eta^* := c^\top x^\eta$, the original objective evaluated at the penalized solution. To relate the entropic and original optima, define the suboptimality gap $\Delta := \min\{c^\top v - c^\top v^* : v \in \mathrm{Vertices}(P) \text{ suboptimal},\ v^* \text{ an optimal vertex}\}$, the $\ell_1$-radius $R_1 := \max_{x \in P}\|x\|_1$, and the entropic radius $R_H := \max_{x, y \in P}(\mathrm{ent}(x) - \mathrm{ent}(y))$.

A non-asymptotic convergence theorem establishes

$$f_\eta^* - f^* \leq C \exp(-\kappa \eta), \qquad \kappa = \Delta/R_1, \quad C = \Delta \exp\big((R_1 + R_H)/R_1\big),$$

valid for any LP. Explicitly, if $\eta \geq (R_1 + R_H)/\Delta$,

$$f_\eta^* - f^* \leq \Delta \exp\!\left[-\eta\,(\Delta/R_1) + (R_1 + R_H)/R_1\right].$$

The proof decomposes $x^\eta$ as a convex combination of optimal and suboptimal vertices and uses weak convexity properties of entropy. Notably, the exponential rate is optimal, and matching lower bounds exist: for a rescaled simplex with $P = \{x \geq 0 : \sum_i x_i = \beta\}$, $c_0 = 0$, and $c_1 = \cdots = c_{d-1} = \alpha$, the rate $\exp(-\eta \Delta/R_1)$ is tight up to constants. No improvement is possible in the dependencies on $\Delta$, $R_1$, and $R_H$ (Weed, 2018).
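The explicit bound is easy to check numerically on the rescaled-simplex example, where the penalized minimizer is available in closed form ($x^\eta = \beta \cdot \mathrm{softmax}(-\eta c)$) and the constants work out to $\Delta = \alpha\beta$, $R_1 = \beta$, $R_H = \beta \log d$. A small sanity check (illustrative, not from the paper):

```python
import numpy as np

d, beta, alpha = 4, 2.0, 0.5
c = np.full(d, alpha)
c[0] = 0.0                                   # c_0 = 0, all other costs alpha

Delta, R1, RH = alpha * beta, beta, beta * np.log(d)

def gap(eta):
    """Suboptimality c.x^eta - f* on the rescaled simplex,
    where x^eta = beta * softmax(-eta * c) and f* = 0."""
    z = np.exp(-eta * c)
    x = beta * z / z.sum()
    return c @ x

for eta in (5.0, 10.0, 20.0):                # all satisfy eta >= (R1+RH)/Delta
    bound = Delta * np.exp(-eta * Delta / R1 + (R1 + RH) / R1)
    print(f"eta={eta:5.1f}  gap={gap(eta):.3e}  bound={bound:.3e}")
```

In each case the observed gap sits below the theoretical bound and halves its logarithm at the predicted rate $\Delta/R_1$ per unit of $\eta$.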

3. Limitation: Assignment Problem and Complexity Barriers

Consider the $n \times n$ assignment (minimum-cost perfect matching) LP:

$$\min_{X \geq 0} \langle C, X \rangle \quad \text{subject to} \quad X\mathbf{1} = \mathbf{1}, \; X^\top \mathbf{1} = \mathbf{1}.$$

The feasible region is the Birkhoff polytope, for which $R_1 = n$, $R_H = n\log n$, and $\Delta \geq 1$ when the costs are integral. The exponential convergence theorem implies that to reach $\epsilon$-objective accuracy, one must set

$$\eta \gtrsim n \log(1/\epsilon) + n(1 + \log n) = O(n \log(n/\epsilon)).$$

Sinkhorn-type algorithms require $O(n^2 \|C\|_\infty \eta)$ time per run, resulting in a total complexity of $O(n^3 \log(n/\epsilon))$, which precludes a near-linear-time ($O(n^2)$) approximation scheme for the assignment problem by entropy-regularized means alone. Furthermore, if $\eta \ll n$, recovery of even a constant-factor approximate assignment is impossible (Weed, 2018).
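For concreteness, the Sinkhorn iteration referenced above alternately rescales the rows and columns of the Gibbs kernel $K = e^{-\eta C}$ to match the unit marginals. A minimal sketch with illustrative parameter choices (the paper does not prescribe this implementation):

```python
import numpy as np

def sinkhorn_assignment(C, eta, iters=500):
    """Approximately solve min <C, X> s.t. X1 = 1, X^T 1 = 1, X >= 0
    by entropic regularization: alternate row/column rescaling of
    the Gibbs kernel K = exp(-eta * C)."""
    n = C.shape[0]
    K = np.exp(-eta * (C - C.min()))   # shift for numerical stability
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(iters):
        u = 1.0 / (K @ v)              # enforce unit row sums
        v = 1.0 / (K.T @ u)            # enforce unit column sums
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])
X = sinkhorn_assignment(C, eta=20.0)
print(np.round(X, 3))                  # close to the identity permutation
```

With large $\eta$ the returned doubly stochastic matrix concentrates near the optimal permutation; with small $\eta$ it stays diffuse, which is the recovery failure the theory predicts.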

4. Methodological Components and Key Lemmas

The analysis leverages several fundamental properties of the entropy function. Key results include:

  • Weak convexity: For any nonnegative $x, y$ and $\lambda \in [0, 1]$,

$$\mathrm{ent}(\lambda x + (1-\lambda) y) \leq \lambda\,\mathrm{ent}(x) + (1-\lambda)\,\mathrm{ent}(y) + \max\{\|x\|_1, \|y\|_1\}\, h(\lambda),$$

with $h(\lambda) = -\lambda \log \lambda - (1-\lambda)\log(1-\lambda)$ the binary entropy.

  • Monotonicity and scalar bounds on the binary entropy facilitate the derivation of sharp fixed-point bounds for the convex combination weights in the optimality analysis.
  • The analysis holds uniformly for arbitrary LPs, not just for specific instances such as transport polytopes.
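The weak-convexity inequality is straightforward to verify numerically; a quick randomized check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def ent(x):
    """Shannon entropy ent(x) = sum_i x_i log(1/x_i), with 0 log 0 = 0."""
    x = np.asarray(x, dtype=float)
    return -np.sum(np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0))

def h(lam):
    """Binary entropy h(lam)."""
    if lam in (0.0, 1.0):
        return 0.0
    return -lam * np.log(lam) - (1 - lam) * np.log(1 - lam)

for _ in range(1000):
    x = rng.random(5) * 3.0             # arbitrary nonnegative vectors
    y = rng.random(5) * 3.0
    lam = rng.random()
    lhs = ent(lam * x + (1 - lam) * y)
    rhs = (lam * ent(x) + (1 - lam) * ent(y)
           + max(x.sum(), y.sum()) * h(lam))
    assert lhs <= rhs + 1e-12           # the weak-convexity bound
print("weak convexity verified on 1000 random instances")
```

Note that since entropy is concave, the left side always dominates the convex combination of entropies; the lemma states that the excess is at most $\max\{\|x\|_1, \|y\|_1\}\,h(\lambda)$.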

These structural insights underpin both the exponential convergence rate and the necessity for large regularization in combinatorially complex LPs.

5. Practical Implications for Machine Learning and Optimal Transport

In large-scale optimal transport (OT) and machine learning, entropy regularization via the Sinkhorn algorithm is widely adopted for computational expedience on GPUs and other parallel hardware. However, exact objective accuracy requires

$$\eta = (R_1/\Delta)\log(\Delta/\epsilon) + (R_1 + R_H)/\Delta.$$

For OT problems of size $n$, typically $R_1 = 1$, $\Delta \approx 1/n$, and $R_H \approx \log n$, whence $\eta \approx n \log(n/\epsilon)$ is required. Small $\eta$ yields fast but biased solutions; the bias decays only at rate $\exp(-\eta \Delta/R_1) \approx \exp(-\eta/n)$, so meaningful bias reduction requires $\eta \gtrsim n$. There is thus a trade-off: computationally efficient but approximate solutions when $\eta$ is modest, and high-precision solutions only at high computational cost. In coarse ML applications where approximate distances suffice, moderate entropic regularization is typically acceptable. For exact recovery or fine-grained OT, the exponential rate in $\eta$ governs the achievable bias (Weed, 2018).
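The prescription above translates directly into a parameter-selection rule; a small helper (hypothetical, for illustration) computing the $\eta$ required for a target accuracy:

```python
import numpy as np

def required_eta(R1, RH, Delta, eps):
    """eta guaranteeing eps-objective accuracy per the bound:
    eta = (R1/Delta) log(Delta/eps) + (R1 + RH)/Delta."""
    return (R1 / Delta) * np.log(Delta / eps) + (R1 + RH) / Delta

# Typical OT instance of size n: R1 = 1, Delta ~ 1/n, RH ~ log n
n, eps = 1000, 1e-6
eta = required_eta(R1=1.0, RH=np.log(n), Delta=1.0 / n, eps=eps)
print(f"n={n}: required eta ~ {eta:.0f}")   # on the order of n log(n/eps)
```

The two terms make the trade-off explicit: the first scales the accuracy demand by the condition number $R_1/\Delta$, while the second is the fixed entry price $(R_1 + R_H)/\Delta$ set by the polytope's geometry.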

A critical lesson is that the entropic radius $R_H$ and the data-dependent condition number $R_1/\Delta$ jointly determine the optimal choice of $\eta$. Future work aims to precisely estimate the spectrum of near-optimal values (the distribution of $\Delta$ and $R_H$) for adaptive parameter tuning.

6. Summary and Outlook

The entropy-regularized LP approach yields exponentially fast, fully explicit convergence to LP optima across arbitrary problem instances, supported by sharp upper and lower bounds. There are foundational limitations for combinatorial LPs—for instance, the assignment problem cannot be solved in near-linear time by entropic smoothing alone. Nonetheless, the method underpins scalable algorithms for large-scale OT and machine learning, where practical trade-offs between accuracy and runtime must be balanced by tuning the regularization parameter in light of intrinsic geometric characteristics of the LP feasible region (Weed, 2018).
