K-OT Regularization in Optimal Transport
- K-OT regularization is a method that incorporates convex constraints into the classical Kantorovich OT framework to enhance tractability and control sample complexity.
- It employs strategies like entropic, quadratic, and structural regularization to influence the solution’s sparsity, smoothness, and convergence speed.
- The approach is widely applied in domain adaptation, generative modeling, and metric learning, with algorithms such as Sinkhorn iterations and mirror descent.
Kantorovich Optimal Transport (K-OT) Regularization is a collection of methodologies that introduce structural regularization terms or constraints into the classical Kantorovich optimal transport framework. These regularizations serve to ensure tractability, overcome statistical and computational obstacles, promote desired plan properties (such as sparsity or smoothness), and tightly control the sample complexity of Wasserstein distance estimation, especially in high-dimensional regimes. K-OT regularization underlies widely used solvers and enables extensions to new application domains in data science and machine learning.
1. The Classical Kantorovich OT Framework
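Classical (unregularized) Kantorovich OT seeks a transport plan $\pi$ between two probability measures $\mu$ and $\nu$ on Polish spaces $X$ and $Y$, minimizing a cost functional: $$\mathrm{OT}_c(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times Y} c(x,y)\,d\pi(x,y),$$ where $\Pi(\mu,\nu)$, the set of couplings with marginals $\mu$ and $\nu$, enforces the marginal constraints. For $c(x,y) = \|x-y\|^2$ on $\mathbb{R}^d$, the (squared) $2$-Wasserstein distance arises. Solutions can be highly singular (plans supported on low-dimensional sets), and empirical approximation suffers from the curse of dimensionality, requiring a number of samples that grows exponentially with the dimension $d$ (Paty et al., 2019).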
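The dual incorporates potentials $(\varphi, \psi)$: $$\sup_{\varphi \oplus \psi \le c} \int_X \varphi\,d\mu + \int_Y \psi\,d\nu, \qquad (\varphi \oplus \psi)(x,y) = \varphi(x) + \psi(y).$$ Brenier's theorem shows that for $c(x,y) = \|x-y\|^2$ (with $\mu$ absolutely continuous), the optimal map is the gradient of a convex "Brenier potential" $f$: $T = \nabla f$.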
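A minimal sketch of the simplest instance of this framework, under assumptions that are not in the text above: two uniform empirical measures on the real line with the same number of atoms and squared-distance cost. In that special case an optimal plan can be taken to be a permutation, and the monotone (sorted) matching is optimal, which is the one-dimensional picture of a map given by the derivative of a convex function. The function names `w2_squared_sorted` and `w2_squared_bruteforce` are hypothetical; the brute-force search is included only as a cross-check for tiny inputs.

```python
import itertools
import numpy as np

def w2_squared_sorted(x, y):
    """Squared 2-Wasserstein distance between two uniform 1-D empirical
    measures with equally many atoms, via the monotone (sorted) matching."""
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean((xs - ys) ** 2))

def w2_squared_bruteforce(x, y):
    """Same quantity by exhaustive search over permutation couplings.
    Cost grows as n!, so this is only a sanity check for very small n."""
    n = len(x)
    costs = (np.mean((x - y[list(p)]) ** 2)
             for p in itertools.permutations(range(n)))
    return float(min(costs))

rng = np.random.default_rng(0)
x = rng.normal(size=6)
y = rng.normal(loc=1.0, size=6)
print(w2_squared_sorted(x, y), w2_squared_bruteforce(x, y))
```

The two printed values should agree up to floating-point rounding. Outside this one-dimensional, equal-weight setting the general problem is a linear program and no sorting shortcut applies.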
2. Regularization Strategies in Kantorovich OT
Regularization modifies the original problem by penalizing the transport plan $\pi$ (or the dual potentials) via convex functionals, constraints, or entropic terms. Key motivations are: removing singularity, accelerating algorithms, ensuring existence/uniqueness, promoting sparsity or smoothness, and improving generalization with finite samples.
2.1 Entropic Regularization
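The most widely used approach adds negative Shannon entropy: $$\min_{\pi \in \Pi(\mu,\nu)} \int c\,d\pi - \varepsilon H(\pi), \qquad \varepsilon > 0,$$ where $H$ is the Shannon entropy of the plan. This yields a strictly convex objective with a unique, diffuse solution and leads to the celebrated Sinkhorn algorithm (matrix scaling in the discrete case) (Peyré, 2025, Peyré et al., 2018, Tupitsa et al., 2022). The Gibbs kernel $K = \exp(-C/\varepsilon)$ enables fast update rules, with per-iteration cost $O(n^2)$ in the discrete $n \times n$ case: $$u \leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^\top u), \qquad \pi = \operatorname{diag}(u)\,K\,\operatorname{diag}(v),$$ where $a$ and $b$ are the marginal weight vectors, $C$ is the cost matrix, and $\oslash$ denotes elementwise division. Algorithmic and theoretical studies (Tupitsa et al., 2022, Kassraie et al., 2024) show: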
- $\varepsilon \to 0$ recovers unregularized OT (but yields ill-conditioned numerics); $\varepsilon \to \infty$ gives the maximal-entropy (independent) coupling.
- Faster convergence and improved sample complexity (polynomial rates) compared to the unregularized setting.
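The matrix-scaling iteration described above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code of any cited work: the name `sinkhorn`, the fixed iteration count, and the toy data are all hypothetical, and the kernel is formed directly, so it is only suitable when the regularization parameter `eps` is not too small relative to the costs.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iters=500):
    """Entropic OT by alternating scaling of the Gibbs kernel K = exp(-C/eps).

    a, b : marginal weight vectors (positive, each summing to 1)
    C    : cost matrix of shape (len(a), len(b))
    Returns the plan diag(u) K diag(v).
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows to match a
        v = b / (K.T @ u)    # rescale columns to match b
    return u[:, None] * K * v[None, :]

# Toy problem: five points on a line, target shifted by 0.1.
x = np.linspace(0.0, 1.0, 5)
y = x + 0.1
C = (x[:, None] - y[None, :]) ** 2
a = np.full(5, 0.2)
b = np.full(5, 0.2)
P = sinkhorn(a, b, C, eps=0.1)
print(P.sum(axis=1), P.sum(axis=0))  # both close to the marginals
print((P > 0).all())                 # the entropic plan is fully dense
```

Each iteration costs two matrix-vector products, which is where the quadratic per-iteration cost comes from. A fixed iteration count keeps the sketch short; a practical solver would stop on a marginal-violation tolerance instead.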
2.2 Other Strongly Convex Regularizers
The regularization term can be any strictly convex functional $R$: $$\min_{\pi \in \Pi(\mu,\nu)} \int c\,d\pi + \gamma R(\pi), \qquad \gamma > 0.$$ Quadratic regularization ($R(\pi) = \tfrac{1}{2}\|\pi\|_2^2$) induces sparsity in the optimal plan (Lorenz et al., 2019, Lorenz et al., 2019), unlike entropic regularization, which produces fully dense couplings. Quadratic-regularized OT admits efficient quasi-Newton and block coordinate algorithms, and its dual is available in closed form and is semismooth (piecewise linear-quadratic), leading to rapid convergence for moderate problem sizes.
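To make the sparsity of quadratic regularization concrete, here is a hedged sketch of a block coordinate (Gauss-Seidel) scheme on the dual. It assumes the discrete penalty $\tfrac{\gamma}{2}\sum_{ij}\pi_{ij}^2$, for which the plan is recovered from dual potentials as $\pi_{ij} = \max(\varphi_i + \psi_j - C_{ij}, 0)/\gamma$. The names `solve_threshold` and `quadratic_ot`, the sweep count, and the toy data are assumptions for illustration; the cited works use more refined solvers such as semismooth Newton.

```python
import numpy as np

def solve_threshold(z, target):
    """Find t such that sum_j max(t - z_j, 0) == target (target > 0).

    Sort z, try "the k smallest entries are active" for every k, and keep
    the largest k whose candidate t exceeds the k-th smallest entry.
    """
    zs = np.sort(z)
    k = np.arange(1, len(zs) + 1)
    t = (target + np.cumsum(zs)) / k
    return t[np.nonzero(t > zs)[0][-1]]

def quadratic_ot(a, b, C, gamma, n_sweeps=200):
    """Quadratically regularized OT by exact alternating dual updates."""
    phi = np.zeros(len(a))
    psi = np.zeros(len(b))
    for _ in range(n_sweeps):
        for i in range(len(a)):   # enforce row sum i equal to a[i]
            phi[i] = solve_threshold(C[i] - psi, gamma * a[i])
        for j in range(len(b)):   # enforce column sum j equal to b[j]
            psi[j] = solve_threshold(C[:, j] - phi, gamma * b[j])
    return np.maximum(phi[:, None] + psi[None, :] - C, 0.0) / gamma

x = np.linspace(0.0, 1.0, 5)
y = x + 0.1
C = (x[:, None] - y[None, :]) ** 2
a = np.full(5, 0.2)
b = np.full(5, 0.2)
P = quadratic_ot(a, b, C, gamma=0.01)
print(np.round(P, 3))
print("exact zeros:", int((P == 0).sum()))
```

On this toy problem the plan has exact zeros wherever the thresholded potential sum is negative, whereas an entropic plan on the same data would be strictly positive everywhere.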
Regularizers based on Orlicz norms (Young's functions $\Phi$) generalize this framework, include the entropy and various power-like penalties, and admit Fenchel duals and theoretical guarantees (Lorenz et al., 2019, Lorenz et al., 2020).
2.3 Regularization via Structural Constraints
Imposing explicit structural constraints further shapes the solution:
- Cardinality constraints: Control the support of $\pi$ directly (e.g., at most $k$ nonzeros per column) (Liu et al., 2022). This enables precise plan sparsity, crucial for applications to mixture-of-experts, clustering, and routing.
- Strong convexity and smoothness of the Brenier potential: By enforcing $\ell$-strong convexity and $L$-smoothness (Hessian bounded between $\ell I$ and $L I$), one regularizes the Monge map, ensuring robust estimation and controlled Lipschitz constants (Paty et al., 2019).
2.4 Divergence Regularization
Recent works interpolate between different penalties:
- Rényi. The $\alpha$-Rényi divergence regularizes OT, with $\alpha \to 1$ yielding KL/entropic regularization and $\alpha \to 0$ recovering classical OT. This interpolates smoothly between sparse and dense plans, often outperforming KL and Tsallis regularizations (Bresch et al., 2024).
- Quantum entropy. In tensor-valued OT, the von Neumann entropy regularizes the coupling; the resulting quantum Sinkhorn algorithm generalizes classical entropic regularization to the nonscalar, PSD-matrix regime (Peyré et al., 2016).
3. Dual Formulations and Algorithms
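The introduction of regularization modifies both the primal and the dual. In discrete notation, with marginal weights $a$ and $b$, cost matrix $C$, and regularization weight $\gamma > 0$: $$\min_{\pi \in \Pi(a,b)} \langle C, \pi \rangle + \gamma R(\pi)$$
$$\max_{\varphi, \psi}\ \langle \varphi, a \rangle + \langle \psi, b \rangle - \gamma R^*\!\left(\frac{\varphi \oplus \psi - C}{\gamma}\right)$$
where $R^*$ is the Fenchel conjugate of $R$ (with the constraint $\pi \ge 0$ absorbed into $R$) and $(\varphi \oplus \psi)_{ij} = \varphi_i + \psi_j$. For entropic regularization, this dual is smooth and high-dimensional, enabling the use of block coordinate ascent, Nesterov acceleration, and mirror descent (An et al., 2021, Tupitsa et al., 2022, Bresch et al., 2024). For quadratic regularization, the dual is strongly concave and admits efficient semismooth Newton and coordinate-descent solvers (Lorenz et al., 2019, Lorenz et al., 2019).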
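As a worked instance of the conjugate-based dual, consider the discrete entropic case. The notation here is assumed for illustration: $a$, $b$ are marginal weight vectors, $C$ is the cost matrix, $\varepsilon > 0$ is the regularization weight, and $\varphi$, $\psi$ are dual potentials. Take $R(\pi) = \sum_{ij} \pi_{ij}(\log \pi_{ij} - 1)$ on $\pi \ge 0$:

$$
\begin{aligned}
R^*(z) &= \sup_{\pi \ge 0} \sum_{ij} \big( z_{ij}\pi_{ij} - \pi_{ij}(\log \pi_{ij} - 1) \big) = \sum_{ij} e^{z_{ij}} \quad (\text{attained at } \pi_{ij} = e^{z_{ij}}), \\
\text{dual:}\quad & \max_{\varphi, \psi}\ \langle \varphi, a \rangle + \langle \psi, b \rangle - \varepsilon \sum_{ij} \exp\!\left( \frac{\varphi_i + \psi_j - C_{ij}}{\varepsilon} \right), \\
\partial_{\varphi_i} = 0 \ &\Longleftrightarrow\ a_i = e^{\varphi_i/\varepsilon} \sum_j e^{-C_{ij}/\varepsilon}\, e^{\psi_j/\varepsilon} \ \Longleftrightarrow\ u_i = \frac{a_i}{(K v)_i},
\end{aligned}
$$

with $u = e^{\varphi/\varepsilon}$, $v = e^{\psi/\varepsilon}$, and $K = e^{-C/\varepsilon}$. Exact block coordinate ascent on this dual therefore reproduces the Sinkhorn scaling updates. By the same computation, $R(\pi) = \tfrac{1}{2}\sum_{ij}\pi_{ij}^2$ on $\pi \ge 0$ gives $R^*(z) = \tfrac{1}{2}\sum_{ij} \max(z_{ij}, 0)^2$, which is differentiable but only piecewise quadratic, consistent with the use of semismooth Newton methods in the quadratic case.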
Structural regularizations, such as nonconvex $\ell_0$-type constraints on plan sparsity, still lead to dual/semi-dual problems that are tractable, often via closed-form projections onto the constraint set (Liu et al., 2022).
In high-dimensional continuous settings, regularization on the Brenier potential leads to quadratically constrained quadratic programs (QCQPs) that are separable across spatial clusters/block partitions (Paty et al., 2019).
4. Theoretical Guarantees and Sample Complexity
Regularization profoundly alters statistical properties:
- Existence and uniqueness: Strict convexity/strong regularization yield unique solutions even when classical OT admits multiple (possibly singular) optima (Clason et al., 2019, Lorenz et al., 2019, Lorenz et al., 2020).
- $\Gamma$-convergence: As the regularization vanishes, regularized solutions converge to the unregularized Kantorovich plan (Lorenz et al., 2020, Clason et al., 2019).
- Distortion and regularity bounds: Regularization on the Brenier map enforces Lipschitz constants, thus bounding distortion in high dimension (Paty et al., 2019).
- Sample complexity: Regularized estimators (notably, with smooth strongly convex potentials or entropy) admit polynomial (in the sample size $n$) convergence rates, avoiding the curse of dimensionality (Paty et al., 2019, Kassraie et al., 2024, Tupitsa et al., 2022).
- Interpolation properties: Rényi and Tsallis-regularized OT frameworks admit rigorous interpolation between unregularized OT and the entropy-regularized regime, with explicit control of plan structure by a divergence parameter (Bresch et al., 2024).
5. Algorithmic Implementations and Computational Complexity
K-OT regularization enables scalable algorithms:
- Sinkhorn iteration: The mainstay for entropic OT; it converges linearly and is massively parallelizable, with cost per iteration $O(n^2)$ (Peyré, 2025, Peyré et al., 2018).
- Mirror descent and accelerated methods: Useful for smooth regularizers, e.g., with quadratic penalty or Tsallis/Rényi divergence (An et al., 2021, Bresch et al., 2024).
- QCQP solvers: Required for smooth strongly-convex Brenier potential regularization; block-separable in practice (Paty et al., 2019).
- Sparse and structured plans: K-OT with sparsity constraints admits dual and semi-dual forms where the gradient computation requires only top-$k$ selection per column, which dominates the per-iteration cost (Liu et al., 2022).
- Distributed and barycentric extensions: Strong regularization enables efficient decentralized and parallel barycenter algorithms (Tupitsa et al., 2022).
Tables of regularizer properties, dual forms, and algorithmic complexity are available in the cited literature.
6. Applications and Empirical Performance
Kantorovich OT regularization is critical in modern machine learning and statistics:
- Domain adaptation: Regularized Monge map estimation achieves superior out-of-sample domain alignment and transfer accuracy (Paty et al., 2019).
- Generative modeling: Sinkhorn divergences and regularized Wasserstein losses are used widely in GANs, VAEs, and latent-variable models (Peyré, 2025).
- Sparse assignment problems: K-OT with explicit cardinality constraints excels in mixture-of-experts models and computational vision tasks (Liu et al., 2022).
- Robustness and metric learning: The ground cost-adversarial interpretation shows that regularization endows OT distances with intrinsic robustness to metric perturbations, critical for adversarial defense and distributional robustness (Paty et al., 2020).
- Quantile and ranking operations: Differentiable sorting/ranking leverages K-OT regularization for gradient-based pipelines (Cuturi et al., 2019).
- Tensor field and geometric data: Quantum-OT enables OT with matrix-valued data, powering geometric morphing, barycenters, and biomedical imaging (Peyré et al., 2016, Bercu et al., 2024).
Empirical benchmarks consistently show regularization improves both sample efficiency and algorithmic speed, with the plan's structure and regularity tunable to application needs.
7. Parameter Selection and Practical Recommendations
Regularization parameters ($\varepsilon$, $\gamma$, $\alpha$, $k$, $\ell$, or $L$ as appropriate) must be chosen to balance fidelity, numerical stability, computational speed, and statistical accuracy:
- Entropic regularization: Choose $\varepsilon$ as a small fraction of the mean transport cost; decrease it for higher fidelity, at increased computational cost (Tupitsa et al., 2022).
- Quadratic ($L^2$) regularization: Cross-validate $\gamma$ to match model complexity to sample size and desired sparsity.
- Structural bounds: For smoothness/convexity ($\ell$, $L$) or support ($k$), use prior knowledge or hold-out risk minimization (Paty et al., 2019, Liu et al., 2022).
- Rényi divergence: An intermediate order $\alpha \in (0, 1)$ yields tighter OT plans while maintaining numerical stability (Bresch et al., 2024).
- Continuation/annealing: Solve for large regularization, then gradually decrease (warm start) (Tupitsa et al., 2022, Kassraie et al., 2024).
Choice of parameters and regularizer is application-specific, with cross-validation or debiased divergences (e.g., Sinkhorn) providing principled guidance.
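The continuation strategy in the list above can be sketched as a log-domain Sinkhorn loop that runs over a decreasing schedule of regularization values and reuses the dual potentials as a warm start. This is a minimal illustration under assumptions: the names `logsumexp` and `sinkhorn_log_annealed`, the schedule, and the per-stage iteration count are hypothetical choices, not values recommended by the cited works.

```python
import numpy as np

def logsumexp(M, axis):
    """Numerically stable log(sum(exp(M))) along one axis."""
    m = M.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(M - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_log_annealed(a, b, C, eps_schedule, iters_per_eps=200):
    """Log-domain Sinkhorn with warm-started potentials across eps values.

    eps_schedule should be non-empty and decreasing; the returned plan
    corresponds to its last entry.
    """
    f = np.zeros(len(a))          # dual potential for the rows
    g = np.zeros(len(b))          # dual potential for the columns
    log_a, log_b = np.log(a), np.log(b)
    for eps in eps_schedule:      # potentials carry over between stages
        for _ in range(iters_per_eps):
            f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
            g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    eps = eps_schedule[-1]
    return np.exp((f[:, None] + g[None, :] - C) / eps)

x = np.linspace(0.0, 1.0, 5)
y = x + 0.1
C = (x[:, None] - y[None, :]) ** 2
a = np.full(5, 0.2)
b = np.full(5, 0.2)
P = sinkhorn_log_annealed(a, b, C, eps_schedule=[1.0, 0.5, 0.2, 0.1, 0.05])
print(np.round(P, 3))
print("transport cost:", float((C * P).sum()))
```

Working with potentials instead of scaling vectors avoids forming the Gibbs kernel directly, which is the usual source of underflow when the regularization is small. The transport cost of the returned plan approaches the unregularized value as the final regularization value is lowered.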
References:
- "Regularity as Regularization: Smooth and Strongly Convex Brenier Potentials in Optimal Transport" (Paty et al., 2019)
- "Quadratically regularized optimal transport" (Lorenz et al., 2019)
- "Orlicz-space regularization for optimal transport and algorithms for quadratic regularization" (Lorenz et al., 2019)
- "Interpolating between Optimal Transport and KL regularized Optimal Transport using Rényi Divergences" (Bresch et al., 2024)
- "Sparsity-Constrained Optimal Transport" (Liu et al., 2022)
- "Optimal Transport for Machine Learners" (Peyré, 2025)
- "Computational Optimal Transport" (Peyré et al., 2018)
- "Entropic regularization of continuous optimal transport problems" (Clason et al., 2019)
- "Regularized Optimal Transport is Ground Cost Adversarial" (Paty et al., 2020)
- "Quantum Optimal Transport for Tensor Field Processing" (Peyré et al., 2016)
- "Imaging with Kantorovich-Rubinstein discrepancy" (Lellmann et al., 2014)
- "Numerical Methods for Large-Scale Optimal Transport" (Tupitsa et al., 2022)
- "Efficient Optimal Transport Algorithm by Accelerated Gradient descent" (An et al., 2021)
- "Regularized estimation of Monge-Kantorovich quantiles for spherical data" (Bercu et al., 2024)
- "ENOT: Expectile Regularization for Fast and Accurate Training of Neural Optimal Transport" (Buzun et al., 2024)
- "Progressive Entropic Optimal Transport Solvers" (Kassraie et al., 2024)
- "Differentiable Ranks and Sorting using Optimal Transport" (Cuturi et al., 2019)