
Rot Mover’s Distance: Regularized OT

Updated 21 April 2026
  • Rot Mover’s Distance (RMD) is a generalized optimal transport metric that integrates smooth convex regularization to modulate plan smoothness and sparsity.
  • It bridges classical Earth Mover’s Distance and minimally regularized couplings by leveraging Bregman divergences and iterative projection algorithms such as ASA and NASA.
  • Empirical results, including applications in audio-scene classification, highlight RMD’s potential in enhancing OT-based kernel methods for pattern recognition.

The Rot Mover’s Distance (RMD) is a generalization of the classic Earth Mover’s Distance (EMD) within the framework of discrete optimal transport. RMD augments the standard transport problem by introducing a smooth convex regularization penalty on the joint transport plan, yielding a new class of metrics rooted in matrix nearness with respect to Bregman divergences. This construction enables interpolation between classical EMD and minimally regularized (minimal-$\Omega$) couplings, where the nature of the regularization controls plan smoothness, sparsity, or other desired structure. RMD recovers established methods such as Sinkhorn–Knopp for entropic regularization and extends to a wide spectrum of regularizers and induced divergences, with efficient algorithms tailored to the structure of each case (Dessein et al., 2016).

1. Mathematical Formulation: Primal and Dual RMD

Given probability vectors $a, b \in \Sigma_d$ in the $d$-simplex, a nonnegative cost matrix $C \in \mathbb{R}_+^{d \times d}$, and a convex, smooth regularizer $\Omega : \mathbb{R}^{d \times d} \to \mathbb{R} \cup \{+\infty\}$, RMD is formulated on the transport polytope

$$U(a,b) = \{ X \in \mathbb{R}_+^{d \times d} \mid X\mathbb{1} = a,\; X^T\mathbb{1} = b \}.$$
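
To make the feasible set concrete, here is a minimal NumPy check (an illustrative helper, not from the paper) that a candidate plan lies in $U(a,b)$:

```python
import numpy as np

# Feasibility check for the transport polytope U(a, b): nonnegative entries,
# row sums equal to a, column sums equal to b (illustrative helper).
def in_polytope(X, a, b, tol=1e-9):
    return bool((X >= -tol).all()
                and np.allclose(X.sum(axis=1), a)
                and np.allclose(X.sum(axis=0), b))

a = b = np.array([0.5, 0.5])
assert in_polytope(np.outer(a, b), a, b)   # the independent coupling a b^T
```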

  • Primal (Constrained) Formulation:

Given an "allowance" α0\alpha\geq 0 for the regularizer,

$$d'_{C,\alpha,\Omega}(a,b) = \min_{X \in U(a,b),\, \Omega(X) \leq \Omega(X^{\star}) + \alpha} \langle C, X \rangle,$$

where $X^{\star}$ solves $\min_{X \in U(a,b)} \Omega(X)$.

  • Dual (Penalized) Formulation:

Introducing a Lagrange parameter $\lambda \geq 0$,

$$d_{C,\lambda,\Omega}(a,b) = \min_{X \in U(a,b)} \langle C, X \rangle + \frac{1}{\lambda}\, \Omega(X).$$

For $\alpha$ below a threshold, there exists a unique $\lambda \geq 0$ with $d'_{C,\alpha,\Omega}(a,b) = d_{C,\lambda,\Omega}(a,b)$, so that the primal and dual minimizers coincide. Classical EMD is recovered as $\lambda \to +\infty$ (the penalty vanishes), while $\lambda \to 0^+$ yields the minimal-$\Omega$ coupling (Dessein et al., 2016).
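
For the entropic special case, the penalized problem admits a particularly light solver. The following minimal Sinkhorn–Knopp sketch (variable names and the $\lambda$ sweep are illustrative, assuming the $\frac{1}{\lambda}\Omega$ penalty above) computes the transport cost of the regularized plan:

```python
import numpy as np

# Penalized RMD for the entropic regularizer: the minimizer is a diagonal
# rescaling of the Gibbs kernel K = exp(-lam * C), found by Sinkhorn-Knopp.
def entropic_rmd(a, b, C, lam, n_iter=1000):
    K = np.exp(-lam * C)                  # unconstrained minimizer xi_lambda
    u = np.ones_like(a)
    for _ in range(n_iter):
        u = a / (K @ (b / (K.T @ u)))     # alternate row/column scalings
    v = b / (K.T @ u)
    X = u[:, None] * K * v[None, :]       # feasible plan in U(a, b)
    return float(np.sum(C * X)), X

# Sweeping lam traces the interpolation described above: small lam gives a
# smooth, spread-out plan; large lam approaches the sparse EMD plan.
a = b = np.full(5, 0.2)
C = np.abs(np.subtract.outer(np.arange(5.0), np.arange(5.0)))
for lam in (0.1, 1.0, 10.0, 100.0):
    cost, _ = entropic_rmd(a, b, C, lam)
    print(f"lam={lam:>5}: cost={cost:.4f}")
```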

2. Bregman-Projection Matrix-Nearness Equivalence

Let $\Omega$ be the Bregman-type information regularizer generated by a convex function $\phi$, either separable ($\Omega(X) = \sum_{ij} \phi(x_{ij})$) or general. The generator $\phi$ induces the Bregman divergence $B_\phi(X \| Y) = \phi(X) - \phi(Y) - \langle \nabla\phi(Y), X - Y \rangle$. The dual RMD problem equivalently minimizes this divergence over the transport polytope,

$$\min_{X \in U(a,b)} B_\phi(X \,\|\, \xi_\lambda),$$

where $\xi_\lambda = \nabla\phi^*(-\lambda C)$, with $\phi^*$ the Fenchel conjugate of $\phi$, is obtained by unconstrained minimization of $\langle C, X \rangle + \frac{1}{\lambda}\Omega(X)$. For the entropic regularizer, $\xi_\lambda = e^{-\lambda C}$ (elementwise), yielding a Kullback–Leibler projection interpretation. This general Bregman-projection framework enables the use of projection algorithms for regularized OT (Dessein et al., 2016).
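
A one-line numeric check of this equivalence for the entropic case (a sketch under the penalized form above): the stationarity condition $C + \frac{1}{\lambda}\nabla\Omega(X) = 0$ is solved exactly by $\xi_\lambda = e^{-\lambda C}$.

```python
import numpy as np

# With phi(x) = x*log(x) - x, grad phi = log, so the unconstrained minimizer
# of <C, X> + (1/lam) * Omega(X) satisfies C + (1/lam) * log(X) = 0,
# i.e. X = exp(-lam * C) = grad phi*(-lam * C).
rng = np.random.default_rng(0)
C, lam = rng.random((4, 4)), 5.0
xi = np.exp(-lam * C)
assert np.allclose(C + np.log(xi) / lam, 0.0)   # stationarity holds elementwise
```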

3. Iterative Bregman Projection Algorithms: ASA and NASA

Efficient solution of the RMD projection is based on iterative Bregman projections, with two principal algorithmic frameworks determined by the domain of the generator $\phi$:

  • Non-negative Alternate Scaling Algorithm (NASA):

Used when the domain of $\phi$ does not enforce nonnegativity ($X \geq 0$). Dykstra's algorithm augments alternate Bregman projections with correction variables, cycling through nonnegativity, row-sum, and column-sum constraints. Newton–Raphson solves the one-dimensional per-row and per-column projection equations when $\phi$ is separable. Each iteration maintains correction vectors to ensure convergence.

  • Alternate Scaling Algorithm (ASA):

Applied when the domain of $\phi$ is contained in $\mathbb{R}_+$, so nonnegativity is implicit. The method alternates between row-sum and column-sum Bregman projections with no correction variables. For separable $\phi$, updates decouple into per-row and per-column monotone equations efficiently solved by Newton–Raphson.

Both schemes generalize the classical projections-onto-convex-sets (POCS) framework and leverage the explicit structure of $\nabla\phi$ and $\nabla\phi^*$ for efficient updates (Dessein et al., 2016).
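
As a concrete instance of ASA with scalar per-row projections, the sketch below treats the Burg regularizer $\phi(x) = x - \log x$, whose domain $(0,\infty)$ makes ASA applicable. A bracketing root-finder (`brentq`) stands in for the Newton–Raphson step described in the paper; all names are illustrative:

```python
import numpy as np
from scipy.optimize import brentq

def burg_projection(xi, target):
    """Itakura-Saito Bregman projection of a positive vector xi onto sum = target.

    Stationarity gives x_j = xi_j / (1 - nu * xi_j); the residual below is
    increasing in nu on (-inf, 1/max(xi)), so bracketing always succeeds.
    """
    def residual(nu):
        return np.sum(xi / (1.0 - nu * xi)) - target
    hi = 1.0 / xi.max() - 1e-12        # nu must stay below 1/max(xi)
    lo = hi - 1.0
    while residual(lo) > 0.0:          # expand the bracket until the sign flips
        lo -= 2.0 * (hi - lo)
    nu = brentq(residual, lo, hi)
    return xi / (1.0 - nu * xi)

def asa_sweep(X, a, b):
    """One ASA iteration: project rows onto marginal a, then columns onto b."""
    X = np.stack([burg_projection(X[i], a[i]) for i in range(len(a))])
    return np.stack([burg_projection(X[:, j], b[j]) for j in range(len(b))], axis=1)

# Usage: for Burg, the unconstrained minimizer is xi_lambda = 1 / (1 + lam * C);
# iterate asa_sweep from there until the marginals converge.
```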

4. Regularizer Families and Induced Divergences

The RMD framework supports a broad gallery of convex regularizers $\Omega$ (see Table 1), each yielding a distinct Bregman divergence $B_\phi$ with associated geometric and statistical properties.

| Regularizer type | Generator $\phi(x)$ | Induced divergence $B_\phi$ |
|---|---|---|
| Entropic (KL) | $x \log x - x$ | Kullback–Leibler |
| Burg (Itakura–Saito) | $x - \log x$ | Itakura–Saito |
| Fermi–Dirac | $x \log x + (1-x)\log(1-x)$ | Fermi–Dirac (logistic) divergence |
| $\ell_p$ quasi-norms ($0 < p < 1$) | $-x^p$ | power-type, sparsity-promoting |
| $\ell_p$ norms ($p > 1$) | $x^p$ | power-type, spread-promoting |
| Euclidean ($\ell_2^2$) | $x^2 / 2$ | squared Euclidean distance |
| Hellinger-type | $-\sqrt{1 - x^2}$ | Hellinger-like divergence |
| Mahalanobis (quadratic form) | $\tfrac{1}{2}\,\mathrm{vec}(X)^T Q\, \mathrm{vec}(X)$ | Mahalanobis distance |

Notably, the framework recovers Sinkhorn–Knopp scaling for KL entropic regularization (where $\xi_\lambda = e^{-\lambda C}$ and the Newton projections reduce to diagonal matrix rescalings), but allows for fundamentally different plan structures, smoothing, and sparsification depending on the regularizer choice (Dessein et al., 2016).
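
The induced divergences in the table follow mechanically from $B_\phi(x, y) = \phi(x) - \phi(y) - \phi'(y)(x - y)$; a short sketch verifying three rows elementwise (illustrative code, not the paper's):

```python
import numpy as np

def bregman(phi, dphi, x, y):
    """Elementwise Bregman divergence induced by generator phi."""
    return phi(x) - phi(y) - dphi(y) * (x - y)

x, y = 0.3, 0.7

# Entropic phi(x) = x log x - x  ->  generalized KL: x log(x/y) - x + y
assert np.isclose(bregman(lambda t: t * np.log(t) - t, np.log, x, y),
                  x * np.log(x / y) - x + y)

# Burg phi(x) = x - log x  ->  Itakura-Saito: x/y - log(x/y) - 1
assert np.isclose(bregman(lambda t: t - np.log(t), lambda t: 1 - 1 / t, x, y),
                  x / y - np.log(x / y) - 1)

# Euclidean phi(x) = x^2 / 2  ->  squared distance: (x - y)^2 / 2
assert np.isclose(bregman(lambda t: t * t / 2, lambda t: t, x, y),
                  (x - y) ** 2 / 2)
```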

5. Empirical Properties and Algorithmic Considerations

RMD exhibits several algorithmic and empirical characteristics:

  • Interplay of $\lambda$ and $\Omega$:

Varying the regularization parameter $\lambda$ yields a continuous interpolation between sharply optimal (EMD-like) and highly regularized transport plans. The geometry, such as anisotropy or smoothness, is strongly modulated by the nature of $\Omega$.

  • Computational Complexity:

For moderate $d$, the Newton subproblems within ASA/NASA scale as $O(d)$ per row or column, i.e., $O(d^2)$ per projection sweep, with overall quadratic complexity per outer iteration. Sinkhorn–Knopp (KL) admits the fastest implementation; ASA is empirically faster than NASA due to the absence of correction variables.

  • Sparsity via Pruning:

Transport forbidden by the problem structure (for example, infinite-cost arcs) is handled by sparse extensions, which simply exclude these indices from updates without affecting correctness under broad conditions (Dessein et al., 2016), as the sketch below illustrates.
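
For the entropic case this pruning is automatic, since a forbidden (infinite-cost) arc yields a hard zero in the Gibbs kernel; a minimal illustration (the general sparse extensions in the paper exclude such indices from all updates):

```python
import numpy as np

# An infinite ground cost zeroes the corresponding kernel entry, so scaling
# updates never move mass along that arc.
lam = 2.0
C = np.array([[0.0, np.inf],
              [1.0, 0.0]])
K = np.exp(-lam * C)     # exp(-inf) == 0: arc (0, 1) is pruned
print(K)                 # [[1.0, 0.0], [exp(-2), 1.0]]
```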

6. Applications and Empirical Results

Synthetic experiments on two-mode densities demonstrate that tuning $\lambda$ and choosing different regularizers $\Omega$ yields qualitatively distinct mass redistributions, with markedly different smoothing profiles and various anisotropic effects. In audio-scene classification benchmarks (specifically, DCASE16), RMD-induced kernels—where each segment is encoded as a GMM over MFCCs, with OT ground cost given by pairwise Jeffrey divergence—achieve accuracy superior or competitive to classical EMD-based SVM kernels. The Hellinger-type and certain $\ell_p$ penalties (over a range of exponents $p$) outperform the classic Euclidean ($\ell_2^2$) or Burg (Itakura–Saito) regularizers in discriminative capacity. This suggests that fine-grained adjustment of $\lambda$ and $\Omega$ can significantly enhance OT-based kernel methods for pattern recognition and statistical tasks (Dessein et al., 2016).
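
A hedged sketch of the kernel construction used in such experiments: given any precomputed pairwise RMD matrix `D` (for example from the `entropic_rmd` sketch above), exponentiate it into a similarity kernel for an SVM. Here `gamma` and the pipeline details are illustrative placeholders; the paper's ground cost (pairwise Jeffrey divergence between GMM components) is not reproduced.

```python
import numpy as np

def rmd_kernel(D, gamma=1.0):
    """Exponential similarity kernel from a pairwise RMD/distance matrix D."""
    return np.exp(-gamma * D)

# K = rmd_kernel(D) can be fed to an SVM with a precomputed kernel, e.g.
# sklearn.svm.SVC(kernel="precomputed").fit(K, labels).
```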

7. Connections and Generalizations

RMD provides a principled interpolation and generalization over standard optimal transport, embedding classic EMD, entropic regularization (Sinkhorn), and other divergences in a single algorithmic and theoretical scaffold. The Bregman-projection viewpoint brings convex duality and optimization-theoretic tools to bear, including Newton–Raphson projection for separable $\phi$, Dykstra's algorithm for general convex settings, and efficient sparse extensions. The framework is compatible with a variety of regularizer classes encountered in machine learning and information geometry, supporting both spread-promoting and sparsity-inducing couplings according to analytic or empirical desiderata (Dessein et al., 2016).

References

  1. Dessein, A., Papadakis, N., & Rouas, J.-L. (2016). Regularized Optimal Transport and the Rot Mover's Distance. arXiv:1610.06447.
