
Rot Mover’s Distance: Regularized OT

Updated 21 April 2026
  • Rot Mover’s Distance (RMD) is a generalized optimal transport metric that integrates smooth convex regularization to modulate plan smoothness and sparsity.
  • It bridges classical Earth Mover’s Distance and minimally regularized couplings by leveraging Bregman divergences and iterative projection algorithms such as ASA and NASA.
  • Empirical results, including applications in audio-scene classification, highlight RMD’s potential in enhancing OT-based kernel methods for pattern recognition.

The Rot Mover’s Distance (RMD) is a generalization of the classic Earth Mover’s Distance (EMD) within the framework of discrete optimal transport. RMD augments the standard transport problem by introducing a smooth convex regularization penalty on the joint transport plan, yielding a new class of metrics rooted in matrix nearness with respect to Bregman divergences. This construction enables interpolation between classical EMD and minimally regularized (minimal-$\Omega$) couplings, where the nature of the regularization controls plan smoothness, sparsity, or other desired structure. RMD recovers established methods such as Sinkhorn–Knopp for entropic regularization and extends to a wide spectrum of regularizers and induced divergences, with efficient algorithms tailored to the structure of each case (Dessein et al., 2016).

1. Mathematical Formulation: Primal and Dual RMD

Given probability vectors $a, b \in \Sigma_d$ in the $d$-simplex, a nonnegative cost matrix $C \in \mathbb{R}_+^{d \times d}$, and a convex, smooth regularizer $\Omega : \mathbb{R}^{d \times d} \to \mathbb{R} \cup \{+\infty\}$, RMD is formulated on the transport polytope

$$U(a,b) = \{ X \in \mathbb{R}_+^{d \times d} \mid X\mathbb{1} = a,\; X^T\mathbb{1} = b \}.$$
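
To make the feasible set concrete, here is a minimal NumPy check (an illustrative helper, not from the paper) that a candidate plan lies in $U(a,b)$:

```python
import numpy as np

# Feasibility check for the transport polytope U(a, b): nonnegative entries,
# row sums equal to a, column sums equal to b (illustrative helper).
def in_polytope(X, a, b, tol=1e-9):
    return bool((X >= -tol).all()
                and np.allclose(X.sum(axis=1), a)
                and np.allclose(X.sum(axis=0), b))

a = b = np.array([0.5, 0.5])
assert in_polytope(np.outer(a, b), a, b)   # the independent coupling a b^T
```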

  • Primal (Constrained) Formulation:

Given an "allowance" α0\alpha\geq 0 for the regularizer,

$$d'_{C,\alpha,\Omega}(a,b) = \min_{X \in U(a,b),\, \Omega(X) \leq \Omega(X^{\star}) + \alpha} \langle C, X \rangle,$$

where $X^{\star}$ solves $\min_{X \in U(a,b)} \Omega(X)$.

  • Dual (Penalized) Formulation:

Introducing a Lagrange parameter $\lambda \geq 0$,

$$d_{C,\lambda,\Omega}(a,b) = \min_{X \in U(a,b)} \langle C, X \rangle + \frac{1}{\lambda}\, \Omega(X).$$

For $\alpha$ below a threshold, there exists a unique $\lambda \geq 0$ with $d'_{C,\alpha,\Omega}(a,b) = d_{C,\lambda,\Omega}(a,b)$, so that the primal and dual minimizers coincide. Classical EMD is recovered as $\lambda \to +\infty$ (the penalty vanishes), while $\lambda \to 0^+$ yields the minimal-$\Omega$ coupling (Dessein et al., 2016).
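
For the entropic special case, the penalized problem admits a particularly light solver. The following minimal Sinkhorn–Knopp sketch (variable names and the $\lambda$ sweep are illustrative, assuming the $\frac{1}{\lambda}\Omega$ penalty above) computes the transport cost of the regularized plan:

```python
import numpy as np

# Penalized RMD for the entropic regularizer: the minimizer is a diagonal
# rescaling of the Gibbs kernel K = exp(-lam * C), found by Sinkhorn-Knopp.
def entropic_rmd(a, b, C, lam, n_iter=1000):
    K = np.exp(-lam * C)                  # unconstrained minimizer xi_lambda
    u = np.ones_like(a)
    for _ in range(n_iter):
        u = a / (K @ (b / (K.T @ u)))     # alternate row/column scalings
    v = b / (K.T @ u)
    X = u[:, None] * K * v[None, :]       # feasible plan in U(a, b)
    return float(np.sum(C * X)), X

# Sweeping lam traces the interpolation described above: small lam gives a
# smooth, spread-out plan; large lam approaches the sparse EMD plan.
a = b = np.full(5, 0.2)
C = np.abs(np.subtract.outer(np.arange(5.0), np.arange(5.0)))
for lam in (0.1, 1.0, 10.0, 100.0):
    cost, _ = entropic_rmd(a, b, C, lam)
    print(f"lam={lam:>5}: cost={cost:.4f}")
```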

2. Bregman-Projection Matrix-Nearness Equivalence

Let $\Omega$ be the Bregman-type information regularizer generated by a convex function $\phi$, either separable ($\Omega(X) = \sum_{ij} \phi(x_{ij})$) or general. The generator $\phi$ induces the Bregman divergence $B_\phi(X \| Y) = \phi(X) - \phi(Y) - \langle \nabla\phi(Y), X - Y \rangle$. The dual RMD problem equivalently minimizes this divergence over the transport polytope,

$$\min_{X \in U(a,b)} B_\phi(X \,\|\, \xi_\lambda),$$

where $\xi_\lambda = \nabla\phi^*(-\lambda C)$, with $\phi^*$ the Fenchel conjugate of $\phi$, is obtained by unconstrained minimization of $\langle C, X \rangle + \frac{1}{\lambda}\Omega(X)$. For the entropic regularizer, $\xi_\lambda = e^{-\lambda C}$ (elementwise), yielding a Kullback–Leibler projection interpretation. This general Bregman-projection framework enables the use of projection algorithms for regularized OT (Dessein et al., 2016).
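
A one-line numeric check of this equivalence for the entropic case (a sketch under the penalized form above): the stationarity condition $C + \frac{1}{\lambda}\nabla\Omega(X) = 0$ is solved exactly by $\xi_\lambda = e^{-\lambda C}$.

```python
import numpy as np

# With phi(x) = x*log(x) - x, grad phi = log, so the unconstrained minimizer
# of <C, X> + (1/lam) * Omega(X) satisfies C + (1/lam) * log(X) = 0,
# i.e. X = exp(-lam * C) = grad phi*(-lam * C).
rng = np.random.default_rng(0)
C, lam = rng.random((4, 4)), 5.0
xi = np.exp(-lam * C)
assert np.allclose(C + np.log(xi) / lam, 0.0)   # stationarity holds elementwise
```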

3. Iterative Bregman Projection Algorithms: ASA and NASA

Efficient solution of the RMD projection is based on iterative Bregman projections, with two principal algorithmic frameworks determined by the domain of the generator $\phi$:

  • Non-negative Alternate Scaling Algorithm (NASA):

Used when the domain of $\phi$ does not enforce nonnegativity ($X \geq 0$). Dykstra's algorithm augments alternate Bregman projections with correction variables, cycling through nonnegativity, row-sum, and column-sum constraints. Newton–Raphson solves the one-dimensional per-row and per-column projection equations when $\phi$ is separable. Each iteration maintains correction vectors to ensure convergence.

  • Alternate Scaling Algorithm (ASA):

Applied when the domain of $\phi$ is contained in $\mathbb{R}_+$, so nonnegativity is implicit. The method alternates between row-sum and column-sum Bregman projections with no correction variables. For separable $\phi$, updates decouple into per-row and per-column monotone equations efficiently solved by Newton–Raphson.

Both schemes generalize the classical projections-onto-convex-sets (POCS) framework and leverage the explicit structure of $\nabla\phi$ and $\nabla\phi^*$ for efficient updates (Dessein et al., 2016).
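
As a concrete instance of ASA with scalar per-row projections, the sketch below treats the Burg regularizer $\phi(x) = x - \log x$, whose domain $(0,\infty)$ makes ASA applicable. A bracketing root-finder (`brentq`) stands in for the Newton–Raphson step described in the paper; all names are illustrative:

```python
import numpy as np
from scipy.optimize import brentq

def burg_projection(xi, target):
    """Itakura-Saito Bregman projection of a positive vector xi onto sum = target.

    Stationarity gives x_j = xi_j / (1 - nu * xi_j); the residual below is
    increasing in nu on (-inf, 1/max(xi)), so bracketing always succeeds.
    """
    def residual(nu):
        return np.sum(xi / (1.0 - nu * xi)) - target
    hi = 1.0 / xi.max() - 1e-12        # nu must stay below 1/max(xi)
    lo = hi - 1.0
    while residual(lo) > 0.0:          # expand the bracket until the sign flips
        lo -= 2.0 * (hi - lo)
    nu = brentq(residual, lo, hi)
    return xi / (1.0 - nu * xi)

def asa_sweep(X, a, b):
    """One ASA iteration: project rows onto marginal a, then columns onto b."""
    X = np.stack([burg_projection(X[i], a[i]) for i in range(len(a))])
    return np.stack([burg_projection(X[:, j], b[j]) for j in range(len(b))], axis=1)

# Usage: for Burg, the unconstrained minimizer is xi_lambda = 1 / (1 + lam * C);
# iterate asa_sweep from there until the marginals converge.
```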

4. Regularizer Families and Induced Divergences

The RMD framework supports a broad gallery of convex regularizers $\Omega$ (see Table 1), each yielding a distinct Bregman divergence $B_\phi$ with associated geometric and statistical properties.

| Regularizer type | Generator $\phi(x)$ | Induced divergence $B_\phi$ |
|---|---|---|
| Entropic (KL) | $x \log x - x$ | Kullback–Leibler |
| Burg (Itakura–Saito) | $x - \log x$ | Itakura–Saito |
| Fermi–Dirac | $x \log x + (1-x)\log(1-x)$ | Fermi–Dirac (logistic) divergence |
| $\ell_p$ quasi-norms ($0 < p < 1$) | $-x^p$ | power-type, sparsity-promoting |
| $\ell_p$ norms ($p > 1$) | $x^p$ | power-type, spread-promoting |
| Euclidean ($\ell_2^2$) | $x^2 / 2$ | squared Euclidean distance |
| Hellinger-type | $-\sqrt{1 - x^2}$ | Hellinger-like divergence |
| Mahalanobis (quadratic form) | $\tfrac{1}{2}\,\mathrm{vec}(X)^T Q\, \mathrm{vec}(X)$ | Mahalanobis distance |

Notably, the framework recovers Sinkhorn–Knopp scaling for KL entropic regularization (where $\xi_\lambda = e^{-\lambda C}$ and the Newton projections reduce to diagonal matrix rescalings), but allows for fundamentally different plan structures, smoothing, and sparsification depending on the regularizer choice (Dessein et al., 2016).
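
The induced divergences in the table follow mechanically from $B_\phi(x, y) = \phi(x) - \phi(y) - \phi'(y)(x - y)$; a short sketch verifying three rows elementwise (illustrative code, not the paper's):

```python
import numpy as np

def bregman(phi, dphi, x, y):
    """Elementwise Bregman divergence induced by generator phi."""
    return phi(x) - phi(y) - dphi(y) * (x - y)

x, y = 0.3, 0.7

# Entropic phi(x) = x log x - x  ->  generalized KL: x log(x/y) - x + y
assert np.isclose(bregman(lambda t: t * np.log(t) - t, np.log, x, y),
                  x * np.log(x / y) - x + y)

# Burg phi(x) = x - log x  ->  Itakura-Saito: x/y - log(x/y) - 1
assert np.isclose(bregman(lambda t: t - np.log(t), lambda t: 1 - 1 / t, x, y),
                  x / y - np.log(x / y) - 1)

# Euclidean phi(x) = x^2 / 2  ->  squared distance: (x - y)^2 / 2
assert np.isclose(bregman(lambda t: t * t / 2, lambda t: t, x, y),
                  (x - y) ** 2 / 2)
```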

5. Empirical Properties and Algorithmic Considerations

RMD exhibits several algorithmic and empirical characteristics:

  • Interplay of $\lambda$ and $\Omega$:

Varying the regularization parameter $\lambda$ yields a continuous interpolation between sharply optimal (EMD-like) and highly regularized transport plans. The geometry, such as anisotropy or smoothness, is strongly modulated by the nature of $\Omega$.

  • Computational Complexity:

For moderate $d$, the Newton subproblems within ASA/NASA scale as $O(d)$ per row or column, i.e., $O(d^2)$ per projection sweep, with overall quadratic complexity per outer iteration. Sinkhorn–Knopp (KL) admits the fastest implementation; ASA is empirically faster than NASA due to the absence of correction variables.

  • Sparsity via Pruning:

Transport forbidden by the problem structure (for example, infinite-cost arcs) is handled by sparse extensions, which simply exclude these indices from updates without affecting correctness under broad conditions (Dessein et al., 2016), as the sketch below illustrates.
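
For the entropic case this pruning is automatic, since a forbidden (infinite-cost) arc yields a hard zero in the Gibbs kernel; a minimal illustration (the general sparse extensions in the paper exclude such indices from all updates):

```python
import numpy as np

# An infinite ground cost zeroes the corresponding kernel entry, so scaling
# updates never move mass along that arc.
lam = 2.0
C = np.array([[0.0, np.inf],
              [1.0, 0.0]])
K = np.exp(-lam * C)     # exp(-inf) == 0: arc (0, 1) is pruned
print(K)                 # [[1.0, 0.0], [exp(-2), 1.0]]
```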

6. Applications and Empirical Results

Synthetic experiments on two-mode densities demonstrate that tuning $\lambda$ and choosing different regularizers $\Omega$ yields qualitatively distinct mass redistributions, with markedly different smoothing profiles and various anisotropic effects. In audio-scene classification benchmarks (specifically, DCASE16), RMD-induced kernels—where each segment is encoded as a GMM over MFCCs, with OT ground cost given by pairwise Jeffrey divergence—achieve accuracy superior or competitive to classical EMD-based SVM kernels. The Hellinger-type and certain $\ell_p$ penalties (over a range of exponents $p$) outperform the classic Euclidean ($\ell_2^2$) or Burg (Itakura–Saito) regularizers in discriminative capacity. This suggests that fine-grained adjustment of $\lambda$ and $\Omega$ can significantly enhance OT-based kernel methods for pattern recognition and statistical tasks (Dessein et al., 2016).
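
A hedged sketch of the kernel construction used in such experiments: given any precomputed pairwise RMD matrix `D` (for example from the `entropic_rmd` sketch above), exponentiate it into a similarity kernel for an SVM. Here `gamma` and the pipeline details are illustrative placeholders; the paper's ground cost (pairwise Jeffrey divergence between GMM components) is not reproduced.

```python
import numpy as np

def rmd_kernel(D, gamma=1.0):
    """Exponential similarity kernel from a pairwise RMD/distance matrix D."""
    return np.exp(-gamma * D)

# K = rmd_kernel(D) can be fed to an SVM with a precomputed kernel, e.g.
# sklearn.svm.SVC(kernel="precomputed").fit(K, labels).
```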

7. Connections and Generalizations

RMD provides a principled interpolation and generalization over standard optimal transport, embedding classic EMD, entropic regularization (Sinkhorn), and other divergences in a single algorithmic and theoretical scaffold. The Bregman-projection viewpoint brings convex duality and optimization-theoretic tools to bear, including Newton–Raphson projection for separable $\phi$, Dykstra's algorithm for general convex settings, and efficient sparse extensions. The framework is compatible with a variety of regularizer classes encountered in machine learning and information geometry, supporting both spread-promoting and sparsity-inducing couplings according to analytic or empirical desiderata (Dessein et al., 2016).

References

  1. Dessein, A., Papadakis, N., & Rouas, J.-L. (2016). Regularized Optimal Transport and the Rot Mover's Distance. arXiv:1610.06447.
