Optimal Transport in NN Methods

Updated 15 January 2026

Optimal Transport is a geometric framework that measures differences between distributions, enhancing nearest neighbor methods with deformation-aware metrics used in image analysis.
It employs both discrete linear programming and PDE formulations to compute the Wasserstein distance, leading to improved classification accuracy as demonstrated on MNIST.
The approach also facilitates the construction of adaptive neighborhood graphs through a quadratically-regularized optimization, outperforming traditional kNN methods in manifold learning.

Optimal transport (OT) provides a rigorous geometric framework for measuring differences between probability distributions, endowing nearest neighbor (NN) methods with a robust, deformation-aware notion of similarity. The use of OT, and specifically the Wasserstein distance, in NN-based algorithms has informed both classification (notably in image domains) and the construction of adaptive, sparse neighborhood graphs. These developments rely on both discrete linear programming and partial differential equation (PDE) formulations of OT, leading to metrics that account for spatial rearrangement of mass. This article examines the mathematical foundations, computational methodologies, and empirical performance of NN methods informed by optimal transport, describing their advantages over classical approaches and situating them in the context of recent academic work.

1. Mathematical Foundations of Optimal Transport for Nearest Neighbor Distances

Optimal transport-based NN methods adopt a cost-minimizing formulation to compare distributions (or images), replacing classical $\ell_2$ or correlation-based metrics. In the classical discrete OT setting, given two histograms representing mass distributions over fixed supports (e.g., pixel centers $\{x_i\}$ for images), the Monge–Kantorovich problem seeks a transport plan $\pi \in \mathbb{R}^{n \times n}_{\geq 0}$ such that the marginals match the input histograms: $\Gamma(f, g) = \left\{ \pi \in \mathbb{R}^{n \times n}_{\geq 0} : \sum_{j=1}^n \pi_{ij} = f(x_i),\; \sum_{i=1}^n \pi_{ij} = g(y_j) \right\}.$ With squared-Euclidean costs $c_{ij} = \|x_i - y_j\|^2$, the primal linear program is

$W_2^2(f, g) = \min_{\pi \in \Gamma(f, g)} \sum_{i=1}^n \sum_{j=1}^n c_{ij} \pi_{ij}.$

The LP dual introduces potentials $u, v \in \mathbb{R}^n$ and maximizes $\sum_{i} u_i f(x_i) + \sum_{j} v_j g(y_j)$ subject to the constraint $u_i + v_j \leq c_{ij}$ for all $i, j$.
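The primal linear program above can be sketched with a general-purpose LP solver. This is a minimal illustration, not the solver used in the cited work: it assumes SciPy's `linprog` with the HiGHS backend, and the function name `discrete_w2_squared` is hypothetical. The plan $\pi$ is flattened row by row, so entry $\pi_{ij}$ sits at index $i \cdot m + j$.

```python
import numpy as np
from scipy.optimize import linprog

def discrete_w2_squared(f, g, x, y):
    """Solve the Kantorovich LP; return (W_2^2, transport plan).

    f, g : histograms with equal total mass, shapes (n,) and (m,)
    x, y : support points, shapes (n, d) and (m, d)
    """
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(f), len(g)

    # Squared-Euclidean cost c_ij = ||x_i - y_j||^2.
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

    # Marginal constraints on the flattened plan.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))  # sum_j pi_ij = f_i
    col_sums = np.kron(np.ones((1, n)), np.eye(m))  # sum_i pi_ij = g_j
    A_eq = np.vstack([row_sums, col_sums])
    b_eq = np.concatenate([f, g])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

# Unit mass moved from position 0 to position 2 on a line.
pts = np.array([[0.0], [1.0], [2.0]])
cost, plan = discrete_w2_squared([1, 0, 0], [0, 0, 1], pts, pts)
```

Here the only feasible plan sends all mass from the first point to the third, so the optimal cost is the squared displacement, 4. The dense constraint matrix has $(n + m) \times nm$ entries, which is why this direct form is only reasonable for small supports.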

The continuous counterpart, via the Monge–Ampère equation, interprets images as strictly positive probability densities $f$, $g$ on $[0, 1]^2$ and seeks a map $T(x) = x - \nabla u(x)$ pushing $f$ onto $g$. The Monge–Ampère PDE reads: $\det(I - \nabla^2 u(x))\, g(x - \nabla u(x)) = f(x),\quad \text{with} \ \partial u/\partial n = 0.$ The $L^2$-Wasserstein distance is then

$W_2(f, g) = \left( \frac{1}{2} \int_{[0,1]^2} |x - T(x)|^2\, f(x)\, dx \right)^{1/2}.$

This formulation results in a metric space over distributions, allowing for comparisons that respect underlying geometric distortions (Snow et al., 2018, Snow et al., 2016).

2. Embedding Wasserstein Distance into 1-Nearest Neighbor Classification

To employ OT in 1-NN classification, each sample (e.g., an image) is preprocessed into a normalized distribution (e.g., pixel intensities normalized to sum to 1, possibly regularized with a small constant to avoid zeros). For each test sample, the $W_2^2$ distance is computed between the query and all stored training samples—each distance corresponding to the cost of optimally rearranging the query's mass to match each training distribution.
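The preprocessing step can be sketched as follows. This is a minimal illustration: the function name `image_to_distribution` and the default value of `eps` are assumptions, since the text only says that a small constant may be added to avoid zeros.

```python
import numpy as np

def image_to_distribution(img, eps=1e-6):
    """Turn non-negative pixel intensities into a strictly positive
    distribution that sums to 1. `eps` is a hypothetical default."""
    img = np.asarray(img, dtype=float)
    if np.any(img < 0):
        raise ValueError("pixel intensities must be non-negative")
    shifted = img + eps            # avoid exact zeros
    return shifted / shifted.sum() # normalize total mass to 1

p = image_to_distribution([[0, 255], [128, 0]])
```

Adding the constant before normalizing keeps every entry strictly positive, which matches the positivity assumption of the PDE formulation.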

For discrete measures, the process involves:

Flattening each image to a probability vector $f(x_i)$ indexed by its pixel centers $x_i$.
Forming the cost matrix $c_{ij}$ .
Solving the OT linear program (LP) or the Monge–Ampère PDE for each test/train pair.
Assigning the test sample to the class of its transport-minimizing neighbor.
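The steps above can be sketched end to end for small images. This is an illustrative sketch under stated assumptions, not the cited authors' implementation: it uses SciPy's `linprog` in place of a specialized OT solver, places pixel centers on an integer grid, and all function names and the `eps` default are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def to_distribution(img, eps=1e-6):
    """Flatten, add a small constant (hypothetical eps), normalize to mass 1."""
    p = np.asarray(img, dtype=float).ravel() + eps
    return p / p.sum()

def pixel_cost_matrix(height, width):
    """Squared Euclidean distances between pixel centers on an integer grid."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pts = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    return ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)

def w2_squared(p, q, C):
    """Optimal value of the transport LP between probability vectors p and q."""
    n = len(p)
    A_eq = np.vstack([np.kron(np.eye(n), np.ones((1, n))),   # row sums = p
                      np.kron(np.ones((1, n)), np.eye(n))])  # column sums = q
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

def ot_1nn_predict(query, train_images, train_labels):
    """Return the label of the training image with the smallest W_2^2."""
    C = pixel_cost_matrix(*np.shape(query))
    q = to_distribution(query)
    dists = [w2_squared(q, to_distribution(t), C) for t in train_images]
    return train_labels[int(np.argmin(dists))]

# Toy data: vertical bars in a 4x4 image.
left = np.zeros((4, 4));  left[:, 0] = 1
right = np.zeros((4, 4)); right[:, 3] = 1
query = np.zeros((4, 4)); query[:, 1] = 1
label = ot_1nn_predict(query, [left, right], ["left", "right"])
```

In this toy case the query bar is one column away from `left` and two columns away from `right`, so the transport cost picks `left`. The squared Euclidean distance between the raw images is 8 in both cases, so it cannot separate the two candidates.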

The naive computational cost is $O(MN^3)$ for $M$ training samples and $N$ pixels per image, which motivates the use of specialized OT solvers or approximation heuristics (Snow et al., 2018, Snow et al., 2016). The PDE approach, which uses Newton–Raphson iterations with careful discretization, scales as $O(N^{1.7})$ per pair and was found empirically to be practical at moderate resolutions.

3. Empirical Results and Geometric Advantages in Image Domains

Empirical evaluation on MNIST shows 1-NN classification with OT-based distances systematically outperforming conventional metrics. For example, with 210 training images (21 per digit), average 1-NN accuracy is approximately 81.4% for the discrete Kantorovich OT and 82.6% for the PDE OT, compared with 75.5% for Euclidean and 80.6% for Tangent Space distances (Snow et al., 2018, Snow et al., 2016). The improvement is especially pronounced in regimes with limited training data.

The Wasserstein metric penalizes mass displacement quadratically, conferring robustness to small image deformations (translations, thickness changes, fragmentation), which traditional $\ell_2$ distances exaggerate. NN boundaries under OT reflect the geometry of shape variations, leading to semantically meaningful decision regions that better respect inherent class structure in data manifolds.
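A small worked example makes this contrast concrete (an illustration added here, not taken from the cited papers). Let $f$ and $g$ each place unit mass on a single support point, $x_a$ and $x_b$ respectively, with $a \neq b$. The marginal constraints then leave only one feasible plan, $\pi_{ab} = 1$ with all other entries zero, so

$W_2^2(f, g) = c_{ab} = \|x_a - x_b\|^2, \qquad \|f - g\|_2 = \sqrt{1^2 + (-1)^2} = \sqrt{2}.$

The transport cost grows with the size of the displacement, whereas the $\ell_2$ distance between the two histograms equals $\sqrt{2}$ whether the mass moved by one pixel or across the whole image. For non-overlapping shapes, the $\ell_2$ distance therefore scores a small shift the same as a large one.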

4. OT-based Neighborhood Graph Construction and the Quadratically-Regularized Approach

Beyond pairwise classification, OT also informs the construction of adaptive neighborhood graphs. The quadratically-regularized OT (QOT) method formulates graph construction as a single-parameter optimization problem: $\min_{\Gamma \ge 0} \;\langle C, \Gamma \rangle + \frac{\alpha}{2} \|\Gamma\|_F^2$ subject to

$\Gamma\,\mathbf{1} = \mathbf{1},\; \Gamma = \Gamma^T.$

The parameter $\alpha$ interpolates between a trivial identity graph ($\alpha \rightarrow 0^+$) and a fully connected uniform graph ($\alpha \rightarrow \infty$), with moderate $\alpha$ yielding adaptive sparsity. The KKT conditions yield a closed-form thresholding: $\gamma_{ij} = \frac{1}{\alpha} \max\{ 0, u_i + u_j - c_{ij} \},$ where $u$ is chosen so that the row sums equal 1.

QOT graphs automatically adapt to local data density and noise, unlike $k$NN or $\varepsilon$-neighborhood graphs, which require tuning additional parameters (Matsumoto et al., 2022). The semismooth Newton method enables efficient computation and exact sparsity in the edge matrix, which benefits downstream learning tasks.
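The thresholding formula can be illustrated with a short sketch. Note the substitution: instead of the semismooth Newton method mentioned above, this sketch finds $u$ by maximizing the concave dual $2\sum_i u_i - \frac{1}{2\alpha}\sum_{i,j} \max\{0, u_i + u_j - c_{ij}\}^2$ with SciPy's L-BFGS-B, whose stationarity condition is exactly that the row sums equal 1. The function name `qot_graph`, the squared-Euclidean cost, and the solver tolerances are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize

def qot_graph(X, alpha):
    """Sketch of a QOT edge-weight matrix via dual maximization.

    Assumes squared-Euclidean costs; uses L-BFGS-B rather than the
    semismooth Newton method described in the text.
    """
    X = np.asarray(X, dtype=float)
    C = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

    def neg_dual(u):
        # R_ij = max(0, u_i + u_j - c_ij), the un-normalized plan.
        R = np.maximum(0.0, u[:, None] + u[None, :] - C)
        value = 2.0 * u.sum() - (R ** 2).sum() / (2.0 * alpha)
        grad = 2.0 - 2.0 * R.sum(axis=1) / alpha  # zero when row sums are 1
        return -value, -grad

    res = minimize(neg_dual, np.zeros(len(X)), jac=True, method="L-BFGS-B",
                   options={"maxiter": 5000, "ftol": 1e-15, "gtol": 1e-10})
    u = res.x
    return np.maximum(0.0, u[:, None] + u[None, :] - C) / alpha

# Two well-separated clusters of five points each.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.1, size=(5, 2))
X = np.vstack([cluster, cluster + 10.0])
G = qot_graph(X, alpha=1.0)
```

Because the weights come from a hard threshold, entries between the two distant clusters are exactly zero rather than merely small, while each row still sums to 1 up to solver tolerance. A quasi-Newton method is used here only for brevity; it does not exploit the piecewise-linear structure that the semismooth Newton approach does.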

5. Comparisons with Classical Nearest Neighbor Approaches

Classical $k$NN graphs connect each data point to its $k$ closest neighbors, irrespective of sampling density. In irregular or noisy datasets, this may induce spurious edges or oversmooth dense regions. $\varepsilon$-neighborhood graphs require precise tuning to avoid under- or over-connection. In contrast, QOT provides a principled, continuous interpolation, from highly localized to nearly all-to-all connectivity, through a single regularization parameter.
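For comparison, a classical $k$NN graph can be sketched in a few lines. This is a minimal illustration: the function name `knn_graph` is hypothetical, and symmetrizing with a logical OR is one common convention among several.

```python
import numpy as np

def knn_graph(X, k):
    """Boolean adjacency matrix of a symmetrized kNN graph."""
    X = np.asarray(X, dtype=float)
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbor
    idx = np.argsort(D, axis=1)[:, :k]        # k closest points per row
    A = np.zeros(D.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, idx.ravel()] = True
    return A | A.T                            # keep an edge if either end chose it

# A tight cluster of three points plus one distant outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
A = knn_graph(X, k=2)
```

With $k = 2$ the outlier is still wired to two cluster points, even though they are far away, because the rule fixes the number of neighbors without regard to distance or density. This is the kind of spurious edge described above.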

Empirical studies demonstrate that QOT graphs yield superior spectral embeddings in manifold learning, maintain local geometry under heteroskedastic noise, and outperform traditional graphs in semi-supervised learning scenarios (e.g., label propagation). In applications such as single-cell RNA sequencing, QOT enables high-quality imputations without the need for per-sample bandwidth tuning (Matsumoto et al., 2022).

6. Limitations, Computational Scalability, and Modeling Considerations

The main computational limitation of OT-based NN methods is the cost of solving high-dimensional transport problems, which, despite algorithmic advances (e.g., network-flow solvers, PDE accelerations), remains significantly higher than that of direct vector-based distances. PDE-based methods require careful numerical implementation, including the management of boundary conditions and tuning of damping parameters, and generally assume strictly positive densities. For some solvers (such as those for the Monge–Ampère PDE), convergence is observed empirically rather than guaranteed theoretically.

Approximate solvers, coarse-to-fine hierarchies, and restriction of admissible transport arcs offer avenues for further computational gains. Nonetheless, embedding OT geometry into nearest neighbor frameworks yields metrics and neighborhood structures that are more invariant to nuisance variability and that more faithfully reflect latent data structure (Snow et al., 2018, Snow et al., 2016, Matsumoto et al., 2022).