
Optimal Transport in NN Methods

Updated 15 January 2026
  • Optimal Transport is a geometric framework that measures differences between distributions, enhancing nearest neighbor methods with deformation-aware metrics used in image analysis.
  • It employs both discrete linear programming and PDE formulations to compute the Wasserstein distance, leading to improved classification accuracy as demonstrated on MNIST.
  • The approach also facilitates the construction of adaptive neighborhood graphs through a quadratically-regularized optimization, outperforming traditional kNN methods in manifold learning.

Optimal transport (OT) provides a rigorous geometric framework for measuring differences between probability distributions, endowing nearest neighbor (NN) methods with a robust, deformation-aware notion of similarity. The use of OT, and specifically the Wasserstein distance, in NN-based algorithms has informed both classification (notably in image domains) and the construction of adaptive, sparse neighborhood graphs. These developments rely on both discrete linear programming and partial differential equation (PDE) formulations of OT, leading to metrics that account for spatial rearrangement of mass. This article examines the mathematical foundations, computational methodologies, and empirical performance of NN methods informed by optimal transport, describing their advantages over classical approaches and situating them in the context of recent academic work.

1. Mathematical Foundations of Optimal Transport for Nearest Neighbor Distances

Optimal transport-based NN methods adopt a cost-minimizing formulation to compare distributions (or images), replacing classical ℓ₂ or correlation-based metrics. In the classical discrete OT setting, given two histograms representing mass distributions over fixed supports (e.g., pixel centers $\{x_i\}$ for images), the Monge–Kantorovich problem seeks a transport plan $\pi \in \mathbb{R}^{n \times n}_{\geq 0}$ such that the marginals match the input histograms:

$$\Gamma(f, g) = \left\{ \pi \in \mathbb{R}^{n \times n}_{\geq 0} : \sum_{j=1}^n \pi_{ij} = f(x_i),\; \sum_{i=1}^n \pi_{ij} = g(y_j) \right\}.$$

With squared-Euclidean costs $c_{ij} = \|x_i - y_j\|^2$, the primal linear program is

$$W_2^2(f, g) = \min_{\pi \in \Gamma(f, g)} \sum_{i=1}^n \sum_{j=1}^n c_{ij} \pi_{ij}.$$

The LP dual introduces potentials $u, v \in \mathbb{R}^n$ with constraint $u_i + v_j \leq c_{ij}$.
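
For reference, the dual program can be written out in full. This is the standard linear programming dual of the transport problem, stated here in the notation of this section rather than quoted from the cited work:

```latex
\max_{u, v \in \mathbb{R}^n} \; \sum_{i=1}^n u_i\, f(x_i) + \sum_{j=1}^n v_j\, g(y_j)
\quad \text{subject to} \quad u_i + v_j \le c_{ij} \ \ \text{for all } i, j.
```

By LP strong duality the optimal value of this program equals $W_2^2(f, g)$, and complementary slackness gives $u_i + v_j = c_{ij}$ wherever the optimal plan has $\pi_{ij} > 0$.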

The continuous counterpart, via the Monge–Ampère equation, interprets images as strictly positive probability densities $f$, $g$ on $[0, 1]^2$ and seeks a map $T(x) = x - \nabla u(x)$ pushing $f$ onto $g$. The Monge–Ampère PDE reads:

$$\det(I - \nabla^2 u(x))\, g(x - \nabla u(x)) = f(x),\quad \text{with} \ \partial u/\partial n = 0.$$

The $L^2$-Wasserstein distance is then

$$W_2(f, g) = \left( \frac{1}{2} \int_{[0,1]^2} |x - T(x)|^2\, f(x)\, dx \right)^{1/2}.$$

This formulation results in a metric space over distributions, allowing for comparisons that respect underlying geometric distortions (Snow et al., 2018, Snow et al., 2016).

2. Embedding Wasserstein Distance into 1-Nearest Neighbor Classification

To employ OT in 1-NN classification, each sample (e.g., an image) is preprocessed into a normalized distribution (e.g., pixel intensities normalized to sum to 1, possibly regularized with a small constant to avoid zeros). For each test sample, the $W_2^2$ distance is computed between the query and every stored training sample; each distance is the cost of optimally rearranging the query's mass to match that training distribution.

For discrete measures, the process involves:

  1. Flattening each $n \times n$ image to a probability vector $f(x_i)$.
  2. Forming the cost matrix $c_{ij}$.
  3. Solving the OT linear program (LP) or the Monge–Ampère PDE for each test/train pair.
  4. Assigning the test sample to the class of its transport-minimizing neighbor.
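
The steps above can be sketched in a few lines of Python. This is a minimal illustration under a strong simplifying assumption: the "images" are 1D histograms on the integer grid 0, 1, ..., n-1, where the optimal plan for a convex cost is the monotone (north-west corner) coupling, so neither the LP nor the PDE solver is needed. The function names `normalize`, `w2_squared_1d`, `sq_euclidean`, and `classify_1nn` are hypothetical and do not come from the cited work.

```python
def normalize(pixels, eps=1e-3):
    """Step 1: add a small constant (avoids zeros) and rescale to total mass 1."""
    shifted = [p + eps for p in pixels]
    total = sum(shifted)
    return [p / total for p in shifted]


def w2_squared_1d(f, g):
    """Steps 2-3 for 1D histograms on the grid 0, 1, ..., n-1 (assumption).

    Walks the monotone coupling, which is optimal in 1D for the convex cost
    c_ij = (i - j)^2, and accumulates the transport cost.
    """
    rem_f, rem_g = list(f), list(g)
    i = j = 0
    cost = 0.0
    while i < len(rem_f) and j < len(rem_g):
        moved = min(rem_f[i], rem_g[j])
        cost += moved * (i - j) ** 2
        rem_f[i] -= moved
        rem_g[j] -= moved
        # At least one of the two bins is now exactly empty.
        if rem_f[i] <= 0.0:
            i += 1
        if rem_g[j] <= 0.0:
            j += 1
    return cost


def sq_euclidean(f, g):
    """Baseline: squared l2 distance between the two histograms."""
    return sum((a - b) ** 2 for a, b in zip(f, g))


def classify_1nn(query, train, labels, distance):
    """Step 4: return the label of the training sample closest to the query."""
    q = normalize(query)
    dists = [distance(q, normalize(t)) for t in train]
    nearest = min(range(len(train)), key=lambda k: dists[k])
    return labels[nearest]


# Toy data: one "stroke" per sample, at different positions.
train = [
    [0, 0, 0, 0, 0, 0, 9, 0],  # stroke near the right edge
    [0, 9, 0, 0, 0, 0, 0, 0],  # stroke near the left edge
]
labels = ["right", "left"]
query = [0, 0, 0, 9, 0, 0, 0, 0]  # stroke two bins away from "left", three from "right"

print(classify_1nn(query, train, labels, w2_squared_1d))
```

With the transport distance the query is assigned to "left", because moving its mass two bins costs less than moving it three. The squared Euclidean distances to the two training samples are essentially identical, since none of the strokes overlap, so that baseline cannot tell the classes apart here.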

Naive computational cost per test sample is $O(MN^3)$ for $M$ training samples and $N$ pixels, motivating the use of specialized OT solvers or approximation heuristics (Snow et al., 2018, Snow et al., 2016). The PDE approach, which uses Newton–Raphson iteration with careful discretization and scales as $O(N^{1.7})$ per pair, was empirically found practical for moderate resolutions.

3. Empirical Results and Geometric Advantages in Image Domains

Empirical evaluation on MNIST shows 1-NN classification with OT-based distances systematically outperforming conventional metrics. For example, with 210 training images (21 per digit), average 1-NN accuracy is approximately 81.4% for the discrete Kantorovich OT, 82.6% for the PDE OT, compared to 75.5% for Euclidean and 80.6% for Tangent Space distances (Snow et al., 2018, Snow et al., 2016). The improvement is especially pronounced in regimes with limited training data.

The Wasserstein metric penalizes mass displacement quadratically, conferring robustness to small image deformations (translations, thickness changes, fragmentation), which traditional $\ell_2$ distances exaggerate. NN boundaries under OT reflect the geometry of shape variations, leading to semantically meaningful decision regions that better respect inherent class structure in data manifolds.

4. OT-based Neighborhood Graph Construction and the Quadratically-Regularized Approach

Beyond pairwise classification, OT also informs the construction of adaptive neighborhood graphs. The quadratically-regularized OT (QOT) method formulates graph construction as a single-parameter optimization problem:

$$\min_{\Gamma \ge 0} \;\langle C, \Gamma \rangle + \frac{\alpha}{2} \|\Gamma\|_F^2$$

subject to

$$\Gamma\,\mathbf{1} = \mathbf{1},\; \Gamma = \Gamma^T.$$

The parameter $\alpha$ interpolates between a trivial identity graph ($\alpha \rightarrow 0^+$) and a fully connected uniform graph ($\alpha \rightarrow \infty$), with moderate $\alpha$ yielding adaptive sparsity. The unique KKT structure yields a closed-form thresholding for the entries $\gamma_{ij}$ of $\Gamma$:

$$\gamma_{ij} = \frac{1}{\alpha} \max\{ 0, u_i + u_j - c_{ij} \},$$

where $u$ is chosen so that row sums normalize to 1.
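
The interpolation can be checked by hand in the smallest possible case. The following worked example is an illustration under stated assumptions, not a result from the cited work: two points with $c_{11} = c_{22} = 0$ and $c_{12} = c_{21} = c > 0$. By symmetry $u_1 = u_2 = u$, and the row-sum condition reduces to a single scalar equation:

```latex
\frac{1}{\alpha}\Big( \max\{0,\, 2u\} + \max\{0,\, 2u - c\} \Big) = 1
\;\Longrightarrow\;
\begin{cases}
\alpha \le c: & 2u = \alpha, \quad \gamma_{11} = 1, \quad \gamma_{12} = 0, \\[4pt]
\alpha > c: & 2u = \dfrac{\alpha + c}{2}, \quad \gamma_{11} = \dfrac{\alpha + c}{2\alpha}, \quad \gamma_{12} = \dfrac{\alpha - c}{2\alpha}.
\end{cases}
```

The edge between the two points switches on only once $\alpha$ exceeds the cost $c$ of the pair, and both weights tend to $1/2$ as $\alpha \rightarrow \infty$, which matches the identity and uniform limits described above.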

QOT graphs automatically adapt to local data density and noise, unlike $k$NN or $\varepsilon$-neighborhood graphs, which require tuning additional parameters (Matsumoto et al., 2022). The semismooth Newton method enables efficient computation and exact sparsity in the edge matrix, favoring robust downstream learning tasks.
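
To make the thresholding structure concrete, here is a minimal sketch that builds a QOT graph on a handful of 1D points. It is deliberately not the semismooth Newton method mentioned above: as a simple stand-in, it updates one potential $u_i$ at a time by bisection so that row $i$ sums to 1, and repeats for a fixed number of sweeps. The function name `qot_graph`, the 1D inputs, the squared-distance cost with zero diagonal (so self-loops are allowed, consistent with the identity-graph limit), and the sweep count are all assumptions made for illustration, with no claim of efficiency on larger problems.

```python
def qot_graph(points, alpha, sweeps=200):
    """Quadratically regularized OT graph on 1D points (illustrative sketch).

    Returns (gamma, u) with gamma[i][j] = max(0, u[i] + u[j] - c[i][j]) / alpha.
    Stand-in solver: coordinate-wise bisection, not semismooth Newton.
    """
    n = len(points)
    cost = [[(p - q) ** 2 for q in points] for p in points]
    u = [0.0] * n

    def row_mass(i, ui):
        # alpha times the i-th row sum of gamma if u[i] were replaced by ui.
        total = 0.0
        for j in range(n):
            uj = ui if j == i else u[j]
            total += max(0.0, ui + uj - cost[i][j])
        return total

    for _sweep in range(sweeps):
        for i in range(n):
            # Smallest ui at which some entry of row i becomes positive.
            lo = min(
                cost[i][i] / 2 if j == i else cost[i][j] - u[j]
                for j in range(n)
            )
            # The entry that switches on at lo grows with slope >= 1,
            # so row_mass(i, lo + alpha) >= alpha.
            hi = lo + alpha
            for _step in range(100):
                mid = 0.5 * (lo + hi)
                if row_mass(i, mid) < alpha:
                    lo = mid
                else:
                    hi = mid
            u[i] = 0.5 * (lo + hi)

    gamma = [
        [max(0.0, u[i] + u[j] - cost[i][j]) / alpha for j in range(n)]
        for i in range(n)
    ]
    return gamma, u


# Two well-separated pairs of points on a line.
gamma, u = qot_graph([0.0, 1.0, 10.0, 11.0], alpha=4.0)
for row in gamma:
    print([round(w, 3) for w in row])
```

For this input the edges between the two pairs are exactly zero rather than merely small, while within each pair the weights settle at 0.625 on the diagonal and 0.375 on the edge. No neighbor count or radius is specified anywhere: the sparsity pattern follows from $\alpha$ and the costs alone.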

5. Comparisons with Classical Nearest Neighbor Approaches

Classical $k$NN graphs connect each data point to its $k$ closest neighbors, irrespective of sampling density. In irregular or noisy datasets, this may induce spurious edges or oversmooth dense regions. $\varepsilon$-neighborhood graphs require precise tuning to avoid under- or over-connection. In contrast, QOT provides a principled, continuous interpolation, from highly localized to nearly all-to-all connectivity, through a single regularization parameter.

Empirical studies demonstrate that QOT graphs yield superior spectral embeddings in manifold learning, maintain local geometry under heteroskedastic noise, and outperform traditional graphs in semi-supervised learning scenarios (e.g., label propagation). In applications such as single-cell RNA sequencing, QOT enables high-quality imputations without the need for per-sample bandwidth tuning (Matsumoto et al., 2022).

6. Limitations, Computational Scalability, and Modeling Considerations

The main computational limitation of OT-based NN methods is the cost of solving high-dimensional transport problems, which, despite algorithmic advances (e.g., network-flow solvers, PDE accelerations), remains significantly higher than that of direct vector-based distances. PDE-based methods require careful numerical implementation, including the management of boundary conditions and tuning of damping parameters, and generally assume strictly positive densities. For some solvers (such as those for the Monge–Ampère PDE), convergence is observed empirically rather than proven theoretically.

Approximate solvers, coarse-to-fine hierarchies, and restriction of admissible transport arcs offer avenues for further computational gains. Nonetheless, embedding OT geometry into nearest neighbor frameworks yields metrics and neighborhood structures that are more invariant to nuisance variability and that more faithfully reflect latent data structure (Snow et al., 2018, Snow et al., 2016, Matsumoto et al., 2022).
