
Wasserstein Distance-Preserving Embeddings

Updated 14 November 2025
  • Wasserstein distance-preserving embeddings are mappings that embed probability distributions into latent spaces while retaining optimal transport geometry.
  • They utilize methods such as template-based approaches, multidimensional scaling, and neural architectures to enable efficient and scalable distribution analysis.
  • These embeddings enhance tasks like classification and manifold learning by capturing structural and multimodal relationships with provable distortion bounds.

A Wasserstein distance-preserving embedding is a mapping from a space of probability measures (distributions), equipped with the $p$-Wasserstein metric, into a Euclidean or otherwise structured latent space such that distances between embedded points closely approximate the original Wasserstein distances. Unlike kernel mean embeddings, which map distributions to points in a Hilbert space by averaging, Wasserstein embeddings directly encode optimal transport-based geometry, capturing structural, multimodal, or hierarchical relationships between distributions that kernel approaches often miss. Such embeddings facilitate scalable, distance-aware learning and analysis of datasets where each object is a probability distribution (e.g., histograms, point clouds, sequence distributions, or sets of images).

1. Formal Definition and Theoretical Foundations

Let $\mathcal{P}_p(X)$ be the space of Borel probability measures with finite $p$-th moments on a Polish metric space $(X, d_X)$. The $p$-Wasserstein distance is

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X\times X} d_X(x,y)^p\, d\pi(x,y) \right)^{1/p},$$

where $\Pi(\mu,\nu)$ is the set of couplings with marginals $\mu$ and $\nu$.
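For finitely supported measures, the infimum over couplings becomes a linear program. The following is a minimal sketch of that program, assuming SciPy's `linprog` with the HiGHS solver is available; the function name `wasserstein_discrete` and the three-point example are illustrative choices, not taken from the cited papers.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(a, b, cost, p=1):
    """W_p between histograms a (size n) and b (size m).

    cost[i, j] is the ground distance d_X(x_i, y_j). The coupling pi is
    flattened row-major, so pi[i, j] sits at index i * m + j.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    c = (np.asarray(cost, dtype=float) ** p).ravel()  # objective: sum_ij d^p * pi_ij
    # Marginal constraints: rows of pi sum to a, columns of pi sum to b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([a, b])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(max(res.fun, 0.0) ** (1.0 / p))

# Three points on a line at positions 0, 1, 2 (illustrative data).
pos = np.array([0.0, 1.0, 2.0])
cost = np.abs(pos[:, None] - pos[None, :])
w1_far = wasserstein_discrete([1, 0, 0], [0, 0, 1], cost, p=1)        # all mass moves by 2
w1_shift = wasserstein_discrete([0.5, 0.5, 0], [0, 0.5, 0.5], cost, p=1)  # shift by 1
print(w1_far, w1_shift)
```

The dense constraint matrix keeps the sketch short but grows as $(n+m) \times nm$, so it is only suitable for small histograms.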

A distance-preserving (isometric) embedding is a map $\Phi:\mathcal{P}_p(X) \to \mathbb{R}^m$ such that

$$\|\Phi(\mu) - \Phi(\nu)\| = W_p(\mu, \nu) \quad \text{for all } \mu, \nu \in \mathcal{P}_p(X).$$

Sometimes the mapping is only bi-Lipschitz with distortion $D \ge 1$:

$$W_p(\mu, \nu) \le \|\Phi(\mu) - \Phi(\nu)\| \le D\, W_p(\mu, \nu).$$

For $p = 1$ on a finite metric space $X$, it is known that if $X$ admits a stochastic embedding into metric trees with distortion $D$, then $(\mathcal{P}_1(X), W_1)$ bi-Lipschitz embeds into $\ell_1$ with the same distortion $D$. The construction uses the Evans–Matsen formula, which enables explicit, linear-complexity isometric embeddings into $\ell_1$ (Mathey-Prevot et al., 2021). More generally, for $p > 1$, perfect isometry is not possible in general, but embeddings with small distortion exist for large finite metrics (Frogner et al., 2019).
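The tree case can be made concrete. On a weighted tree, $W_1$ between two distributions equals the sum over edges of the edge length times the absolute difference in mass below that edge, so mapping a distribution to its vector of edge-weighted subtree masses gives an isometric embedding into $\ell_1$. The sketch below assumes a rooted tree stored as a parent array with node 0 as root and parents indexed before children; the function name and data layout are hypothetical.

```python
import numpy as np

def tree_l1_embedding(mu, parent, weight):
    """Map a distribution on tree nodes to edge-weighted subtree masses.

    parent[v] is the parent of node v (node 0 is the root, parent[0] = -1).
    weight[v] is the length of the edge (v, parent[v]); weight[0] is unused.
    Assumes parent[v] < v, so one backward sweep accumulates subtree masses.
    """
    mass = np.asarray(mu, dtype=float).copy()
    for v in range(len(mass) - 1, 0, -1):   # push mass from leaves toward the root
        mass[parent[v]] += mass[v]
    w = np.asarray(weight, dtype=float)
    return (w * mass)[1:]                    # one coordinate per edge

def tree_w1(mu, nu, parent, weight):
    """W_1 on the tree as an l1 distance between embeddings."""
    diff = tree_l1_embedding(mu, parent, weight) - tree_l1_embedding(nu, parent, weight)
    return float(np.sum(np.abs(diff)))

# Path 0 - 1 - 2 - 3 with unit edges: moving a point mass from node 0 to node 3 costs 3.
path_parent = [-1, 0, 1, 2]
path_weight = [0.0, 1.0, 1.0, 1.0]
d_path = tree_w1([1, 0, 0, 0], [0, 0, 0, 1], path_parent, path_weight)

# Star with leaves at distances 1, 2, 3 from the root: leaf 1 to leaf 3 costs 1 + 3.
star_parent = [-1, 0, 0, 0]
star_weight = [0.0, 1.0, 2.0, 3.0]
d_star = tree_w1([0, 1, 0, 0], [0, 0, 0, 1], star_parent, star_weight)
print(d_path, d_star)
```

The embedding costs one pass over the nodes per distribution, which is the sense in which the complexity is linear.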

2. Canonical Construction Approaches

2.1 Template-based Wasserstein Embeddings

The template (or dissimilarity) embedding maps a probability measure $\mu$ to a vector of Wasserstein distances to a selected set of template distributions $\{\nu_j\}_{j=1}^{N}$:

$$\Phi(\mu) = \frac{1}{M}\bigl(W_p(\mu, \nu_1), \dots, W_p(\mu, \nu_N)\bigr),$$

where $M$ is a normalization constant (e.g., a bound on $W_p$). The quality of the embedding, such as linear separability for classification, follows from the "good dissimilarity" theory: for sufficiently large $N$, the resulting feature cloud can be separated with low error by a linear classifier, provided the Wasserstein distance is a good dissimilarity for the task (Rakotomamonjy et al., 2018).

The template set can be chosen by random sampling, clustering (e.g., $k$-means in an RKHS), or heuristically.
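A small sketch of the template construction follows. It uses one-dimensional empirical measures with equal sample sizes, where $W_1$ reduces to the mean absolute difference of sorted samples. The two-class Gaussian data, the choice of one template per class, and the normalization by the largest observed distance are illustrative assumptions, not the setup of the cited paper.

```python
import numpy as np

def w1_sorted(x, y):
    """W_1 between two equal-size 1D empirical measures (sorted matching)."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def template_embedding(samples, templates, M=None):
    """Rows: input distributions. Columns: normalized distance to each template."""
    D = np.array([[w1_sorted(s, t) for t in templates] for s in samples])
    if M is None:                       # assumption: normalize by the largest observed distance
        M = D.max() if D.max() > 0 else 1.0
    return D / M

rng = np.random.default_rng(0)
class0 = [rng.normal(0.0, 1.0, 200) for _ in range(5)]   # distributions centered at 0
class1 = [rng.normal(3.0, 1.0, 200) for _ in range(5)]   # distributions centered at 3
data = class0 + class1
templates = [data[0], data[5]]                           # one template drawn from each class
features = template_embedding(data, templates)
print(features.round(2))
```

In this toy setting each row is closer to the template from its own class, so a linear classifier on `features` separates the two classes. Real tasks would replace `w1_sorted` with a general OT solver.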

2.2 Metric MDS with Wasserstein Geometry

Classical multidimensional scaling (MDS) constructs embeddings whose Euclidean distances align with a given dissimilarity matrix. For distributions $\mu_1, \dots, \mu_N$, define the matrix $D_{ij} = W_2(\mu_i, \mu_j)$ (or its square). The MDS embedding seeks points $y_1, \dots, y_N \in \mathbb{R}^m$ such that

$$\|y_i - y_j\| \approx W_2(\mu_i, \mu_j) \quad \text{for all } i, j.$$

This approach is "exact" for Wasserstein-flat manifolds such as translation families, as shown in Isometric Wasserstein Mapping (Wassmap) (Hamm et al., 2022). It can be made more efficient with the Linearized OT (LOT) approach, which uses a single reference measure and only $N$ OT solves rather than one per pair (Cloninger et al., 2023).
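The exactness claim for translation families can be checked numerically. The sketch below builds a one-dimensional translation family, computes pairwise $W_2$ by sorted matching (valid for equal-size 1D samples), and applies classical MDS via double centering. The shift values and helper names are illustrative assumptions and this is not the Wassmap reference implementation.

```python
import numpy as np

def w2_sorted(x, y):
    """W_2 between two equal-size 1D empirical measures (sorted matching)."""
    return float(np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2)))

def classical_mds(D, m):
    """Classical MDS: double-center the squared distances, keep the top-m eigenpairs."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:m]
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

rng = np.random.default_rng(0)
base = rng.normal(size=100)
shifts = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])      # latent translation parameters
clouds = [base + s for s in shifts]
D = np.array([[w2_sorted(a, b) for b in clouds] for a in clouds])

Y = classical_mds(D, m=1)                             # shape (5, 1)
recovered = np.abs(Y - Y.T)                           # pairwise distances in the embedding
target = np.abs(shifts[:, None] - shifts[None, :])
print(np.max(np.abs(recovered - target)))
```

The embedding coordinates match the shifts only up to a sign flip and a constant offset, which is why the comparison is made on pairwise distances.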

2.3 Deep and Neural Wasserstein Metric Embeddings

Parametric neural architectures (Siamese, Transformer, encoder–decoder) can be trained on pairs of distributions $(\mu_i, \mu_j)$ to minimize the discrepancy between $\|\phi(\mu_i) - \phi(\mu_j)\|$ and $W_p(\mu_i, \mu_j)$ (or related Sinkhorn divergences), where $\phi$ denotes the encoder. For instance, the Deep Wasserstein Embedding (DWE) model uses a Siamese CNN encoder with a coupled decoder for barycenter and inverse-mapping tasks, trained to minimize

$$\sum_{(i,j)} \left( \|\phi(\mu_i) - \phi(\mu_j)\|_2^2 - W_2^2(\mu_i, \mu_j) \right)^2$$

with optional reconstruction or sparsity penalties (Courty et al., 2017, Haviv et al., 2024).
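The metric-matching objective can be illustrated without a deep network. The toy sketch below replaces the CNN encoder with a linear map applied to the (mean, standard deviation) parameters of 1D Gaussians, for which $W_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$ in closed form, and fits it by plain gradient descent. The linear encoder, learning rate, step count, and data ranges are all assumptions made for illustration; this is not the DWE architecture.

```python
import numpy as np

def gaussian_w2_sq(z1, z2):
    """Squared W_2 between 1D Gaussians given as (mean, std) rows."""
    return np.sum((z1 - z2) ** 2, axis=-1)

def siamese_loss(A, Z, pairs):
    """Mean over pairs of (||phi(z_i) - phi(z_j)||^2 - W_2^2)^2 with phi(z) = A z.

    Returns the loss and its gradient with respect to A.
    """
    i, j = pairs[:, 0], pairs[:, 1]
    d = Z[i] - Z[j]                          # phi(z_i) - phi(z_j) = A d for a linear encoder
    Ad = d @ A.T
    r = np.sum(Ad ** 2, axis=1) - gaussian_w2_sq(Z[i], Z[j])
    grad = 4.0 * np.einsum("k,ka,kb->ab", r, Ad, d) / len(r)
    return float(np.mean(r ** 2)), grad

rng = np.random.default_rng(0)
N = 20
Z = np.column_stack([rng.uniform(-1.0, 1.0, N), rng.uniform(0.5, 1.5, N)])  # (mean, std)
pairs = np.array([(i, j) for i in range(N) for j in range(i + 1, N)])

A = np.array([[0.5, 0.1], [0.0, 0.5]])       # deliberately poor initial encoder
initial_loss, _ = siamese_loss(A, Z, pairs)
for _ in range(2000):                         # plain gradient descent, step size 0.02
    _, grad = siamese_loss(A, Z, pairs)
    A -= 0.02 * grad
final_loss, _ = siamese_loss(A, Z, pairs)
print(initial_loss, final_loss)
```

Because the loss depends on the encoder only through $A^\top A$, any orthogonal transform of a good encoder is equally good. The learned embedding is therefore determined only up to rotation and reflection.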

3. Distortion Bounds and Theoretical Guarantees

3.1 Rigorous (Bi-)Lipschitz Results

  • For stochastic tree metrics, $(\mathcal{P}_1(X), W_1)$ bi-Lipschitzly embeds into $\ell_1$ with a tight distortion bound matching that of the tree metric embedding (Mathey-Prevot et al., 2021).
  • In high dimensions or when $p > 1$, exact isometry is unattainable globally; for finite $n$-point metrics, embeddings with bounded distortion still exist (universality results from metric embedding theory; Frogner et al., 2019).
  • For the template embedding, the margin and linear separability of the resulting feature cloud are controlled by the number of templates and the alignment of Wasserstein geometry with the true task classes (Rakotomamonjy et al., 2018).

3.2 Sample Complexity and Concentration

  • The number of templates required for a faithful (low-error, high-margin) embedding is governed by two quantities: an upper bound on the Wasserstein distance and the separation margin.
  • When working with empirical measures estimated from $n$ samples, the error in the estimated Wasserstein distances shrinks as $n$ grows, at a rate that degrades with the dimension $d$ of the underlying space (Rakotomamonjy et al., 2018, Cloninger et al., 2023).
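The finite-sample effect can be observed directly in one dimension. The sketch below draws two independent samples from the same standard normal distribution, so the population distance is zero and the measured $W_1$ is pure estimation error. The sample sizes, trial count, and seed are illustrative assumptions, and the experiment says nothing about rates in higher dimensions.

```python
import numpy as np

def w1_sorted(x, y):
    """W_1 between two equal-size 1D empirical measures (sorted matching)."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def mean_sampling_error(n, trials=50, seed=0):
    """Average W_1 between two independent n-samples of the same N(0, 1).

    The true distance is 0, so the returned value is estimation error only.
    """
    rng = np.random.default_rng(seed)
    errs = [w1_sorted(rng.normal(size=n), rng.normal(size=n)) for _ in range(trials)]
    return float(np.mean(errs))

errors = {n: mean_sampling_error(n) for n in (10, 100, 1000)}
for n, e in errors.items():
    print(n, round(e, 4))
```

In practice this error propagates into every entry of a template or MDS distance matrix, so the per-distribution sample size sets a floor on the achievable embedding accuracy.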

3.3 Approximation Error Induced by Compression/Linearization

Using linearized optimal transport (LOT) instead of pairwise OT distances introduces an additive distortion. The embedding error is upper-bounded by the combined error from linearization, regularization (if using Sinkhorn), and finite-sample estimation, all controlled with explicit high-probability inequalities (Cloninger et al., 2023). For exact translation or scaling families, zero loss is achievable.
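In one dimension the linearized embedding has a simple form: with the uniform measure on $(0, 1)$ as reference, the optimal map to $\mu$ is its quantile function, and the $L^2$ distance between quantile functions equals $W_2$. The sketch below discretizes this on a quantile grid. The grid size, function name, and use of `np.quantile` with its default linear interpolation are illustrative assumptions; higher-dimensional LOT requires an actual OT solve per measure.

```python
import numpy as np

def lot_embedding_1d(samples, n_quantiles=200):
    """Quantile function on a midpoint grid of (0, 1), scaled so that the
    Euclidean norm of a difference approximates the L2 norm on (0, 1)."""
    t = (np.arange(n_quantiles) + 0.5) / n_quantiles
    q = np.quantile(np.asarray(samples, dtype=float), t)
    return q / np.sqrt(n_quantiles)

rng = np.random.default_rng(0)
x = rng.normal(size=500)

e_base = lot_embedding_1d(x)
e_shift = lot_embedding_1d(x + 2.5)          # translation by 2.5
e_scale = lot_embedding_1d(2.0 * x)          # dilation by 2

dist_shift = float(np.linalg.norm(e_shift - e_base))
print(dist_shift)                             # matches the shift size up to rounding
# Dilation acts linearly on the embedding: the quantile function is scaled by 2.
print(np.allclose(e_scale, 2.0 * e_base))
```

Each distribution is embedded once, so $N$ distributions need $N$ embeddings and pairwise distances are plain Euclidean distances afterward. Translations and dilations act as affine maps on the embedding, which is consistent with the zero-loss statement for those families.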

4. Algorithmic Realizations

4.1 Discrete and Entropic OT Computation

  • Discrete OT for histograms with $n$ points can be solved exactly via network simplex in roughly $O(n^3 \log n)$ time, or approximately with Sinkhorn regularization at a per-pair cost that is near-quadratic in $n$. Both have GPU acceleration options (Rakotomamonjy et al., 2018).
  • For Gaussian measures, $W_2$ can be computed in closed form (Bures metric, $O(d^3)$ via eigendecomposition of the covariance matrices) (Rakotomamonjy et al., 2018, Bachmann et al., 2022).
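Both computations in the list above are short to write down. The sketch below gives a basic Sinkhorn iteration, where each sweep costs two matrix-vector products, and the closed-form Gaussian $W_2$ using eigendecomposition-based matrix square roots. The regularization strength `eps`, the iteration count, and the example inputs are illustrative assumptions. The Sinkhorn loop is not log-stabilized, so it is only reliable when `eps` is not too small relative to the costs.

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps=0.5, n_iter=200):
    """Entropy-regularized OT plan between histograms a and b for cost matrix C."""
    K = np.exp(-C / eps)                     # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):                  # alternate scaling to match each marginal
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def psd_sqrt(S):
    """Symmetric PSD square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_w2(m1, S1, m2, S2):
    """Closed-form W_2 between N(m1, S1) and N(m2, S2) (Bures formula)."""
    r = psd_sqrt(S1)
    cross = psd_sqrt(r @ S2 @ r)
    w2_sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))

# Sinkhorn on a 5-point grid with squared-distance cost.
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2
a = np.full(5, 0.2)
b = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
P = sinkhorn_plan(a, b, C)
reg_cost = float(np.sum(P * C))

# Gaussians with commuting (diagonal) covariances: W_2^2 = (1-3)^2 + (2-4)^2 = 8.
g = gaussian_w2(np.zeros(2), np.diag([1.0, 4.0]), np.zeros(2), np.diag([9.0, 16.0]))
print(reg_cost, g)
```

Smaller `eps` brings the Sinkhorn cost closer to the unregularized OT cost but slows convergence, which is the trade-off behind the regularization error term mentioned earlier.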

4.2 Efficient Embedding Construction

  • Template embedding: construction of the matrix of distances from each input distribution to each template, followed by linear SVM training.
  • MDS-based approaches: $O(N^2)$ pairwise OT computations and an $O(N^3)$ eigendecomposition for $N$ distributions; LOT-based approaches reduce the number of OT solves to $O(N)$ (Cloninger et al., 2023).
  • Neural methods: training cost is dominated by batch size, the OT/Sinkhorn computation per pair, and the encoder/decoder forward and backward passes. Batch-based co-embedding and Sinkhorn backpropagation scale efficiently (Frogner et al., 2019, Haviv et al., 2024).

5. Practical Applications and Empirical Outcomes

Table: Selected Wasserstein Distance-Preserving Embedding Methods and Their Applications

| Method (Paper) | Principle | Typical Applications |
| --- | --- | --- |
| Template Wasserstein Embedding (Rakotomamonjy et al., 2018) | Distances to reference measures | Distribution classification (scenes, point clouds) |
| MDS/Wassmap (Hamm et al., 2022) | MDS on the Wasserstein distance matrix | Image manifolds, synthetic translations/dilations |
| LOT Wassmap (Cloninger et al., 2023) | Linearized OT | Large-scale manifold discovery, high-dimensional OT |
| Neural Metric Embedding (Courty et al., 2017, Haviv et al., 2024) | Deep metric learning | Fast OT-based similarity (images, point clouds) |
| Entropic Wasserstein PCA (Collas et al., 2023) | Subspace projection via OT | Gene expression analysis, cluster structure |
| Wasserstein t-SNE (Bachmann et al., 2022) | Low-dimensional visualization via Wasserstein distances | Embedding of hierarchical/grouped data |
| Stochastic Vision Transformers (Erick et al., 2023) | OT-aware attention | Image SSL, OOD detection, calibration |
| Cantor–Wasserstein (Loomis et al., 2022) | Symbolic sequence embeddings | Predictive state geometry, sequence clustering |

Empirical studies report that Wasserstein distance-preserving embeddings outperform kernel mean embeddings and other Euclidean geometry-based approaches on real and synthetic distributional classification, manifold learning, and representation tasks, especially in the presence of non-Euclidean or hierarchical structure. For example, in 3D point-cloud classification, the Wasserstein template embedding is reported to reach higher accuracy than kernel mean methods (Rakotomamonjy et al., 2018). On synthetic translation/dilation manifolds, Wasserstein MDS recovers latent coordinates up to rigid motion with nearly zero stress (Hamm et al., 2022). Neural approaches substantially accelerate OT computations in high-throughput settings (Courty et al., 2017).

6. Extensions, Limitations, and Open Challenges

Theoretical isometry is only achievable in narrow cases: $p = 1$ (tree or Cantor geometry) or distributions restricted to translation/dilation subgroups. In the general case, especially for non-Gaussian and high-dimensional supports, embeddings unavoidably incur distortion, controlled by sample size, manifold curvature, and network/exemplar capacity. Approximate, scalable methods such as linearized OT, entropic regularization, and neural metric learning provide practical compromises, with distortion controlled empirically and in some cases theoretically bounded (Cloninger et al., 2023, Frogner et al., 2019). For symbolic data, Cantor embeddings furnish a bi-Lipschitz mapping into a one-dimensional space with uniform distortion constants, enabling effective clustering and visualization with 1D Wasserstein distances (Loomis et al., 2022).

Current limitations include distortion/dimensional blowup for generic metrics, challenges in scaling to very high-dimensional or continuous measure spaces, and lack of universal approximation bounds for deep-neural OT embeddings away from locally-concentrated or tree-like metric spaces. Ongoing research investigates embedding universality, optimal reference measure selection for LOT and related approaches, explicit distortion rates in neural embeddings, and specialized architectures for structured domains (e.g., graphs, images, spatial-temporal processes).

Wasserstein distance-preserving embeddings bridge optimal transport, metric/representation learning, and dimensionality reduction. They build on, and generalize, kernel mean embeddings (RKHS), classical MDS, and manifold learning, while exploiting the geometric and probabilistic structure of distributions. Applications span distributional supervised learning, OT-based clustering, visualization, generative modeling, robust and uncertainty-aware deep learning, as well as interpretable sequence and hierarchical data analysis.

Significant synergies exist with Gromov–Wasserstein geometry for structural data comparison, with distributional regularization in Bayesian and deep models, and with recent advances in scalable OT computation—particularly in settings where computational tractability and geometric fidelity of distributional relationships are critical.
