Wasserstein Distance-Preserving Embeddings
- Wasserstein distance-preserving embeddings are mappings that embed probability distributions into latent spaces while retaining optimal transport geometry.
- They utilize methods such as template-based approaches, multidimensional scaling, and neural architectures to enable efficient and scalable distribution analysis.
- These embeddings enhance tasks like classification and manifold learning by capturing structural and multimodal relationships with provable distortion bounds.
A Wasserstein distance-preserving embedding is a mapping from a space of probability measures (distributions) equipped with the $p$-Wasserstein metric into a Euclidean or structured latent space such that the Euclidean (or other) geometry of embeddings closely approximates the original Wasserstein geometry. Unlike kernel mean embeddings which map distributions to points in a Hilbert space by averaging, Wasserstein embeddings directly encode optimal transport-based geometry—capturing structural, multimodal, or hierarchical relationships between distributions that kernel approaches often miss. Such embeddings facilitate scalable, distance-aware learning and analysis of datasets where each object is a probability distribution (e.g., histograms, point clouds, sequence distributions, or sets of images).
1. Formal Definition and Theoretical Foundations
Let $\mathcal{P}_p(\mathcal{X})$ be the space of Borel probability measures with finite $p$-th moments on a Polish metric space $(\mathcal{X}, d)$. The $p$-Wasserstein distance is
$$
W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, \mathrm{d}\pi(x, y) \right)^{1/p},
$$
where $\Pi(\mu, \nu)$ is the set of couplings with marginals $\mu$ and $\nu$.
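As a concrete illustration of this definition, the following minimal sketch computes $W_2$ between two discrete empirical measures. It assumes the POT library (`pip install pot`), which is not mentioned in the cited papers and stands in for any exact OT solver.

```python
# Minimal sketch: W_2 between two discrete measures on R^2 (POT assumed).
import numpy as np
import ot  # Python Optimal Transport (POT)

rng = np.random.default_rng(0)

# Two empirical measures: weighted point clouds in R^2.
X = rng.normal(size=(50, 2))              # support of mu
Y = rng.normal(loc=3.0, size=(60, 2))     # support of nu
a = np.full(50, 1 / 50)                   # uniform weights of mu
b = np.full(60, 1 / 60)                   # uniform weights of nu

# Ground cost c(x, y) = ||x - y||^2, so emd2 returns W_2^2.
M = ot.dist(X, Y, metric="sqeuclidean")
w2 = np.sqrt(ot.emd2(a, b, M))
print(f"W_2(mu, nu) = {w2:.3f}")
```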
A distance-preserving or isometric embedding is a map $\phi \colon \mathcal{P}_p(\mathcal{X}) \to \mathbb{R}^k$ (or another normed space) such that
$$
\lVert \phi(\mu) - \phi(\nu) \rVert = W_p(\mu, \nu) \quad \text{for all } \mu, \nu.
$$
Sometimes the mapping is only bi-Lipschitz with distortion $D \ge 1$:
$$
\tfrac{1}{D}\, W_p(\mu, \nu) \le \lVert \phi(\mu) - \phi(\nu) \rVert \le W_p(\mu, \nu).
$$
For $W_1$ on a finite metric space $\mathcal{X}$, it is known that if $\mathcal{X}$ admits a stochastic embedding into metric trees with distortion $D$, then $W_1(\mathcal{X})$ bi-Lipschitz embeds into $\ell_1$ with the same distortion using the Evans–Matsen formula, which enables explicit, linear-complexity isometric embeddings into $\ell_1$ (Mathey-Prevot et al., 2021). More generally, for $p > 1$, perfect isometry is not possible in general, but embeddings with small distortion exist for large finite metrics (Frogner et al., 2019).
2. Canonical Construction Approaches
2.1 Template-based Wasserstein Embeddings
The template (or dissimilarity) embedding constructs a map from a probability measure $\mu$ to a vector of Wasserstein distances to a selected set of template distributions $\nu_1, \dots, \nu_M$:
$$
\phi(\mu) = \frac{1}{K} \big( W_p(\mu, \nu_1), \dots, W_p(\mu, \nu_M) \big),
$$
where $K$ is a normalization constant (e.g., a bound on $W_p$). The quality of the embedding—such as linear separability for classification—follows from the "good dissimilarity" theory, which guarantees that, for sufficiently many templates $M$, the resulting feature cloud can be separated with low error by a linear classifier, provided the Wasserstein distance is a good dissimilarity for the task (Rakotomamonjy et al., 2018).
The template set can be chosen by random sampling, clustering (e.g., $k$-means in an RKHS), or heuristically.
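Below is a minimal sketch of this template embedding for point-cloud data, assuming the POT library and uniform weights; the templates are taken as a subsample of the data, which is only one of the selection strategies mentioned above.

```python
# Minimal sketch: template-based Wasserstein embedding
# phi(mu) = (W_2(mu, nu_1), ..., W_2(mu, nu_M)) / K.
import numpy as np
import ot  # Python Optimal Transport (POT)

def w2(X, Y):
    """Exact W_2 between uniform empirical measures on point clouds X and Y."""
    a = np.full(len(X), 1 / len(X))
    b = np.full(len(Y), 1 / len(Y))
    M = ot.dist(X, Y, metric="sqeuclidean")
    return np.sqrt(ot.emd2(a, b, M))

def template_embedding(clouds, templates, K=1.0):
    """One feature vector per distribution: its distances to the templates."""
    return np.array([[w2(X, T) / K for T in templates] for X in clouds])

rng = np.random.default_rng(0)
clouds = [rng.normal(loc=rng.uniform(-2, 2, size=2), size=(40, 2)) for _ in range(20)]
templates = clouds[:5]                     # e.g. a subsample used as templates
Phi = template_embedding(clouds, templates)
print(Phi.shape)                           # (20, 5): one feature vector per measure
```

The rows of `Phi` can then be fed to a linear classifier (e.g., a linear SVM), following the good-dissimilarity argument above.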
2.2 Metric MDS with Wasserstein Geometry
Classical multidimensional scaling (MDS) is used to construct embeddings such that Euclidean distances align with a given dissimilarity matrix. For distributions $\mu_1, \dots, \mu_N$, define the matrix $D_{ij} = W_2(\mu_i, \mu_j)$ (or its square). The MDS embedding seeks points $z_1, \dots, z_N \in \mathbb{R}^k$ such that
$$
\lVert z_i - z_j \rVert \approx W_2(\mu_i, \mu_j) \quad \text{for all } i, j.
$$
This approach is "exact" for Wasserstein-flat manifolds such as translation families, as shown in Isometric Wasserstein Mapping (Wassmap) (Hamm et al., 2022), and can be extended with more efficient runtime via the Linearized OT (LOT) approach using a single reference measure and only OT solves (Cloninger et al., 2023).
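A minimal sketch of the MDS step follows, assuming the pairwise squared Wasserstein matrix has already been computed. The synthetic example uses a translation family, where $W_2$ between translates equals the Euclidean distance between the shift vectors, so classical MDS recovers the shifts up to rigid motion, as claimed above.

```python
# Minimal sketch: classical MDS (Wassmap-style) on a squared W_2 distance matrix.
import numpy as np

def classical_mds(D2, k=2):
    """Embed from a matrix of squared dissimilarities via double centering."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                       # Gram matrix of the embedding
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]           # top-k eigenpairs
    scale = np.sqrt(np.clip(evals[idx], 0, None))
    return evecs[:, idx] * scale                # rows are the embedded coordinates

# For translates mu_i = mu_0(. - t_i), W_2(mu_i, mu_j) = ||t_i - t_j||, so MDS on the
# squared W_2 matrix recovers the translation parameters up to a rigid motion.
shifts = np.random.default_rng(0).uniform(-1, 1, size=(10, 2))
D2 = ((shifts[:, None, :] - shifts[None, :, :]) ** 2).sum(axis=-1)
Z = classical_mds(D2, k=2)
print(Z.shape)                                  # (10, 2)
```

In practice, `D2` would be filled with pairwise OT solves (e.g., via `ot.emd2` as in the earlier sketch) rather than computed analytically.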
2.3 Deep and Neural Wasserstein Metric Embeddings
Parametric neural architectures (Siamese, Transformer, encoder–decoder) can be trained on pairs of distributions $(\mu_i, \mu_j)$ to minimize the discrepancy between $\lVert \phi_\theta(\mu_i) - \phi_\theta(\mu_j) \rVert$ and $W_p(\mu_i, \mu_j)$ (or related Sinkhorn divergences). For instance, the Deep Wasserstein Embedding (DWE) model uses a Siamese CNN encoder with a coupled decoder for barycenter and inverse-mapping tasks, trained to minimize a distance-matching loss of the form
$$
\mathcal{L}(\theta) = \sum_{(i, j)} \Big( \lVert \phi_\theta(\mu_i) - \phi_\theta(\mu_j) \rVert_2 - W_2(\mu_i, \mu_j) \Big)^2,
$$
with optional reconstruction or sparsity penalties (Courty et al., 2017, Haviv et al., 15 Apr 2024).
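A minimal sketch of the distance-matching objective, assuming PyTorch: a generic MLP stands in for the Siamese CNN encoder of DWE, the decoder and reconstruction penalty are omitted, and the target $W_2$ values are assumed precomputed by an exact or Sinkhorn OT solver.

```python
# Minimal sketch: Siamese-style metric embedding trained so that
# ||phi(x_i) - phi(x_j)||_2 matches precomputed W_2(mu_i, mu_j) targets.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in MLP encoder phi_theta; DWE itself uses a Siamese CNN plus a decoder."""
    def __init__(self, in_dim, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim))

    def forward(self, x):
        return self.net(x)

def train_step(encoder, optimizer, xi, xj, w_target):
    """One gradient step on the loss (||phi(x_i) - phi(x_j)|| - W_2(mu_i, mu_j))^2."""
    zi, zj = encoder(xi), encoder(xj)
    pred = torch.norm(zi - zj, dim=1)
    loss = ((pred - w_target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: xi, xj are batches of vectorized histograms; w_target holds the
# corresponding precomputed Wasserstein distances.
encoder = Encoder(in_dim=64)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
```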
3. Distortion Bounds and Theoretical Guarantees
3.1 Rigorous (Bi-)Lipschitz Results
- For metrics that stochastically embed into trees, $W_1$ bi-Lipschitzly embeds into $\ell_1$ with a tight distortion bound matching that of the tree-metric embedding (Mathey-Prevot et al., 2021).
- In high dimensions or when $p > 1$, exact isometry is unattainable globally; however, finite $n$-point metrics can be embedded into Wasserstein spaces over low-dimensional domains with low distortion (universality results from metric embedding theory (Frogner et al., 2019)).
- For the template embedding, the margin and linear separability of the resulting feature cloud are controlled by the number of templates and the alignment of Wasserstein geometry with the true task classes (Rakotomamonjy et al., 2018).
3.2 Sample Complexity and Concentration
- The number of templates required for a faithful (low-error, high-margin) embedding scales on the order of $(B/\gamma)^2$ up to logarithmic factors, with $B$ a Wasserstein upper bound used for normalization and $\gamma$ the separation margin.
- When working with empirical measures estimated from $n$ samples, the estimation error scales as $O(n^{-\alpha})$, with $\alpha$ on the order of $1/d$ for $W_p$ over $d$-dimensional supports (Rakotomamonjy et al., 2018, Cloninger et al., 2023); a small numerical sanity check follows this list.
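A minimal sanity check of the finite-sample effect, restricted to one dimension where SciPy provides $W_1$ in closed form (the decay there is roughly $n^{-1/2}$; the $1/d$ exponent above concerns higher-dimensional supports):

```python
# Minimal sketch: finite-sample W_1 error between an empirical measure and
# (a proxy for) the true distribution, in 1D where W_1 has a closed form.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(size=200_000)       # large-sample proxy for the true measure
for n in (100, 1_000, 10_000):
    errs = [wasserstein_distance(rng.normal(size=n), reference) for _ in range(20)]
    print(n, round(float(np.mean(errs)), 4))   # error shrinks as n grows
```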
3.3 Approximation Error Induced by Compression/Linearization
Using linearized optimal transport (LOT) instead of pairwise OT distances introduces an additive distortion term; the embedding error is upper-bounded by the combined error from linearization, regularization (if using Sinkhorn), and finite-sample estimation—all controlled with explicit high-probability inequalities (Cloninger et al., 2023). For exact translation or scaling families, zero loss is achievable.
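A minimal sketch of a LOT embedding against a single reference measure, assuming the POT library: each measure is represented by the barycentric projection of its optimal coupling with the reference, and Euclidean distances between these representations serve as $W_2$ surrogates (exact for translations of the reference, consistent with the zero-loss remark above).

```python
# Minimal sketch: linearized OT (LOT) embedding via barycentric projection.
import numpy as np
import ot  # Python Optimal Transport (POT)

def lot_embed(ref, clouds):
    """Embed each point cloud via the barycentric projection of its plan onto `ref`."""
    n = len(ref)
    a = np.full(n, 1 / n)
    embeddings = []
    for X in clouds:
        b = np.full(len(X), 1 / len(X))
        P = ot.emd(a, b, ot.dist(ref, X, metric="sqeuclidean"))  # optimal coupling
        T = (P @ X) / a[:, None]                   # barycentric map evaluated on ref
        embeddings.append((T - ref).ravel() / np.sqrt(n))  # tangent-space coordinates
    return np.array(embeddings)

# Euclidean distances between LOT embeddings approximate W_2; for translations of
# the reference they are exact.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 2))
clouds = [ref + np.array([1.0, 0.0]), ref + np.array([0.0, 2.0])]
E = lot_embed(ref, clouds)
print(np.linalg.norm(E[0] - E[1]))   # close to sqrt(5) = ||(1,0) - (0,2)||
```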
4. Algorithmic Realizations
4.1 Discrete and Entropic OT Computation
- Discrete OT between histograms supported on $n$ points can be solved via the network simplex in $O(n^3 \log n)$, or approximated with Sinkhorn (entropic) regularization at roughly $O(n^2)$ cost per iteration per pair, both with GPU acceleration options (Rakotomamonjy et al., 2018).
- For empirical Gaussian measures, $W_2$ can be computed in closed form (Bures metric, via eigendecomposition) (Rakotomamonjy et al., 2018, Bachmann et al., 2022); a minimal sketch follows this list.
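A minimal sketch of the closed-form Gaussian case, computing matrix square roots with SciPy's `sqrtm` rather than an explicit eigendecomposition:

```python
# Minimal sketch: closed-form W_2 between two Gaussians via the Bures metric.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m1, C1, m2, C2):
    """W_2 between N(m1, C1) and N(m2, C2): mean shift plus Bures term."""
    C1_half = sqrtm(C1)
    cross = sqrtm(C1_half @ C2 @ C1_half)
    bures_sq = np.trace(C1 + C2 - 2 * np.real(cross))   # squared Bures distance
    return np.sqrt(np.sum((m1 - m2) ** 2) + max(bures_sq, 0.0))

m1, C1 = np.zeros(2), np.eye(2)
m2, C2 = np.array([3.0, 0.0]), 2.0 * np.eye(2)
print(gaussian_w2(m1, C1, m2, C2))   # sqrt(9 + 2 * (sqrt(2) - 1)^2), about 3.06
```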
4.2 Efficient Embedding Construction
- Template embedding: construction of the $N \times M$ distance matrix (one OT solve per measure–template pair), followed by linear SVM training.
- MDS-based approaches: $O(N^2)$ pairwise OT computations and an eigendecomposition; LOT-based approaches reduce the pairwise OT solves to $O(N)$ solves against a single reference (Cloninger et al., 2023).
- Neural methods: training cost is dominated by batch size, OT/Sinkhorn computation per pair, and encoder/decoder forward/backward passes. Batch-based co-embedding and Sinkhorn backpropagation scale efficiently (Frogner et al., 2019, Haviv et al., 15 Apr 2024).
5. Practical Applications and Empirical Outcomes
Table: Selected Wasserstein Distance-Preserving Embedding Methods and Their Applications
| Method (Paper) | Principle | Typical Applications |
|---|---|---|
| Template Wasserstein Embedding (Rakotomamonjy et al., 2018) | Distances to reference measures | Distribution classification (scene, point-clouds) |
| MDS/Wassmap (Hamm et al., 2022) | MDS on distance matrix | Image manifolds, synthetic translations/dilations |
| LOT Wassmap (Cloninger et al., 2023) | Linearized OT | Large-scale manifold discovery, high-dim OT |
| Neural Metric Embedding (Courty et al., 2017, Haviv et al., 15 Apr 2024) | Deep metric learning | Fast OT-based similarity (images, point clouds) |
| Entropic Wasserstein PCA (Collas et al., 2023) | Subspace projection via OT | Gene expression analysis, cluster structure |
| Wasserstein t-SNE (Bachmann et al., 2022) | Low-dim visualization via pairwise $W_2$ | Embedding of hierarchical/grouped data |
| Stochastic Vision Transformers (Erick et al., 2023) | OT-aware attention | Image SSL, OOD detection, calibration |
| Cantor–Wasserstein (Loomis et al., 2022) | Symbolic sequence embeddings | Predictive state geometry, sequence clustering |
Empirical studies show that Wasserstein distance-preserving embeddings outperform kernel mean embeddings and other Euclidean geometry-based approaches on real and synthetic distributional classification, manifold learning, and representation tasks—especially in the presence of non-Euclidean or hierarchical structure. For example, in 3D point-cloud classification, Wasserstein template embedding achieves markedly higher accuracy than kernel mean methods (Rakotomamonjy et al., 2018). On synthetic translation/dilation manifolds, Wasserstein MDS recovers latent coordinates up to rigid motion with nearly zero stress (Hamm et al., 2022). Neural approaches accelerate OT computations by orders of magnitude in high-throughput settings (Courty et al., 2017).
6. Extensions, Limitations, and Open Challenges
Theoretical isometry is only achievable in narrow cases: $W_1$ on tree or Cantor geometries, or distributions restricted to translation/dilation subgroups. In general—especially for non-Gaussian and high-dimensional supports—embeddings unavoidably incur distortion, controlled by sample size, manifold curvature, and network/exemplar capacity. Approximate, scalable methods such as linearized OT, entropic regularization, and neural metric learning provide practical compromises, with distortion controlled empirically and in some cases theoretically bounded (Cloninger et al., 2023, Frogner et al., 2019). For symbolic data, Cantor embeddings furnish a bi-Lipschitz mapping of sequences into the real line with uniform distortion constants, enabling effective clustering and visualization with 1D Wasserstein distances (Loomis et al., 2022).
Current limitations include distortion/dimensional blowup for generic metrics, challenges in scaling to very high-dimensional or continuous measure spaces, and lack of universal approximation bounds for deep-neural OT embeddings away from locally-concentrated or tree-like metric spaces. Ongoing research investigates embedding universality, optimal reference measure selection for LOT and related approaches, explicit distortion rates in neural embeddings, and specialized architectures for structured domains (e.g., graphs, images, spatial-temporal processes).
7. Connections to Related Research Areas
Wasserstein distance-preserving embeddings bridge optimal transport, metric/representation learning, and dimensionality reduction. They build on, and generalize, kernel mean embeddings (RKHS), classical MDS, and manifold learning, while exploiting the geometric and probabilistic structure of distributions. Applications span distributional supervised learning, OT-based clustering, visualization, generative modeling, robust and uncertainty-aware deep learning, as well as interpretable sequence and hierarchical data analysis.
Significant synergies exist with Gromov–Wasserstein geometry for structural data comparison, with distributional regularization in Bayesian and deep models, and with recent advances in scalable OT computation—particularly in settings where computational tractability and geometric fidelity of distributional relationships are critical.