
Geometric Dataset Distances: A Review

Updated 8 December 2025
  • Geometric dataset distances are quantitative metrics that capture both statistical and geometric differences between datasets using methods like optimal transport and manifold learning.
  • They integrate mathematical tools such as Mahalanobis distances, Wasserstein metrics, and spectral methods to compare complex structural features while preserving invariance properties.
  • These distances find extensive applications in transfer learning, clustering, and domain adaptation, offering precise measurements for high-dimensional and topologically rich data.

A geometric dataset distance is a quantitative measure of dissimilarity or divergence between datasets that incorporates the geometric structure of the underlying sample, feature, or label spaces. Such distances are designed to capture differences not only at the level of sample statistics, but also in the spatial, topological, or manifold-based relations present in the datasets’ representations. The last decade has seen rapid theoretical development and empirical benchmarking of geometric dataset distances, drawing from fields including optimal transport, manifold learning, graph theory, Riemannian geometry, and topological data analysis.

1. Mathematical Foundations and Taxonomy

Geometric dataset distances admit a diverse mathematical basis, depending on the domain and design goal. At a broad level, these distances can be categorized by the structures over which they are defined and the invariances they encode.

  • Summary-statistics-based: Rely on chosen summary vectors (means, higher-order moments) and explicitly model covariance of features, e.g., the Constrained Minimum (CM) distance formalized as a Mahalanobis distance over summary means (Tatti, 2019).
  • Optimal transport and Wasserstein-type: Treat datasets as empirical distributions and define the distance via cost-minimizing couplings, with hybrid ground metrics combining feature and label dissimilarities (Alvarez-Melis et al., 2020, Nguyen et al., 31 Jan 2025).
  • Intrinsic and spectral methods: Employ operators tied to manifold geometry (e.g., Laplace–Beltrami) and compare the (regularized, possibly unaligned) spectra via metrics on SPD matrices (Shnitzer et al., 2022).
  • Geometric graphs and topological summaries: Datasets encoded as graphs are compared via edit distances, earth-mover approaches, merge-tree interleaving, or Delaunay-based alignments (Majhi et al., 2022, Majhi, 2023, Medbouhi et al., 12 Apr 2024, Chambers et al., 12 Jul 2024).
  • Geodesic approximation: When datasets sample an underlying manifold, graph-geodesic or spherelet-based estimates approximate the intrinsic geodesic distance matrix (Shamai et al., 2016, Li et al., 2019).
  • Topological interpolative distances: Quantify topological change along dataset homotopies using vineyards, tracking the metricized “work” performed by births, deaths, and motions of persistent homological features (Arulandu et al., 28 Oct 2025).
  • Hyperbolic and non-Euclidean models: For datasets with latent hierarchies or exponential growth, distances reflect geodesics or alignment in hyperbolic space (Medbouhi et al., 12 Apr 2024, Li et al., 30 May 2025).

The formal properties—whether pseudo-metric, metric, or non-metric—depend on the construction and the invariance requirements (e.g., representation-invariance or sample-symmetry).

2. Representative Methodologies

| Class | Notable Distances / Models | Paper(s) |
|---|---|---|
| Summary/Mahalanobis | CM (Constrained Minimum) Distance | (Tatti, 2019) |
| Optimal Transport | OTDD (Optimal-Transport Dataset Distance), s-OTDD | (Alvarez-Melis et al., 2020, Nguyen et al., 31 Jan 2025) |
| Manifold Spectral | Log-Euclidean Signatures (LES) | (Shnitzer et al., 2022) |
| Geometric Graphs | GGD, GED, GMD, Labeled Merge-Tree | (Majhi et al., 2022, Majhi, 2023, Chambers et al., 12 Jul 2024) |
| Geodesic Estimation | Landmark/Nyström MDS, Spherelet Geodesics | (Shamai et al., 2016, Li et al., 2019) |
| Topological | Vineyard Distance | (Arulandu et al., 28 Oct 2025) |
| Hyperbolic | HyperDGA, Hyperbolic distribution matching | (Medbouhi et al., 12 Apr 2024, Li et al., 30 May 2025) |

Summary-Statistic Based: CM Distance

The key instance is the CM distance, which measures the Mahalanobis distance between empirical feature means, with the inverse of the uniform feature covariance as kernel. For datasets $D_1, D_2$ with feature means $\theta_1, \theta_2$ and invertible covariance $\operatorname{Cov}(S)$,

$$\operatorname{CM}(D_1, D_2; S)^2 = (\theta_1 - \theta_2)^\top [\operatorname{Cov}(S)]^{-1} (\theta_1 - \theta_2)$$

This arises uniquely as the only Mahalanobis-type distance on dataset summary means that is invariant to feature reparameterization and sample relabeling (Tatti, 2019).
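A minimal NumPy sketch of this computation (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def cm_distance(D1, D2, S):
    """Minimal sketch of the CM distance: Mahalanobis distance between the
    empirical feature means of D1 and D2, with kernel [Cov(S)]^{-1}.
    Arrays are (n_samples, n_features); names are illustrative."""
    diff = D1.mean(axis=0) - D2.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(S, rowvar=False))  # assumes invertibility
    return float(np.sqrt(diff @ cov_inv @ diff))
```

In practice the covariance may be near-singular, in which case a pseudo-inverse or regularization would be needed.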

Optimal Transport Dataset Distances

For labeled data in $X \times Y$, the OTDD is the minimum expected ground cost over couplings $\pi$ between empirical samples, where the ground cost blends feature and label geometry:

$$d_Z\big((x, y), (x', y')\big) = \big[\, d_X(x, x')^p + W_p^p(\alpha_y, \alpha_{y'}) \,\big]^{1/p}$$

with $W_p$ the Wasserstein distance between the label-conditional feature distributions $\alpha_y, \alpha_{y'}$. The overall dataset distance is the corresponding OT objective. The sliced OTDD (s-OTDD) projects datapoints and label-conditionals into 1D, computes Wasserstein-1 distances in each slice, and averages, yielding a scalable approximation highly correlated with OTDD (Nguyen et al., 31 Jan 2025).
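The slicing mechanism can be sketched as follows; this toy version projects only the features and omits the label-conditional Wasserstein term, so it illustrates the idea rather than the full s-OTDD:

```python
import numpy as np

def w1_1d(u, v):
    # W1 between equal-size 1D samples: mean absolute gap of sorted values.
    return np.abs(np.sort(u) - np.sort(v)).mean()

def sliced_feature_distance(X1, X2, n_slices=50, seed=0):
    """Illustrates only the slicing step: random 1D projections of the
    features with averaged 1D Wasserstein costs. The label-conditional
    terms of the actual s-OTDD are omitted."""
    rng = np.random.default_rng(seed)
    d = X1.shape[1]
    total = 0.0
    for _ in range(n_slices):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)          # random unit direction
        total += w1_1d(X1 @ theta, X2 @ theta)  # 1D W1 in this slice
    return total / n_slices
```

The key design point is that each 1D Wasserstein distance costs only a sort, which is what gives the slice-based approach its near-linear scaling.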

Manifold and Alignment-Free Intrinsic Distances

The Log-Euclidean Signature (LES) uses graph-Laplacian-based diffusion operators as intrinsic summaries and compares them via the Frobenius norm of their log-spectra after regularization and truncation:

$$d_\mathrm{LES}(W_1, W_2)^2 = \sum_{i=1}^K \big[\log(\hat{\lambda}_i^{(1)}+\gamma) - \log(\hat{\lambda}_i^{(2)}+\gamma)\big]^2$$

This enables comparison of datasets with different sizes and feature dimensions, without any alignment (Shnitzer et al., 2022).
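A rough sketch of an LES-style comparison, assuming a Gaussian-kernel diffusion operator; the bandwidth, kernel choice, and eigenvalue clipping below are illustrative simplifications, not the paper's exact construction:

```python
import numpy as np

def les_distance(X1, X2, K=5, gamma=1e-8, sigma=1.0):
    """Sketch of an LES-style spectral comparison: build a diffusion
    operator per dataset, keep the top-K eigenvalues, and compare
    regularized log-spectra. Simplified relative to the paper."""
    def top_spectrum(X):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        W = np.exp(-sq / (2 * sigma ** 2))                   # Gaussian affinity
        P = W / W.sum(axis=1, keepdims=True)                 # diffusion operator
        lam = np.linalg.eigvalsh((P + P.T) / 2)              # symmetrized spectrum
        return np.maximum(np.sort(lam)[::-1][:K], 0.0)       # top-K, clip negatives
    l1, l2 = top_spectrum(X1), top_spectrum(X2)
    diff = np.log(l1 + gamma) - np.log(l2 + gamma)           # regularized log-spectra
    return float(np.sqrt((diff ** 2).sum()))
```

Because both spectra are truncated to length K, the two datasets may have different sizes and even different feature dimensions.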

Geometric Graph and Merge Tree Metrics

Classic measures—GED (geometric edit distance) and GGD (geometric graph distance)—are based on minimal-cost sequences of vertex/edge insertions, deletions, and translations, or on minimal-cost inexact matchings. The graph mover's distance (GMD) relaxes GGD to an Earth-Mover problem on adjacency-length vectors, allowing polynomial-time computation (Majhi, 2023). The labeled merge-tree interleaving distance computes the $L^\infty$ norm between matrices derived from the merge trees under directional transforms, integrated over directions (Chambers et al., 12 Jul 2024).
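As a toy illustration of the inexact-matching viewpoint behind these graph distances, one can match two vertex sets at minimal total Euclidean cost with the Hungarian algorithm; the real GGD/GED/GMD constructions also penalize edge discrepancies, which this sketch ignores:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_vertex_matching_cost(V1, V2):
    """Toy relaxation in the spirit of inexact graph matching: pair up two
    equal-size vertex sets at minimal total Euclidean cost (Hungarian
    algorithm). Edge structure, central to GGD/GMD, is ignored here."""
    C = np.sqrt(((V1[:, None, :] - V2[None, :, :]) ** 2).sum(-1))  # cost matrix
    rows, cols = linear_sum_assignment(C)                          # optimal matching
    return float(C[rows, cols].sum())
```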

Geodesic Approximations

Methods such as landmark-based Nyström approximations, spherelet-based local geodesic regressions, and classical scaling (MDS) are used for datasets assumed to be sampled from manifolds. These methods approximate global geodesic matrices and can reconstruct high-fidelity embeddings or distance predictions (Shamai et al., 2016, Li et al., 2019).
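The graph-geodesic step these methods build on can be sketched with a k-nearest-neighbor graph and all-pairs shortest paths; the O(n^3) Floyd-Warshall pass here is precisely what landmark/Nyström variants accelerate:

```python
import numpy as np

def graph_geodesics(X, k=5):
    """Sketch of graph-geodesic approximation to manifold distances:
    connect each point to its k nearest neighbors, then compute all-pairs
    shortest paths. Floyd-Warshall is O(n^3), hence the need for
    landmark/Nystrom accelerations at scale."""
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Euclidean dists
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:   # k nearest neighbors (skip self)
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                        # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G
```

On points sampled from a curved manifold, the resulting distances track arc length rather than chord length, which is the quantity the embedding step needs.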

Topology-Informed and Non-Euclidean Distances

Vineyard distance aggregates the total metricized motion (weighted arc-length) of persistent homological features along a homotopy between two functions/datasets, situating itself between $L^p$ and Wasserstein diagram distances in sensitivity (Arulandu et al., 28 Oct 2025). Hyperbolic alignment metrics such as HyperDGA and hyperbolic centroid losses are designed for hierarchical data representations and exploit geodesic and centroid structure in negatively curved geometry (Medbouhi et al., 12 Apr 2024, Li et al., 30 May 2025).

3. Computational Aspects and Scalability

Computational complexity is a key limiting factor in the applicability of geometric dataset distances.

  • Summary-statistic methods (CM) take $O(n^3)$ time for general features and $O(n)$ for certain binary/linear cases (Tatti, 2019).
  • Optimal transport methods for empirical datasets scale as $O(n^3 \log n)$ (Hungarian) or $O(n^2 \log n / \epsilon^3)$ (Sinkhorn), where $n$ is the dataset size. s-OTDD achieves $O(Ln(d + \log n))$ for $L$ slices (Nguyen et al., 31 Jan 2025).
  • Graph-based distances (GGD, GED) are $\mathcal{NP}$-hard; relaxations (GMD) are $O(n^3)$ (Majhi, 2023).
  • Nyström and spherelet geodesic approximations are quasi-linear or $O(nk^2)$, where $k$ is the landmark or local neighborhood size (Shamai et al., 2016, Li et al., 2019).
  • Topological (vineyard) distances involve dynamic persistence calculations; with vineyard updates, complexity is $O(\mathrm{PH} + m \log m)$, with $\mathrm{PH}$ the cost of persistence on a single complex (Arulandu et al., 28 Oct 2025).

Empirical benchmarks show that scalable approximations (e.g., s-OTDD, landmark-MDS, GMD) retain high correlation with transfer, clustering, or alignment goals while accommodating data of size $n \gg 10^4$ (Nguyen et al., 31 Jan 2025, Majhi, 2023, Shamai et al., 2016).
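For concreteness, the Sinkhorn route mentioned above can be sketched in a few lines; this minimal version assumes uniform marginals and a fixed iteration count, whereas a practical implementation would add convergence checks and log-domain stabilization:

```python
import numpy as np

def sinkhorn(C, reg=0.1, n_iter=200):
    """Entropy-regularized OT between two uniform empirical measures with
    cost matrix C, via Sinkhorn matrix scaling. Returns the transport
    cost <P, C>. Minimal sketch: no convergence test, no log-domain
    stabilization."""
    n, m = C.shape
    K = np.exp(-C / reg)                      # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m     # uniform marginals
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)                       # alternate marginal scalings
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]           # transport plan
    return float((P * C).sum())
```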

4. Theoretical Properties and Invariances

These distances are designed to satisfy desirable metric, pseudo-metric, or invariance properties depending on the goal.

  • CM distance: uniqueness under Mahalanobis-type structure, representation-invariance, and label-symmetry. It is a pseudo-metric; adding redundant features or enlarging sample size scales it predictably (Tatti, 2019).
  • OTDD, s-OTDD: true metrics on joint measures, stable under permutations and disjoint label sets, converge as sample sizes grow (Alvarez-Melis et al., 2020, Nguyen et al., 31 Jan 2025).
  • GMD: pseudo-metric; fails separability but satisfies triangle inequality and symmetry; robust to small geometric perturbations (Majhi, 2023).
  • Labeled merge-tree and vineyard distances: invariant under rigid motions of the ambient space, but may lack the triangle inequality (merge-tree), or be sensitive to the path of homotopy (vineyard) (Chambers et al., 12 Jul 2024, Arulandu et al., 28 Oct 2025).
  • Hyperbolic distances: preserve group invariance under isometries of hyperbolic space; centroid distances in Lorentz models encode tree-like hierarchies automatically (Medbouhi et al., 12 Apr 2024, Li et al., 30 May 2025).

These structural properties facilitate robust deployment in settings where invariance to alignment, feature scaling, or graph representation is required.
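The representation-invariance claimed for the CM distance can be checked numerically: applying any invertible linear map to the features leaves the Mahalanobis distance between means unchanged (dataset names below are illustrative):

```python
import numpy as np

def mahalanobis_mean_distance(D1, D2, S):
    # Mahalanobis distance between feature means, kernel [Cov(S)]^{-1}.
    diff = D1.mean(axis=0) - D2.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(S, rowvar=False))
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
D1 = rng.normal(size=(60, 3))
D2 = rng.normal(loc=0.5, size=(60, 3))
S = np.vstack([D1, D2])

A = rng.normal(size=(3, 3))        # a.s. invertible feature reparameterization
before = mahalanobis_mean_distance(D1, D2, S)
after = mahalanobis_mean_distance(D1 @ A, D2 @ A, S @ A)
```

Since the means transform as $\theta \mapsto A^\top\theta$ and the covariance as $\operatorname{Cov} \mapsto A^\top\operatorname{Cov}A$, the two factors of $A$ cancel against the inverse, so `before` and `after` agree up to floating-point error.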

5. Application Domains and Empirical Findings

Geometric dataset distances have been validated across multiple domains:

  • Text and language modeling: Intrinsic distances (CM, LES, OTDD) reveal genre, epoch, and topic structure in corpora, and predict transfer learning hardness, outperforming nongeometric or purely label-based distances (Tatti, 2019, Shnitzer et al., 2022, Alvarez-Melis et al., 2020).
  • Vision and domain transfer: OTDD, s-OTDD, and GMD preserve class semantics, correlate with transfer gaps and augmentation efficacy, and discover semantic alignments even for disjoint label sets (Nguyen et al., 31 Jan 2025, Alvarez-Melis et al., 2020, Majhi, 2023).
  • Graphs and biological data: GMD and labeled merge-tree distances achieve high clustering accuracy for handwritten letter graphs, leaf morphologies, and sensor-derived network data; merge-tree and vineyard distances extract topological distinctions in shapes or spatial distributions (Majhi, 2023, Chambers et al., 12 Jul 2024, Arulandu et al., 28 Oct 2025).
  • Manifold and latent geometry: Landmark-based geodesic approximations and spherelet geodesics yield high spectral fidelity and improved downstream clustering, regression, and density estimation in simulation and real data (Shamai et al., 2016, Li et al., 2019).
  • Hierarchical and hyperbolic representations: Hyperbolic alignment metrics (HyperDGA, centroid matching) show strong monotonicity with noise and biological hierarchy, outperforming Euclidean analogues in single-cell and tree-structured data, and stabilizing distillation in deep learning (Medbouhi et al., 12 Apr 2024, Li et al., 30 May 2025).
  • Topological time-series and function monitoring: Vineyard distance detects both global and local variations along interpolations, separating functional classes that are indistinguishable by $L^p$ or classic persistence distances (Arulandu et al., 28 Oct 2025).

6. Limitations, Open Problems, and Future Directions

Major challenges include computational intractability for exact combinatorial distances (e.g., GGD, GED), unclear metrics on partial or attributed graphs, and sensitivity to feature selection in summary-statistic-based methods. Some pseudo-metric distances lack separability or the triangle inequality. Scalability for high-throughput settings is addressed via slicing, Nyström, and surrogate models, but this may come at the cost of decreased discriminative power for fine-grained differences.

Open problems suggested in the literature include tightening stability results to depend on local rather than global graph structure (GMD), extending merge-tree and vineyard-based approaches to multi-parameter or higher-dimensional filtrations, and further integrating geometric and topological statistics with probabilistic or distribution-matching frameworks.

7. Summary Table: Key Properties and Complexities

| Distance Type | Metricity | Alignment required | Complexity | Notable Invariance |
|---|---|---|---|---|
| CM (Mahalanobis) | Pseudo-metric | No | $O(n^3)$ | Feature, label symmetry |
| OTDD / s-OTDD | Metric | No | $O(n^3)$ / $O(nL)$ | Disjoint labels, feature scaling |
| LES | Metric (lower bound) | No | $O(KN^2)$ | Feature ordering, regularization |
| GGD, GED | Metric | Yes | NP-hard | Geometric and combinatorial equiv. |
| GMD | Pseudo-metric | Yes (ordered) | $O(n^3)$ | Hausdorff stability |
| Spherelet Geodesics | Metric (approximate) | No | $O(n^2)$ | Intrinsic manifold geometry |
| Vineyard | Not necessarily metric | No | $O(N^2 \log N)$ | Homotopy path dependent |
| HyperDGA / Hyperbolic | Pseudo-metric | No | $O(m \log m)$ | Hyperbolic isometry, hierarchy |

The field continues to evolve rapidly with innovative approaches to efficiently and robustly quantify geometric, topological, and hierarchical differences between datasets for use in transfer learning, clustering, domain adaptation, and manifold discovery.
