Graph Wasserstein Distance (GWD)
- Graph Wasserstein Distance (GWD) is a metric family that uses optimal transport to compare graphs by representing them as probability measures over nodes and structural features.
- It extends classical Wasserstein approaches with Gromov–Wasserstein and fused strategies to capture both node embeddings and intrinsic graph structure.
- GWD underpins state-of-the-art methods for graph matching, classification, data augmentation, and robust alignment across diverse graph sizes and applications.
Graph Wasserstein Distance (GWD) defines a family of distances for comparing graphs based on optimal transport (OT) principles, leveraging both geometric and combinatorial properties of graphs. GWD and its relatives provide mathematically rigorous, alignment-free, and invariant metrics for graph matching, classification, barycenter computation, and related tasks across graph learning, topological data analysis, and cross-modal representation learning. The central idea is to characterize each graph by measures (probability distributions) over relevant objects associated to the graph (nodes, embeddings, structural features, subgraphs, graph signals), and to compare those distributions—often across different graph sizes or structures—using Wasserstein or Gromov–Wasserstein (GW) distances. Contemporary GWD formulations extend beyond classical node–feature OT to domain-relational (GW-type) distances, and may incorporate edge features, probabilistic node embeddings, topological invariants, or smooth graph signals. GWD has led to state-of-the-art algorithms for large-scale graph alignment, classification, data augmentation, cross-modal transfer, and topological analysis, with a growing portfolio of scalable, theoretically founded solvers.
1. Mathematical Foundations of GWD
1.1 Classical Wasserstein and Node Embedding Approaches
The classical OT-based “Graph Wasserstein Distance” constructs empirical measures on vector representations of each node. Given two graphs $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$, suppose that for each node $v \in V_1$ and $u \in V_2$ we compute node embeddings $x_v, y_u \in \mathbb{R}^d$ (e.g., using Weisfeiler–Lehman propagation, node2vec, GNNs, or kernel signatures). These give rise to empirical measures
$$\mu_1 = \frac{1}{|V_1|}\sum_{v \in V_1} \delta_{x_v}, \qquad \mu_2 = \frac{1}{|V_2|}\sum_{u \in V_2} \delta_{y_u}.$$
Define a ground cost $c(x,y)$, typically the Euclidean or cosine distance. The GWD is then the 1-Wasserstein (Earth Mover's) distance
$$W_1(\mu_1,\mu_2) = \min_{\pi \in \Pi(\mu_1,\mu_2)} \sum_{v \in V_1} \sum_{u \in V_2} \pi_{vu}\, c(x_v, y_u),$$
where $\Pi(\mu_1,\mu_2)$ enforces prescribed marginals (often uniform).
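As a minimal, self-contained sketch of this construction (toy embeddings standing in for WL or GNN features, which are not computed here), the exact 1-Wasserstein distance between two uniform empirical measures can be obtained from a small linear program:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def graph_wasserstein(X, Y):
    """Exact 1-Wasserstein distance between uniform empirical measures
    on node embeddings X (n1 x d) and Y (n2 x d), via a linear program."""
    n1, n2 = len(X), len(Y)
    C = cdist(X, Y)                        # ground cost: Euclidean distances
    A_eq, b_eq = [], []
    for i in range(n1):                    # row marginals: each sums to 1/n1
        row = np.zeros((n1, n2)); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(1.0 / n1)
    for j in range(n2):                    # column marginals: each sums to 1/n2
        col = np.zeros((n1, n2)); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(1.0 / n2)
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Hypothetical toy embeddings for two small graphs of different sizes
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(graph_wasserstein(X, X))   # identical clouds: 0 up to numerics
print(graph_wasserstein(X, Y))
```

Note that the two graphs may have different numbers of nodes; only the marginal masses change.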
This framework underpins several contemporary methods, including the Wasserstein Weisfeiler–Lehman kernel for attributed graphs, which achieves state-of-the-art classification and enables positive-definiteness for categorical labels (Togninalli et al., 2019).
1.2 Graphon and Structured Signal Approaches
Advanced GWD variants model each graph as a distribution of “graph signals” or higher-level descriptions. An influential method treats each undirected, weighted graph $G$ with Laplacian $L$ as supporting a zero-mean Gaussian on $\mathbb{R}^n$, with covariance the pseudoinverse $L^{\dagger}$. The GWD between such graphs is the closed-form 2-Wasserstein distance between Gaussians (Maretic et al., 2020):
$$W_2^2\big(\mathcal{N}(0, L_1^{\dagger}),\, \mathcal{N}(0, L_2^{\dagger})\big) = \operatorname{tr}\big(L_1^{\dagger}\big) + \operatorname{tr}\big(L_2^{\dagger}\big) - 2\operatorname{tr}\Big(\big(L_2^{\dagger/2}\, L_1^{\dagger}\, L_2^{\dagger/2}\big)^{1/2}\Big).$$
This lifts the comparison to smooth signal manifolds, capturing both spectral and structural information. Alignment between graphs of different sizes exploits soft assignment matrices subject to (relaxed) combinatorial constraints.
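A compact numerical sketch of this closed-form distance follows, assuming equal-sized graphs and omitting the alignment step; covariances are taken as Laplacian pseudoinverses:

```python
import numpy as np

def laplacian(A):
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

def psd_sqrt(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    M = (M + M.T) / 2.0
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)
    return (V * np.sqrt(w)) @ V.T

def gaussian_w2(L1, L2):
    """Closed-form 2-Wasserstein distance between zero-mean Gaussians
    with covariances L1^+ and L2^+ (Moore-Penrose pseudoinverses)."""
    S1, S2 = np.linalg.pinv(L1), np.linalg.pinv(L2)
    R2 = psd_sqrt(S2)
    cross = psd_sqrt(R2 @ S1 @ R2)
    return np.sqrt(max(np.trace(S1) + np.trace(S2) - 2 * np.trace(cross), 0.0))

# Two 4-node graphs: a path and a cycle (unit edge weights)
A_path  = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
A_cycle = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], float)
print(gaussian_w2(laplacian(A_path), laplacian(A_path)))   # ~0
print(gaussian_w2(laplacian(A_path), laplacian(A_cycle)))  # > 0
```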
1.3 Gromov–Wasserstein and Structural Metrics
The GW distance generalizes OT by comparing not the distributions per se, but intra-space pairwise relations (e.g., adjacency or shortest-path distances). For graphs represented as metric measure spaces $(X, d_X, \mu)$ and $(Y, d_Y, \nu)$:
$$GW_2^2(X, Y) = \min_{\pi \in \Pi(\mu,\nu)} \sum_{x,x'} \sum_{y,y'} \big| d_X(x,x') - d_Y(y,y') \big|^2\, \pi(x,y)\, \pi(x',y').$$
In GW, transport plans map mass between nodes in ways that optimally respect the graphs' internal structures. This formulation is naturally invariant to node permutations and supports comparison of non-aligned and differently sized graphs (Li et al., 2023, Li et al., 2022, Ponti, 2024).
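To make the permutation invariance concrete, here is a brute-force sketch that restricts couplings to scaled permutation matrices; this is exact for the tiny isomorphic graphs below and an upper bound on GW in general:

```python
import numpy as np
from itertools import permutations

def gw_permutation(C1, C2):
    """GW squared discrepancy between two equal-sized graphs with uniform
    node weights, minimizing over (scaled) permutation couplings only."""
    n = len(C1)
    best = np.inf
    for p in permutations(range(n)):
        P = np.array(p)
        # coupling pi_{i, p(i)} = 1/n, so the objective becomes
        # (1/n^2) * sum_{i,k} (C1_ik - C2_{p(i)p(k)})^2
        diff = C1 - C2[np.ix_(P, P)]
        best = min(best, (diff ** 2).sum() / n ** 2)
    return best

# Structure matrices (adjacency, for simplicity)
C1 = np.array([[0,1,1],[1,0,0],[1,0,0]], float)   # star centered at node 0
C2 = np.array([[0,0,1],[0,0,1],[1,1,0]], float)   # same star, relabeled
C3 = np.array([[0,1,1],[1,0,1],[1,1,0]], float)   # triangle
print(gw_permutation(C1, C2))   # 0: isomorphic graphs
print(gw_permutation(C1, C3))   # > 0
```

Practical solvers replace the factorial enumeration with continuous couplings and the optimization schemes of Section 3.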
2. Model Extensions: Feature Fusion, Relaxations, and Topology
2.1 Fused and Edge-Attributed Distances
GWDs have been extended to incorporate node and edge attributes through fused distances, most notably the Fused Network Gromov–Wasserstein (FNGW) distance. Here, the comparison cost fuses three terms: node-feature discrepancy, edge-feature discrepancy, and structural (pairwise) discrepancy (Yang et al., 2023), yielding an objective of the form
$$\min_{\pi \in \Pi(\mu,\nu)}\ (1-\alpha-\beta) \sum_{i,j} d\big(a_i, b_j\big)\, \pi_{ij} \;+\; \sum_{i,j,k,l} \Big[\, \alpha\, d_E\big(E^{(1)}_{ik}, E^{(2)}_{jl}\big) + \beta\, \big| C^{(1)}_{ik} - C^{(2)}_{jl} \big|^2 \Big]\, \pi_{ij}\, \pi_{kl},$$
where $a_i, b_j$ are node features, $E^{(\cdot)}$ edge features, and $C^{(\cdot)}$ structure matrices. Tuning $(\alpha, \beta)$ interpolates between structure-only, attribute-only, and mixed regimes. Such models admit efficient solvers via conditional gradient/Frank–Wolfe and yield barycenters (mean graphs) supporting edge and node features.
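The fusion can be illustrated by evaluating a fused objective for a fixed coupling. This is an FGW-style sketch with only a feature term and a structure term (the edge-feature component of FNGW would add a third, analogous contraction), not the exact published cost:

```python
import numpy as np

def fused_objective(pi, M, C1, C2, alpha):
    """Fused GW-type objective for a fixed coupling pi:
    (1 - alpha) * node-feature OT cost + alpha * structural GW cost."""
    feat = np.sum(pi * M)                                     # <pi, M>
    D = (C1[:, :, None, None] - C2[None, None, :, :]) ** 2    # D[i,k,j,l]
    struct = np.einsum('ij,kl,ikjl->', pi, pi, D)             # quadratic term
    return (1 - alpha) * feat + alpha * struct

n = 3
pi = np.full((n, n), 1 / n**2)          # product coupling, uniform marginals
M = np.random.default_rng(0).random((n, n))   # toy node-feature cost matrix
C1 = np.array([[0,1,1],[1,0,0],[1,0,0]], float)
C2 = np.array([[0,1,1],[1,0,1],[1,1,0]], float)
print(fused_objective(pi, M, C1, C2, 0.0))   # pure feature term
print(fused_objective(pi, M, C1, C2, 1.0))   # pure structure term
```

Because the objective is affine in `alpha` for fixed `pi`, sweeping the weight traces out the structure-only/attribute-only interpolation described above.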
2.2 Relaxed and Robust Formulations
Relaxations of marginal constraints (e.g., semi-relaxed GW (Vincent-Cuaz et al., 2021), robust GW (Kong et al., 2023), unbalanced variants) support flexible matching in the presence of unmatched nodes or outliers. Semi-relaxed GW fixes only the source marginal, allowing free selection of target subgraphs:
$$\mathrm{srGW}(C_1, \mu; C_2) = \min_{\pi \ge 0,\ \pi \mathbf{1} = \mu} \sum_{i,j,k,l} \big| C^{(1)}_{ik} - C^{(2)}_{jl} \big|^2\, \pi_{ij}\, \pi_{kl},$$
and enables subgraph selection and fast dictionary learning by leveraging row-wise decomposable solvers.
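The row-wise decomposability shows up in the linearized subproblem that conditional-gradient srGW solvers repeatedly solve: with only the source marginal constrained, minimizing a linear cost over couplings splits into independent per-row problems. A sketch, with `G` a hypothetical linearized gradient matrix:

```python
import numpy as np

def srgw_linear_step(G, mu):
    """Direction-finding step for semi-relaxed GW: minimize <pi, G> subject
    only to the source marginal pi @ 1 = mu (column marginals are free).
    Row-wise solution: each row places its full mass on a cheapest column."""
    n1, n2 = G.shape
    pi = np.zeros((n1, n2))
    pi[np.arange(n1), G.argmin(axis=1)] = mu
    return pi

G = np.array([[0.2, 0.9, 0.5],
              [0.7, 0.1, 0.8]])
mu = np.array([0.5, 0.5])
pi = srgw_linear_step(G, mu)
print(pi)              # mass lands on columns 0 and 1 only
print(pi.sum(axis=1))  # source marginal equals mu; columns unconstrained
```

Columns receiving no mass correspond to target nodes excluded from the matched subgraph.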
In robust GW, robustification is achieved by allowing marginal perturbations within KL-divergence balls and adding penalty terms, yielding significantly improved resilience to outliers or missing data in matching and alignment tasks (Kong et al., 2023).
2.3 Topological and Signal-Theoretic Variants
GWD methodology extends to the comparison of topological structures, including Reeb graphs, by equipping graphs with pairwise intrinsic metrics and probability measures derived from persistence images. The Reeb–GW distance compares decorated graphs and is proven stable under small perturbations of underlying scalar fields (Chambers et al., 1 Jul 2025).
3. Algorithms and Scalability
3.1 Core Optimization Problems
The backbone of GWD computation is solving large-scale, typically non-convex quadratic programs in the coupling (transport) matrix $\pi$. Algorithms include:
- Entropic regularization and Sinkhorn scaling for fast, GPU-amenable computations (Li et al., 2022, Li et al., 2022).
- Frank–Wolfe and conditional gradient methods for efficiently handling linearized subproblems in structure-aware costs.
- Proximal/Bregman alternating projection (BAPG, hBPG) schemes guaranteeing convergence and maintaining feasibility or trading it for speed, particularly under the Luo-Tseng error bound (Li et al., 2023, Li et al., 2022).
A common computational bottleneck is the $O(n^4)$ scaling of dense GW solvers, due to four-way cost tensor contractions (reducible to $O(n^3)$ for decomposable losses such as the squared loss). Recent advances such as Spar–GW (Li et al., 2022) employ Monte Carlo importance sampling to sparsify these contractions, enabling practical computation on graphs with thousands of nodes.
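The matrix-product contraction at the heart of entropic/proximal GW solvers can be sketched as follows (squared loss; a simplified mirror-descent loop with inner Sinkhorn projections, not any specific published implementation):

```python
import numpy as np

def entropic_gw(C1, C2, p, q, eps=0.1, outer=100, inner=50):
    """Entropic/proximal Gromov-Wasserstein solver (squared loss).
    The four-way tensor contraction collapses to matrix products,
    const - 2 * C1 @ pi @ C2.T, i.e. O(n^3) work instead of O(n^4)."""
    pi = np.outer(p, q)                              # product-coupling init
    const = np.outer((C1 ** 2) @ p, np.ones_like(q)) \
          + np.outer(np.ones_like(p), (C2 ** 2) @ q)
    for _ in range(outer):
        grad = const - 2.0 * C1 @ pi @ C2.T          # linearized cost L (x) pi
        K = pi * np.exp(-grad / eps)                 # KL-proximal kernel
        u = np.ones_like(p)
        for _ in range(inner):                       # Sinkhorn: restore marginals
            v = q / (K.T @ u)
            u = p / (K @ v)
        pi = u[:, None] * K * v[None, :]
    cost = float(np.sum((const - 2.0 * C1 @ pi @ C2.T) * pi))
    return cost, pi

# Path graph vs star graph on 4 nodes, adjacency as structure matrices
C_path = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
C_star = np.array([[0,1,1,1],[1,0,0,0],[1,0,0,0],[1,0,0,0]], float)
p = q = np.full(4, 0.25)
cost, pi = entropic_gw(C_path, C_star, p, q)
print(cost)
```

The single `C1 @ pi @ C2.T` product per iteration is exactly the contraction that Spar–GW further sparsifies by importance sampling.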
3.2 Stochastic and Variational Strategies
For non-convex alignments and soft assignments, stochastic gradient descent with variational reparametrizations and entropy-regularized Dykstra operators is employed. This is crucial for escaping poor local minima in one-to-many alignments and for large graphs (Maretic et al., 2020).
3.3 Multi-Marginal and Barycenter Computations
GW-based barycenter computation and multi-graph matching generalize two-graph distances to multi-marginal problems, supporting operations such as graph data augmentation, clustering, and transfer (Beier et al., 2022, Ponti, 2024). Alternating minimization and multi-marginal Sinkhorn updates are used to maintain practical scaling.
4. Empirical Results and Applicability
4.1 Classification and Clustering
Across graph classification benchmarks (MUTAG, PTC, ENZYMES, IMDB, BZR, COX2, DHFR, ER), GWD and its variants consistently outperform or match state-of-the-art kernels, classic matching algorithms, and even deep learning baselines (e.g., GNNs) (Togninalli et al., 2019, Maretic et al., 2020, Huang et al., 2020, Yang et al., 2023). FNGW, FLCS, and probabilistic node-embedding GWDs show especially high discriminative power and efficiency with appropriate regularization.
4.2 Alignment, Completion, and Data Augmentation
One-to-many GWD alignment robustly handles distorted graphs, varying graph sizes, and non-aligned communities, outperforming GW and baselines in alignment and community recovery under severe distortions (Maretic et al., 2020). GW barycenters enable non-Euclidean data augmentation, graphon estimation, and transfer, empirically boosting downstream classifier performance (Ponti, 2024). srGW and robust GW achieve efficient, interpretable subgraph dictionary learning, completion, and partitioning, scaling to large datasets and outperforming classical and recent graph dictionary learners (Vincent-Cuaz et al., 2021, Kong et al., 2023).
4.3 Cross-modal and Topological Tasks
Fused GWDs are leveraged in cross-modal representation learning for tasks like speech recognition, outperforming previous OT-based approaches by incorporating both node and structural alignment (Lu et al., 19 May 2025). Topological GWDs for Reeb graph comparison provide stable, topologically informed metrics with state-of-the-art shape retrieval performance (Chambers et al., 1 Jul 2025).
5. Theoretical Properties and Limit Laws
5.1 Metric Invariance and Stability
GWD, GW, and their variants are invariant under isometric transformations, permutation of node indices, and alignment-preserving relabelings. For cases based purely on structural metrics, exact zero distance is achieved if and only if there is a measure-preserving isometry or isomorphism between the compared objects (Rioux et al., 2024, Ponti, 2024).
The stability of these distances with respect to perturbations in the graph (node/edge removal, feature noise) is established, especially for Reeb GW and robust variants (Chambers et al., 1 Jul 2025, Kong et al., 2023). Recent advances provide finite-sample convergence, limit laws, and estimators for GW in the discrete and entropic settings, supporting valid statistical inference and hypothesis tests such as graph isomorphism testing (Rioux et al., 2024).
5.2 Complexity and Limitations
While exact GWD computation is infeasible for large graphs due to quartic scaling, randomized approximations (Spar–GW), scalable Sinkhorn solvers, and decomposable relaxations (srGW, FNGW, BAPG/hBPG) have reduced practical complexity substantially. A trade-off between alignment sharpness and speed is observed: more relaxed methods (single-loop BAPG, stochastic approaches) favor large-scale applications with tolerable marginal infeasibilities, while double-loop or hybrid solvers are reserved for exact correspondence tasks (Li et al., 2022, Li et al., 2023).
6. Current Challenges and Future Directions
Research challenges include designing faster GW/BAPG-type solvers for extremely large graphs, developing theoretical understanding of kernel definiteness in continuous-attribute settings, and integrating structural/topological GWDs directly into end-to-end neural architectures. Recent work also points toward kernelizing srGW for deep learning, incorporating edge-feature fusion in srGW and FNGW, studying approximation errors between relaxed and standard GW, and extending statistical theory (variance, limit laws) for GWD-based estimators (Vincent-Cuaz et al., 2021, Rioux et al., 2024, Yang et al., 2023).
Emerging applications involve multi-modal and multi-marginal GWD for transfer learning, multi-graph barycenters, and generative modeling; topological variants for shape and manifold learning; and robust, outlier-resilient GW for partially observed or corrupted data.
Key references:
- Wasserstein-based Graph Alignment (Maretic et al., 2020)
- Exploiting Edge Features in Graphs with Fused Network Gromov–Wasserstein Distance (Yang et al., 2023)
- Semi-relaxed Gromov–Wasserstein divergence with applications on graphs (Vincent-Cuaz et al., 2021)
- LCS Graph Kernel Based on Wasserstein Distance in Longest Common Subsequence Metric Space (Huang et al., 2020)
- Wasserstein Weisfeiler-Lehman Graph Kernels (Togninalli et al., 2019)
- A Stable and Theoretically Grounded Gromov–Wasserstein Distance for Reeb Graph Comparison using Persistence Images (Chambers et al., 1 Jul 2025)
- Efficient Approximation of Gromov-Wasserstein Distance Using Importance Sparsification (Li et al., 2022)
- A Convergent Single-Loop Algorithm for Relaxation of Gromov-Wasserstein in Graph Data (Li et al., 2023)
- Multi-Marginal Gromov-Wasserstein Transport and Barycenters (Beier et al., 2022)
- Outlier-Robust Gromov-Wasserstein for Graph Data (Kong et al., 2023)
- Limit Laws for Gromov–Wasserstein Alignment with Applications to Testing Graph Isomorphisms (Rioux et al., 2024)
- Graph data augmentation with Gromov-Wasserstein Barycenters (Ponti, 2024)
- Fast and Provably Convergent Algorithms for Gromov–Wasserstein in Graph Data (Li et al., 2022)