Graph Optimal Transport Loss
- Graph Optimal Transport Loss is a framework that extends optimal transport theory to graphs by leveraging both node features and topology for precise comparisons.
- It enables practical tasks such as GNN fine-tuning, graph matching, and cross-domain alignment through enforced structural and distributional consistency.
- Algorithmic approaches such as Sinkhorn iterations and amortized OT solvers keep computation tractable while preserving guarantees such as permutation invariance and differentiability.
Graph Optimal Transport (GOT) Loss is a class of loss functions and algorithmic frameworks that extend optimal transport (OT) theory to structured data living on graphs, enabling rigorous comparison, alignment, regularization, and prediction tasks in graph-based machine learning. By leveraging both node-level features and graph topology, GOT losses provide a principled means of measuring or enforcing distributional and structural alignment between graphs, graph-encoded signals, or learned representations. The development of GOT has led to significant gains in areas such as graph neural network (GNN) fine-tuning, graph matching, cross-domain entity alignment, supervised and self-supervised graph learning, and efficient amortized OT plan prediction.
1. Mathematical Formulations of Graph Optimal Transport
A unifying feature of GOT losses is their grounding in rigorous OT with explicit incorporation of graph structure. This is achieved via node- and edge-level cost functions, constraints that reflect the graph’s adjacency or metric, and in many cases entropic or other regularization for computational tractability.
Node and Structure-Aware OT
Let $G=(V,E)$ be a graph with $n$ nodes, adjacency matrix $A$, and possibly node features $X$. For two sets of node embeddings (e.g., pre-trained $Z^{\mathrm{pre}}$ and fine-tuned $Z^{\mathrm{ft}}$), local GOT can be formulated as:
- Measures: $\mu = \sum_{i=1}^{n} \mu_i\, \delta_{z_i^{\mathrm{pre}}}$, $\nu = \sum_{j=1}^{n} \nu_j\, \delta_{z_j^{\mathrm{ft}}}$ (typically with uniform weights)
- Local cost (e.g., half-cosine): $C_{ij} = \tfrac{1}{2}\bigl(1 - \cos(z_i^{\mathrm{pre}}, z_j^{\mathrm{ft}})\bigr)$
- Mask: only permit transport between adjacent or identical nodes, i.e., $M_{ij} = 1$ iff $A_{ij} \neq 0$ or $i = j$
- Masked coupling constraint: $\Pi_M(\mu,\nu) = \{\,T \ge 0 : T\mathbf{1} = \mu,\ T^{\top}\mathbf{1} = \nu,\ T_{ij} = 0 \text{ whenever } M_{ij} = 0\,\}$
The GOT loss is then
$$\mathcal{L}_{\mathrm{GOT}} = \min_{T \in \Pi_M(\mu,\nu)} \langle T, C\rangle - \varepsilon H(T),$$
with $H(T) = -\sum_{ij} T_{ij}(\log T_{ij} - 1)$ and $\varepsilon > 0$ for Sinkhorn regularization (Zhang et al., 2022).
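A minimal NumPy sketch of this masked entropic problem is given below; it follows the half-cosine cost and adjacency-plus-self-loop mask defined above, but it is an illustration under those assumptions rather than the reference implementation of (Zhang et al., 2022), and the function names are placeholders.

```python
import numpy as np

def half_cosine_cost(Z_pre, Z_ft):
    # C_ij = (1 - cos(z_i_pre, z_j_ft)) / 2, as in the local cost above.
    Zp = Z_pre / np.linalg.norm(Z_pre, axis=1, keepdims=True)
    Zf = Z_ft / np.linalg.norm(Z_ft, axis=1, keepdims=True)
    return 0.5 * (1.0 - Zp @ Zf.T)

def masked_sinkhorn_got(Z_pre, Z_ft, A, eps=0.05, n_iters=200):
    """Entropic OT restricted to adjacent (or identical) node pairs."""
    n = A.shape[0]
    M = ((A != 0) | np.eye(n, dtype=bool)).astype(float)  # mask: adjacency + self-loops
    C = half_cosine_cost(Z_pre, Z_ft)
    mu = np.full(n, 1.0 / n)                 # uniform node weights
    nu = np.full(n, 1.0 / n)
    K = np.exp(-C / eps) * M                 # masked Gibbs kernel; forbidden entries stay zero
    u, v = np.ones(n), np.ones(n)
    for _ in range(n_iters):                 # standard Sinkhorn scaling iterations
        u = mu / np.maximum(K @ v, 1e-30)
        v = nu / np.maximum(K.T @ u, 1e-30)
    T = u[:, None] * K * v[None, :]          # masked transport plan
    return np.sum(T * C), T                  # GOT loss value and plan
```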
Cross-Domain and Fused Structure-Feature Losses
In cross-domain or multi-graph scenarios, GOT is generalized as a weighted combination of a node-level Wasserstein term and a structure-level Gromov–Wasserstein term,
$$\mathcal{L}_{\mathrm{GOT}} = \lambda\, \mathcal{D}_{\mathrm{W}}(\mu, \nu) + (1-\lambda)\, \mathcal{D}_{\mathrm{GW}}(\mu, \nu),$$
where $\mathcal{D}_{\mathrm{W}}$ is driven by a node-level cost and $\mathcal{D}_{\mathrm{GW}}$ measures discrepancies between intra-graph structural relations (Chen et al., 2020).
Fused Gromov–Wasserstein (FGW) and extensions such as Partially-Masked FGW and fused unbalanced GW cover settings with distinct feature and structure terms, marginal penalties, and partial maskings for variable graph sizes (Krzakala et al., 2024, Mazelet et al., 21 May 2025).
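For the fused feature-plus-structure objectives, off-the-shelf solvers can be applied directly. The sketch below assumes the POT library (imported as `ot`), uniform node weights, and illustrative inputs `X1, X2` (node features) and `C1, C2` (intra-graph structure matrices); it computes a plain FGW value rather than the partially-masked or unbalanced variants of the cited works.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

def fused_got_loss(X1, C1, X2, C2, alpha=0.5):
    """FGW value blending a node-feature cost with a structural (GW) discrepancy.

    X1, X2: node feature matrices; C1, C2: intra-graph structure matrices
    (e.g., adjacency- or shortest-path-based); alpha weights the structural term.
    """
    p = np.full(X1.shape[0], 1.0 / X1.shape[0])   # uniform node weights
    q = np.full(X2.shape[0], 1.0 / X2.shape[0])
    M = ot.dist(X1, X2)                           # inter-graph node-level feature cost
    return ot.gromov.fused_gromov_wasserstein2(M, C1, C2, p, q, alpha=alpha)
```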
Graph Signal and Filter OT
Some GOT losses compare graphs via the laws of smooth signals on the graph, leading to closed-form 2-Wasserstein distances between zero-mean Gaussians whose covariances derive from the pseudoinverse Laplacian (or a graph filter):
$$W_2^2\bigl(\mathcal{N}(0,\Sigma_1),\, \mathcal{N}(0,\Sigma_2)\bigr) = \operatorname{tr}(\Sigma_1) + \operatorname{tr}(\Sigma_2) - 2\operatorname{tr}\Bigl(\bigl(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\bigr)^{1/2}\Bigr),$$
with $\Sigma = L^{\dagger}$ or $\Sigma = g(L)\,g(L)^{\top}$ for some graph filter $g$ (Maretic et al., 2019, Maretic et al., 2021).
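When a node correspondence is fixed (the alignment step of the cited methods is omitted), this distance is computable in closed form from the Laplacian pseudoinverses, as in the following NumPy/SciPy sketch with hypothetical function names:

```python
import numpy as np
from scipy.linalg import sqrtm

def graph_signal_w2_sq(A1, A2):
    """Squared 2-Wasserstein distance between N(0, L1^+) and N(0, L2^+).

    Assumes the two graphs share a node ordering; the cited methods additionally
    optimize over node alignments, which this sketch does not model.
    """
    L1 = np.diag(A1.sum(axis=1)) - A1          # combinatorial Laplacians
    L2 = np.diag(A2.sum(axis=1)) - A2
    S1 = np.linalg.pinv(L1)                    # covariance of smooth signals on graph 1
    S2 = np.linalg.pinv(L2)
    S2_half = sqrtm(S2)
    cross = sqrtm(S2_half @ S1 @ S2_half)      # Bures cross term
    return float(np.trace(S1) + np.trace(S2) - 2.0 * np.real(np.trace(cross)))
```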
2. Optimization and Algorithmic Procedures
Sinkhorn and Conditional Gradient Algorithms
GOT optimization often relies on entropic regularization and variants of the Sinkhorn–Knopp algorithm, whose mask-aware forms efficiently handle edge constraints and yield sparse plans at $\mathcal{O}(E)$ cost per iteration for graphs with $E$ edges. For quadratic assignments (e.g., GW or FGW), block-coordinate Frank–Wolfe procedures are standard, with each step involving a linearized cost computation and a linear OT solve (Zhang et al., 2022, Krzakala et al., 2024).
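A conditional-gradient sketch for the square-loss GW term illustrates this linearize-then-solve pattern; it assumes symmetric structure matrices `C1`, `C2`, uses POT's exact linear solver for the direction step, and is illustrative rather than any cited implementation:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

def gw_conditional_gradient(C1, C2, p, q, n_iters=50):
    """Frank-Wolfe for square-loss Gromov-Wasserstein (C1, C2 assumed symmetric)."""
    T = np.outer(p, q)                                   # feasible initial coupling
    const = np.outer((C1 ** 2) @ p, np.ones_like(q)) \
          + np.outer(np.ones_like(p), (C2 ** 2) @ q)     # constant part of the linearization
    for k in range(n_iters):
        grad = 2.0 * (const - 2.0 * C1 @ T @ C2.T)       # gradient of the GW objective at T
        D = ot.emd(p, q, grad)                           # linear OT solve: descent direction
        tau = 2.0 / (k + 2.0)                            # classic Frank-Wolfe step size
        T = (1.0 - tau) * T + tau * D                    # convex update keeps marginals exact
    return T
```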
Fast Amortized and Deep OT Solvers
Learned surrogates, such as the ULOT architecture, amortize classical FGW or unbalanced GW optimization, producing transport plans in a single forward pass through a fixed number of network layers, with minimal loss in plan accuracy and substantial speedup over iterative classical solvers (Mazelet et al., 21 May 2025).
Mirror Descent and Bayesian Relaxation
For spectral or filter-based GOT distances involving alignment, mirror gradient descent and stochastic Bayesian exploration provide practical routes to optimization, with Sinkhorn projections and entropy-based step sizing facilitating convergence even in non-convex settings (Maretic et al., 2019, Maretic et al., 2021).
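A generic sketch of this pattern, multiplicative (KL-geometry) updates on the coupling followed by a Sinkhorn projection back onto the transport polytope, is shown below; the objective gradient `grad_fn` is left abstract, since the cited spectral and filter distances additionally optimize an alignment that is not modeled here:

```python
import numpy as np

def sinkhorn_projection(K, mu, nu, n_iters=200):
    # KL projection of a positive matrix onto the transport polytope U(mu, nu).
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / np.maximum(K @ v, 1e-30)
        v = nu / np.maximum(K.T @ u, 1e-30)
    return u[:, None] * K * v[None, :]

def mirror_descent_coupling(grad_fn, mu, nu, step=1.0, n_iters=100):
    """Entropic mirror descent over couplings for a (possibly non-convex) objective.

    grad_fn(T) must return the gradient of the objective at the coupling T.
    """
    T = np.outer(mu, nu)                              # product coupling as a feasible start
    for _ in range(n_iters):
        G = grad_fn(T)
        K = T * np.exp(-step * (G - G.max()))         # multiplicative update (stabilized)
        T = sinkhorn_projection(K, mu, nu)            # restore the marginal constraints
    return T
```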
Dual and Newton Methods
Quadratically regularized graph OT losses are efficiently addressed via dual Newton-type algorithms that exploit the Laplacian structure of the flow-conservation-constrained QP, with per-iteration cost dominated by a sparse, Laplacian-structured linear solve (Essid et al., 2017).
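The following sketch conveys the dual Newton idea for quadratically regularized transport on a graph, in the spirit of (Essid et al., 2017) but not their exact algorithm: edge flows are recovered from node potentials via soft-thresholding, and each damped Newton step solves a Laplacian-structured system over the currently active edges (a small ridge handles the Laplacian null space):

```python
import numpy as np

def shrink(s, w):
    # Soft-thresholding induced by the |J_e| term of the per-edge cost.
    return np.sign(s) * np.maximum(np.abs(s) - w, 0.0)

def dual_value(phi, D, b, w, alpha):
    # Concave dual of: min_J  sum_e w_e|J_e| + (alpha/2) J_e^2   s.t.  D J = b.
    return phi @ b - np.sum(shrink(D.T @ phi, w) ** 2) / (2.0 * alpha)

def quad_reg_graph_ot(D, b, w, alpha=1.0, n_iters=50, ridge=1e-6):
    """Damped dual Newton sketch. D: (nodes x edges) signed incidence matrix,
    b: net supply/demand per node (sums to zero), w: nonnegative per-edge costs."""
    n = D.shape[0]
    phi = np.zeros(n)                                    # dual potentials on nodes
    for _ in range(n_iters):
        s = D.T @ phi
        J = shrink(s, w) / alpha                         # primal flows implied by phi
        grad = b - D @ J                                 # conservation residual = dual gradient
        active = (np.abs(s) > w).astype(float) / alpha   # edges currently carrying flow
        H = D @ (active[:, None] * D.T) + ridge * np.eye(n)  # Laplacian-structured Hessian
        step = np.linalg.solve(H, grad)
        t, g0 = 1.0, dual_value(phi, D, b, w, alpha)
        while dual_value(phi + t * step, D, b, w, alpha) <= g0 and t > 1e-10:
            t *= 0.5                                     # backtracking keeps the ascent stable
        phi = phi + t * step
    return shrink(D.T @ phi, w) / alpha                  # optimal edge flows
```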
3. Integration into Learning Objectives
GOT losses act as regularizers or primary loss terms in a wide spectrum of graph learning pipelines:
- Fine-tuning GNNs: GOT regularization bridges the pre-training and fine-tuning stages by enforcing local alignment of node embeddings as determined by graph structure. The GTOT regularizer is added, with hyperparameter $\lambda$, to the core task loss: $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\, \mathcal{L}_{\mathrm{GTOT}}$ (see the training-loop sketch after this list).
- Cross-domain/multi-modal alignment: GOT (node and structure aware) loss is added as an auxiliary term, enforcing both node-wise and relation-wise consistency between dynamic entity graphs during training of retrieval, captioning, or translation models (Chen et al., 2020).
- Supervised graph prediction: In frameworks such as Any2Graph, permutation-invariant PM-FGW loss ties predicted graph outputs to reference graphs, handling variable node counts, padding, and unordered representations (Krzakala et al., 2024).
- Self-supervised and contrastive learning: GOT plans are computed between branch and aggregated meta-path views in heterogeneous graphs to enable augmentation-free self-supervised training (Liu et al., 3 Jun 2025).
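As referenced in the fine-tuning item above, the regularized objective drops into a standard training loop. The sketch below assumes a PyTorch setting; `embed`, `predict`, `gtot_loss`, and the batch fields `adj`/`y` are user-supplied placeholders, not a specific library API:

```python
import torch

def fine_tune_with_gtot(model, pretrained_model, loader, embed, predict, gtot_loss,
                        lam=0.1, lr=1e-3, epochs=10):
    """L = L_task + lam * L_GTOT; `embed`, `predict`, and `gtot_loss` are user-supplied
    callables (placeholders), and each batch exposes `.adj` (adjacency) and `.y` (labels)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            z_ft = embed(model, batch)                     # fine-tuned node embeddings
            with torch.no_grad():
                z_pre = embed(pretrained_model, batch)     # frozen pre-trained embeddings
            task_loss = torch.nn.functional.binary_cross_entropy_with_logits(
                predict(model, batch), batch.y.float())    # core downstream objective
            reg = gtot_loss(z_pre, z_ft, batch.adj)        # masked OT alignment term
            loss = task_loss + lam * reg                   # GTOT-regularized objective
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```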
4. Empirical Observations and Hyperparameter Effects
Empirical results consistently reveal that GOT-based regularization and loss terms deliver improvements over classical pairwise or node-wise supervision:
- GTOT-Tuning achieves 1–3 points average ROC-AUC gain over standard fine-tuning baselines (L2-SP, DELTA, BSS) on molecular property prediction, with the largest improvement under label scarcity (Zhang et al., 2022).
- In scene graph generation, GOT loss improves mean Recall@K, with gains concentrated on rare predicates due to OT’s soft class similarity encoding (Kurita et al., 2023).
- GOT mitigates negative transfer in multi-task settings and dynamically adapts penalization according to domain gaps (Zhang et al., 2022).
- Mask ablations confirm that enforcing graph-local transport (adjacency-limited) is critical for maximal downstream accuracy, with all-ones masks reducing to standard Wasserstein distances and decreasing performance (Zhang et al., 2022).
- Amortized OT solvers (ULOT) for FUGW achieve 10–100-fold speedups per alignment with sub-5% relative FUGW-loss error versus block-coordinate classical solvers, enabling scalable application to graphs with up to 1000 nodes (Mazelet et al., 21 May 2025).
Key hyperparameters include the entropic or quadratic regularization strength ($\varepsilon$), balancing weights ($\lambda$), the OT plan mask (e.g., adjacency powers $A^k$), and early-stopping or Sinkhorn iteration counts. Oversized hyperparameters (e.g., $\lambda$ set too large) may dull the impact of the primary task loss.
5. Theoretical Guarantees and Structural Properties
GOT losses are constructed to guarantee permutation invariance, sub-differentiability, and stability:
- Permutation invariance follows by construction, as all loss formulations depend only on cost matrices, padded/masked representations, and the OT plan, all of which transform consistently under node re-ordering (Krzakala et al., 2024).
- Metric properties (identity, symmetry, triangle inequality) are formally satisfied on isomorphism classes or appropriate relaxed spaces (Maretic et al., 2019, Chen et al., 2020, Maretic et al., 2021).
- For graph signal-based losses, the distance is sensitive to Laplacian spectrum, aligning graphs by their low-frequency eigenspaces, contrasting with pure structural (GW) approaches (Dong et al., 2020).
- Envelope theorem guarantees backpropagation validity: the gradient of the minimum GOT value with respect to its input distribution or representation parameters is given by the solution's optimal dual variables, ensuring seamless neural integration (Zhang et al., 2022, Essid et al., 2017); a small numerical illustration follows this list.
- The use of entropic or quadratic regularization renders the respective optimization strongly convex and differentiable, providing robustness and rapid convergence for both primal and dual solvers (Essid et al., 2017).
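A companion statement to the dual-variable result referenced above is that the gradient of the entropically regularized OT value with respect to the cost matrix is the optimal plan itself; the small NumPy check below (illustrative toy data) verifies this by finite differences:

```python
import numpy as np

def sinkhorn_plan(C, mu, nu, eps=0.1, n_iters=2000):
    # Optimal plan of  min <T,C> - eps*H(T)  over the transport polytope U(mu, nu).
    K = np.exp(-C / eps)
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]

def entropic_ot_value(C, mu, nu, eps=0.1):
    # <T,C> - eps*H(T), with H(T) = -sum T (log T - 1), evaluated at the optimal plan.
    T = sinkhorn_plan(C, mu, nu, eps)
    return np.sum(T * C) + eps * np.sum(T * (np.log(T) - 1.0))

rng = np.random.default_rng(0)
C = rng.random((4, 5))
mu, nu = np.full(4, 0.25), np.full(5, 0.2)

T = sinkhorn_plan(C, mu, nu)
h = 1e-5
C_pert = C.copy()
C_pert[1, 2] += h
fd_grad = (entropic_ot_value(C_pert, mu, nu) - entropic_ot_value(C, mu, nu)) / h
print(fd_grad, "~", T[1, 2])   # envelope theorem: d(value)/dC_ij equals T*_ij
```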
6. Applications and Extensions
GOT losses are applied across the following domains:
- Fine-tuning of GNNs in low-label regimes, preserving structural knowledge across pre-training and adaptation (Zhang et al., 2022).
- Cross-domain and cross-modal alignment for retrieval, visual question answering, captioning, translation, and summarization, establishing interpretable, sparse transport plans eminently suitable for downstream selection or visualization (Chen et al., 2020).
- Graph matching and alignment via quadratic assignment/OT relaxations, achieving faster and sometimes superior alignment accuracy than classical Hungarian-based methods (FAQ), especially for large-scale graphs (Saad-Eldin et al., 2021).
- Permutation-invariant supervised graph prediction, as in Any2Graph and PM-FGW-based decoders (Krzakala et al., 2024).
- Self-supervised and unsupervised learning on heterogeneous graphs, where GOT plan alignment enables superior representation transfer without reliance on augmentations or hand-tuned sample selection (Liu et al., 3 Jun 2025).
- Graph comparison and classification via Laplacian-based or filter-based signal distances, enabling alignment-insensitive and globally sensitive metrics (Maretic et al., 2019, Maretic et al., 2021).
7. Limitations and Computational Considerations
While GOT frameworks are highly flexible and empirically robust, certain challenges remain:
- Full Gromov–Wasserstein or quadratic transport problems are computationally intensive for large graphs (scaling as $\mathcal{O}(n^4)$ in the naive implementation), though entropic regularization, fast Sinkhorn, and deep amortized solvers (ULOT) significantly mitigate this (Mazelet et al., 21 May 2025, Krzakala et al., 2024).
- Non-convexity in unregularized GW/FGW can limit optimality; stochastic or mirror descent and hybrid relaxation are used in practice (Maretic et al., 2021).
- Hyperparameter tuning is nontrivial and central to balancing transport, task, and regularization effects.
- On extremely large or dense graphs, memory and runtime limits may necessitate sub-sampling, multi-scale coarsening, or linearization of quadratic terms.
Empirical evidence demonstrates that these limitations are surmountable for graphs with up to several thousand nodes, particularly when leveraging entropic regularization or amortized architectures.
Graph Optimal Transport Loss constitutes a cornerstone of modern graph-based learning and comparison, offering mathematically principled, computationally effective, and structurally grounded tools for the gamut of graph analysis, learning, and signal processing tasks (Zhang et al., 2022, Krzakala et al., 2024, Mazelet et al., 21 May 2025, Chen et al., 2020, Maretic et al., 2019).