
Optimal Transport Semi-Supervised Learning

Updated 12 December 2025
  • Optimal transport semi-supervised learning is a framework that leverages both labeled and unlabeled data by constructing optimal couplings to enhance label propagation and domain adaptation.
  • It uses entropic and quadratic regularization to build adaptive affinity graphs that facilitate effective pseudo-labeling and scalable optimization.
  • Empirical results and theoretical guarantees demonstrate improved classification accuracy and robustness compared to traditional semi-supervised methods.

Optimal transport (OT) based semi-supervised approaches utilize the theory and algorithms of optimal transport to leverage both labeled and unlabeled data for improved machine learning performance, particularly in regimes with limited labeled samples. These methods exploit the geometry of the data manifold, construct global affinity structures, propagate labels under certainty control, and permit modular integration with deep learning architectures. The resulting frameworks enable adaptive neighborhood construction, robust pseudo-labeling, semantic distribution matching, and effective domain adaptation. This article systematically surveys the key methodology, algorithms, mathematical underpinnings, practical implementations, and empirical evidence for OT-based semi-supervised learning methods.

1. Mathematical Foundations of OT in Semi-Supervised Learning

A core principle of OT-based semi-supervised learning is the construction of couplings (transport plans) between empirical measures associated with labeled and unlabeled datasets. These plans are computed by minimizing a ground cost between data points, often regularized to ensure tractability, sparsity, or robustness. Classical formulations include the Kantorovich problem $\min_{\gamma\in U(a,b)} \langle \gamma, C \rangle$, where $C_{ij}$ encodes pairwise ground costs (typically squared Euclidean) and $U(a,b)$ enforces the prescribed marginals of the labeled and unlabeled distributions. Regularization (via entropy for Sinkhorn solvers, quadratic penalty for hinge-type sparsification, or other means) enables both scalable inference and explicit control over structural features of the induced affinity matrix (Hamri et al., 2021, Matsumoto et al., 2022, Hamri et al., 2021).
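As a concrete illustration of the entropically regularized Kantorovich problem above, the following minimal NumPy sketch computes a Sinkhorn coupling between small labeled and unlabeled point clouds; the uniform marginals, the regularization strength `eps`, and the iteration count are illustrative choices rather than values prescribed by the cited papers.

```python
import numpy as np

def sinkhorn_coupling(X_lab, X_unl, eps=0.1, n_iter=500):
    """Entropically regularized OT plan between two empirical measures.

    Approximately solves  min_{gamma in U(a,b)} <gamma, C> - eps * H(gamma)
    with squared Euclidean ground cost and uniform marginals a, b.
    """
    a = np.full(len(X_lab), 1.0 / len(X_lab))     # marginal over labeled points
    b = np.full(len(X_unl), 1.0 / len(X_unl))     # marginal over unlabeled points
    C = ((X_lab[:, None, :] - X_unl[None, :, :]) ** 2).sum(-1)  # ground cost C_ij
    K = np.exp(-C / eps)                          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):                       # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]            # transport plan gamma

# Toy usage: 5 labeled and 8 unlabeled points in two dimensions
rng = np.random.default_rng(0)
gamma = sinkhorn_coupling(rng.normal(size=(5, 2)), rng.normal(size=(8, 2)))
print(gamma.shape, gamma.sum())                   # (5, 8), total mass ~1.0
```

Row- or column-normalizing such a plan yields the affinity weights used by the propagation schemes described in the next section.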

Such affinity matrices can be interpreted as label-propagation graphs, pseudo-labeling mechanisms, or more generally as modules that regularize neural models, enabling propagation of label information and structure across the data manifold.

2. Graph-Based OT and Label Propagation Frameworks

Several foundational approaches exploit OT to build affinity graphs used for transductive or inductive semi-supervised learning. The workflow commonly proceeds as follows (a minimal sketch of steps 2–3 appears after the list):

  1. Affinity Construction: Compute an OT plan $T^*$ between labeled and unlabeled point clouds, leading to affinity weights $W_{ij}$ via normalization (typically column or row stochastic) (Hamri et al., 2021, Hamri et al., 2021).
  2. Soft/Hard Label Assignment: Assign soft pseudo-label probabilities to unlabeled points based on affinity-weighted aggregation from labeled points, followed by hard label selection via $\arg\max$.
  3. Certainty Control and Iterative Enrichment: Use entropy-based certainty measures per unlabeled node (e.g., normalized entropy of the soft label distribution) to select only highly confident pseudo-labels per iteration. At each step, newly pseudo-labeled nodes are incorporated into the labeled set, and the process repeats, reducing uncertainty and ensuring finite convergence (Hamri et al., 2021, Hamri et al., 2021).
  4. Inductive Extension: For out-of-sample prediction (inductive semi-supervised learning), a new point is coupled against the existing (labeled + pseudo-labeled) set via OT, and the label is assigned by affinity-weighted voting. This is formalized in the Optimal Transport Induction (OTI) method (Hamri et al., 2021).
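The following sketch illustrates steps 2 and 3 under the simplifying assumption that the OT plan has already been normalized into an affinity matrix `W` between labeled and unlabeled points; the confidence threshold `tau`, the helper's name, and the normalized-entropy criterion follow the description above but are illustrative rather than taken from the cited implementations.

```python
import numpy as np

def pseudo_label_with_certainty(W, Y_lab, tau=0.2):
    """Affinity-weighted soft labels, accepted only when sufficiently certain.

    W     : (n_lab, n_unl) affinities from a normalized OT plan
    Y_lab : (n_lab, n_cls) one-hot labels of the currently labeled points
    tau   : normalized-entropy threshold below which a pseudo-label is accepted
    """
    # Soft label distribution of each unlabeled point (step 2)
    P = (W / W.sum(axis=0, keepdims=True)).T @ Y_lab          # (n_unl, n_cls)
    # Normalized entropy in [0, 1]: 0 = fully certain, 1 = uniform (step 3)
    ent = -(P * np.log(P + 1e-12)).sum(axis=1) / np.log(Y_lab.shape[1])
    return P.argmax(axis=1), ent < tau                        # hard labels, confidence mask

# In the iterative scheme, accepted points are moved into the labeled set,
# the OT plan is recomputed, and the loop repeats until no unlabeled points remain.
```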

These frameworks yield robust, data-adaptive label-propagation algorithms that outperform traditional kernel- and neighbor-based methods while providing finite-time convergence guarantees.

3. OT-Regularized Neighborhood Graphs: Adaptive Connectivity

Quadratically regularized OT provides an alternative means of constructing affinity graphs for spectral and propagation-based SSL. Here, the problem

$$\min_{\pi\in\Pi} \langle \pi, C \rangle + \varepsilon \|\pi\|_F^2,$$

where $\Pi$ enforces symmetric transport and row-stochasticity, gives rise to sparsity-inducing, density-adaptive graphs with a single parameter $\varepsilon$. The edge set and weights adapt to local sampling density automatically; regions of higher density induce denser connectivity, and areas of lower density yield sparser graphs (Matsumoto et al., 2022).

The adjacency matrix $W=\pi$ is directly used as the input for established semi-supervised propagation methods such as Learning with Local and Global Consistency (LLGC), enabling improved manifold learning (lower principal subspace error) and semi-supervised classification (notably, 98% accuracy on synthetic high-dimensional spirals with only 2.5% labels, compared to 70% for kNN+Gaussian) (Matsumoto et al., 2022).
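For concreteness, the sketch below solves a simplified variant of this quadratically regularized problem with ordinary (non-symmetric) marginal constraints, maximizing the dual with SciPy's L-BFGS and recovering the characteristic sparse plan; the symmetric, row-stochastic constraint set of (Matsumoto et al., 2022) and its dedicated solver are not reproduced here, and the $\varepsilon/2$ scaling is an equivalent reparameterization chosen for a cleaner dual.

```python
import numpy as np
from scipy.optimize import minimize

def quad_reg_ot_graph(X, Y, eps=1.0):
    """Sparse affinity graph from quadratically regularized OT (simplified variant).

    Solves  min_{pi in U(a,b)} <pi, C> + (eps/2) * ||pi||_F^2  via its dual;
    the optimal plan pi_ij = max(0, f_i + g_j - C_ij) / eps is sparse by construction.
    """
    n, m = len(X), len(Y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

    def neg_dual(fg):
        f, g = fg[:n], fg[n:]
        slack = np.maximum(0.0, f[:, None] + g[None, :] - C)   # active (nonzero) plan entries
        val = f @ a + g @ b - (slack ** 2).sum() / (2 * eps)
        grad = np.concatenate([a - slack.sum(1) / eps, b - slack.sum(0) / eps])
        return -val, -grad                                     # minimize the negated dual

    res = minimize(neg_dual, np.zeros(n + m), jac=True, method="L-BFGS-B")
    f, g = res.x[:n], res.x[n:]
    return np.maximum(0.0, f[:, None] + g[None, :] - C) / eps  # use as adjacency W

# Denser sampling yields more nonzero edges; eps controls overall connectivity.
rng = np.random.default_rng(1)
W = quad_reg_ot_graph(rng.normal(size=(30, 3)), rng.normal(size=(30, 3)))
print((W > 0).mean())                                          # fraction of edges present
```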

4. OT in Deep and Structured SSL: Distribution Matching and Semantic Alignment

Recent methods integrate OT-based objectives with deep neural networks for semi-supervised classification, domain adaptation, and representation learning:

  • Pseudo-Labeling via Hierarchical OT: Matching distributions of feature clusters in labeled and unlabeled pools via Wasserstein distances (measure-of-measures OT) enables generation of high-quality pseudo-labels. These are incorporated as targets in the total loss for further network training. Entropic regularization (Sinkhorn) ensures scalable execution and effective regularization (Taherkhani et al., 2020).
  • Semantic Class-Reinforced OT: OTMatch introduces an OT-based loss aligning the student’s class-probability distribution with the teacher’s across augmentations, weighted by a cost matrix encoding semantic relationships between classes (e.g., inner products of classifier head weights). This drives the student not only to mimic hard pseudo-labels but also to respect the semantic "geometry" among classes, yielding improvements in low-label regimes and outperforming SOTA baselines (Tan et al., 2023); a loose sketch of such a semantically weighted loss follows this list.
  • Multi-Modal and Multi-Instance SSL: Bag-level entropic OT losses align predicted label distributions across modalities (e.g., image/text bags), while learning a task-specific ground metric that captures label correlation structure. This setup supports both supervised and semi-supervised learning of multi-label, multi-modal data by jointly optimizing deep encoder architectures and the ground metric with Sinkhornized OT objectives (Yang et al., 2021).
  • Domain Adaptation: In cross-domain transfer scenarios with labeled source and unlabeled/partially labeled target data, OT-based coupling aligns source and target feature distributions, incorporates label-adaptive costs, and leverages self-paced ensemble techniques to address class imbalance. This approach realizes robust knowledge transfer across hospitals for clinical early warning tasks (Ding et al., 2021).
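As a loose reconstruction of the semantically weighted idea in the OTMatch bullet above (not the exact published objective), the sketch below derives a class-to-class cost from classifier head weights and evaluates an entropically regularized transport cost between batch-averaged teacher and student class distributions; the function name, the cosine-dissimilarity cost, and the hyperparameters are assumptions.

```python
import numpy as np

def semantic_ot_consistency(p_teacher, p_student, head_W, eps=0.05, n_iter=200):
    """OT-based consistency loss between teacher and student class distributions.

    p_teacher, p_student : (n_cls,) batch-averaged class-probability vectors
    head_W               : (n_cls, d) classifier head weights; their pairwise cosine
                           dissimilarity serves as the semantic ground cost
    Returns <pi*, C>, the entropically regularized transport cost.
    """
    Wn = head_W / np.linalg.norm(head_W, axis=1, keepdims=True)
    C = 1.0 - Wn @ Wn.T                      # semantically close classes are cheap to move between
    K = np.exp(-C / eps)                     # Sinkhorn kernel
    u = np.ones_like(p_teacher)
    for _ in range(n_iter):
        v = p_student / (K.T @ u)
        u = p_teacher / (K @ v)
    pi = u[:, None] * K * v[None, :]
    return float((pi * C).sum())

# Toy usage with 4 classes and a random 16-dimensional classifier head
rng = np.random.default_rng(2)
loss = semantic_ot_consistency(np.array([0.4, 0.3, 0.2, 0.1]),
                               np.full(4, 0.25),
                               rng.normal(size=(4, 16)))
print(round(loss, 4))
```

In a training loop this scalar would be added to the usual supervised and pseudo-label losses; here it is only evaluated on fixed probability vectors.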

5. Probabilistic and Information-Theoretic OT SSL

Several works bring an OT perspective to conditional density estimation and mutual information (MI) estimation in semi-supervised settings:

  • Inverse Entropic OT for Semi-Supervised Conditional Models: Data likelihood maximization with both paired and unpaired samples is shown to be equivalent to inverse entropic OT, leading to algorithms that jointly learn a conditional distribution $\pi^*(y|x)$ leveraging all available data. The optimization alternates over cost and dual potential parameterizations with closed-form expressions in Gaussian-mixture settings, achieving universal approximation and outperforming baselines on synthetic and real tabular tasks (Persiianov et al., 3 Oct 2024).
  • Mutual Information Estimation: When only small numbers of paired samples are available, LSMI-Sinkhorn alternates density-ratio fitting with entropic OT-based coupling of unpaired samples to improve SMI estimation accuracy. The approach efficiently scales to large sample sizes and demonstrates state-of-the-art performance on image matching and summarization tasks (Liu et al., 2019).

6. Theoretical Insights, Regularization, and Empirical Evidence

OT-based semi-supervised methods offer rigorous theoretical guarantees:

  • Regularization and Dimension Adaptivity: In distributionally robust OT (DRO), using unlabeled data to restrict the support of adversarial distributions narrows the worst-case uncertainty set, yielding improved generalization rates, particularly when data lie on low-dimensional manifolds (Blanchet et al., 2017).
  • Convexity and Convergence: Entropic and quadratically regularized formulations yield strongly convex optimization, ensuring uniqueness and scalability via Sinkhorn or Newton-type solvers. Incremental label-propagation schemes based on certainty thresholds are guaranteed to terminate in finitely many steps (Hamri et al., 2021, Hamri et al., 2021).
  • Empirical Benchmarks: Across standard tasks (CIFAR-10, SVHN, UCI, multi-view image classification), OT-based methods outperform classical graph-based and cluster-based SSL, as well as more recent deep learning SSL heuristics. Improvements are observed in pseudo-label accuracy, robustness to hyperparameters, and downstream task metrics (test error, NMI, ARI, macro/micro AUC) (Tan et al., 2023, Taherkhani et al., 2020, Hamri et al., 2021, Yang et al., 2021, Matsumoto et al., 2022, Hamri et al., 2021).

7. Implementation Considerations and Future Directions

OT-based SSL methods require careful algorithmic engineering, particularly regarding cost computation, regularization parameter selection, and scalable optimization. Entropic regularization and closed-form kernel approximations are essential for making large-scale problems tractable. The construction or learning of ground cost matrices (for class semantics, modality correlation, or bag-level structure) is pivotal in practice. Open questions remain regarding automated hyperparameter tuning, online or streaming integration, and adaptation to open-world class distributions.
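One recurring engineering point is numerical stability: with small entropic regularization the kernel $\exp(-C/\varepsilon)$ underflows, so practical implementations iterate in the log domain and often scale $\varepsilon$ relative to the magnitude of the cost matrix. The sketch below shows that pattern; the median-cost scaling is a common heuristic, not a choice prescribed by the papers cited above.

```python
import numpy as np
from scipy.special import logsumexp

def log_domain_sinkhorn(C, a, b, eps_rel=0.05, n_iter=500):
    """Numerically stable Sinkhorn: potentials updated entirely in the log domain.

    eps is scaled by the median ground cost so that exp(-C/eps) stays in a
    usable numerical range across datasets of different scales.
    """
    eps = eps_rel * np.median(C)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iter):
        f = eps * (np.log(a) - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (np.log(b) - logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)      # transport plan

# Toy usage: uniform marginals over two random point clouds
rng = np.random.default_rng(3)
X, Y = rng.normal(size=(40, 5)), rng.normal(size=(60, 5))
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
a, b = np.full(40, 1 / 40), np.full(60, 1 / 60)
pi = log_domain_sinkhorn(C, a, b)
print(abs(pi.sum(axis=1) - a).max())                        # marginal violation, should be small
```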

Emerging directions include extending OT-regularized objectives to conditional generative models, diffusion models, and large-scale semi-paired or domain translation scenarios; leveraging OT for more general structured prediction; and integrating OT-principled regularization into end-to-end neural architectures (Persiianov et al., 3 Oct 2024, Gu et al., 2023).

