Optimal Transport and GANs

Updated 2 April 2026

Optimal Transport is a mathematical framework that quantifies and minimizes differences between distributions using geometric metrics like the Wasserstein distance.
GANs built on OT principles mitigate common failure modes such as mode collapse, which tends to stabilize training and improve sample diversity.
Advanced OT-GAN techniques, including Sinkhorn divergence and Voronoi ensemble methods, offer improved statistical guarantees and generative performance.

Optimal transport (OT) and generative adversarial networks (GANs) are intimately connected through their shared objective: matching complex probability distributions, typically in high-dimensional spaces. OT provides a principled geometric and statistical foundation for quantifying and minimizing distributional discrepancies, most notably via Wasserstein distances. GAN frameworks leveraging OT principles have led to robust and stable generative models, improved sample diversity, provable statistical guarantees, and new algorithmic architectures. The OT formalism also clarifies and unifies a wide spectrum of GAN objectives and training regimes, connecting deep generative modeling to geometric statistics, convex duality, and measure-theoretic transport.

1. Mathematical Foundations: Optimal Transport, Wasserstein Distances, and Duality

Optimal transport classically seeks a map or coupling that transforms one distribution into another at minimal cost. The Monge formulation constructs a deterministic transport map, while the Kantorovich relaxation allows mass splitting via couplings $\pi \in \Pi(\mu,\nu)$ between probability measures $\mu$ (source) and $\nu$ (target). For cost $c(x,y)$ (e.g., squared Euclidean), the OT cost is

$W_c(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)} \int c(x,y)\, d\pi(x,y).$
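For two uniform empirical measures with the same number of atoms, an optimal coupling can be taken to be a permutation, so the Kantorovich problem can be checked by brute force on tiny inputs. The sketch below does that; the function name `discrete_ot_cost` and the sample points are illustrative assumptions, and the factorial running time makes it unsuitable beyond a handful of points.

```python
from itertools import permutations


def discrete_ot_cost(xs, ys, cost):
    """Brute-force OT cost between two uniform empirical measures.

    Assumes len(xs) == len(ys) and weight 1/n on every atom, in which case
    some optimal coupling is a permutation (each x_i sent to one y_j).
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("this sketch needs the same number of atoms on both sides")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average transport cost of the coupling x_i -> y_perm[i].
        total = sum(cost(xs[i], ys[j]) for i, j in enumerate(perm)) / n
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, best_perm


def squared(x, y):
    """Squared Euclidean cost on the real line."""
    return (x - y) ** 2


value, plan = discrete_ot_cost([0.0, 1.0, 3.0], [5.0, 2.0, 1.0], squared)
```

In one dimension with a convex cost the optimal plan pairs points in sorted order, which is what the returned permutation reflects for this input.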

The dual (Kantorovich) formulation, e.g. for $c(x,y)=\|x-y\|$, is

$W_1(\mu,\nu)=\sup_{\|f\|_{\mathrm{Lip}}\leq 1} \left\{ \int f\,d\mu - \int f\,d\nu \right\},$

where the supremum is over 1-Lipschitz functions $f$. These dual potentials appear directly as GAN discriminators under appropriate regularization.
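A small one-dimensional check can make the duality concrete. For equal-size samples on the real line, $W_1$ is the mean absolute difference of the sorted samples, and any 1-Lipschitz $f$ gives a lower bound through the dual expression. The helper names `w1_empirical_1d` and `dual_value` and the sample values are assumptions made for this sketch.

```python
def w1_empirical_1d(xs, ys):
    """W_1 between two equal-size empirical measures on the real line.

    In 1D the optimal coupling matches sorted samples, so the primal value
    is the average absolute gap between order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch needs equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def dual_value(f, xs, ys):
    """Dual objective  int f dmu - int f dnu  for a candidate potential f."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)


xs = [2.0, 3.5, 5.0, 2.5]
ys = [0.0, 1.5, 3.0, 0.5]  # xs shifted left by 2

w1 = w1_empirical_1d(xs, ys)
tight = dual_value(lambda t: t, xs, ys)        # f(t) = t is 1-Lipschitz
loose = dual_value(lambda t: 0.5 * t, xs, ys)  # also 1-Lipschitz, but suboptimal
```

For a pure shift the identity potential attains the supremum, so `tight` matches `w1`, while `loose` stays strictly below it.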

Semi-discrete OT arises when one distribution is continuous and one is discrete (e.g., a sum of Diracs), naturally leading to Voronoi/Laguerre tessellations and connections to geometric clustering. In high dimensions, entropic regularization yields computationally efficient and differentiable approximations to the true OT distance, and the Sinkhorn divergence corrects the bias that the entropic term introduces, which makes both suitable for minibatch stochastic optimization.
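The following NumPy sketch shows one way the entropic OT value and the Sinkhorn divergence could be computed on two small point clouds, using log-domain Sinkhorn iterations. The function names, the squared Euclidean cost, the regularization strength `eps`, and the fixed iteration count are all assumptions chosen for illustration, not settings taken from the cited works.

```python
import numpy as np


def _logsumexp(m, axis):
    # Numerically stable log(sum(exp(m))) along one axis.
    mx = m.max(axis=axis)
    return mx + np.log(np.exp(m - np.expand_dims(mx, axis)).sum(axis=axis))


def entropic_ot(x, y, eps=1.0, n_iter=500):
    """Dual value of entropic OT between uniform empirical measures on x, y."""
    n, m = len(x), len(y)
    log_a = np.full(n, -np.log(n))  # log-weights of the source atoms
    log_b = np.full(m, -np.log(m))  # log-weights of the target atoms
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        # Alternating updates of the two dual potentials.
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - cost) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - cost) / eps, axis=0)
    return f.mean() + g.mean()


def sinkhorn_divergence(x, y, eps=1.0, n_iter=500):
    """Entropic OT with the two self-terms subtracted to remove entropic bias."""
    return (entropic_ot(x, y, eps, n_iter)
            - 0.5 * entropic_ot(x, x, eps, n_iter)
            - 0.5 * entropic_ot(y, y, eps, n_iter))


rng = np.random.default_rng(0)
x = rng.normal(size=(40, 2))
y = x + np.array([1.0, 1.0])  # translated copy of x

s_xy = sinkhorn_divergence(x, y)
s_xx = sinkhorn_divergence(x, x)
```

Subtracting the self-terms is what makes the divergence vanish when both arguments coincide, whereas the raw entropic value `entropic_ot(x, x)` is generally nonzero.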

2. OT-based GANs: Algorithms, Losses, and Theoretical Guarantees

Many modern GAN objectives are closely related to OT theory, especially via the Wasserstein GAN (WGAN), which minimizes the Wasserstein-1 distance between the generated and real data distributions using a neural 1-Lipschitz critic. This approach mitigates the mode collapse and vanishing gradients seen in classical GANs, although it does not eliminate them entirely.
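Written out with the dual form of $W_1$ from Section 1, the WGAN problem can be stated as follows. The symbols $g_\theta$ (generator), $p_z$ (latent prior), and $\mu_\theta = (g_\theta)_{\#}p_z$ (generated distribution) are notation assumed here for illustration, with $\nu$ the data distribution:

$\min_{\theta} W_1(\mu_\theta,\nu) = \min_{\theta}\ \sup_{\|f\|_{\mathrm{Lip}}\leq 1} \left\{ \int f\,d\mu_\theta - \int f\,d\nu \right\} = \min_{\theta}\ \sup_{\|f\|_{\mathrm{Lip}}\leq 1} \left\{ \mathbb{E}_{z\sim p_z}[f(g_\theta(z))] - \mathbb{E}_{x\sim\nu}[f(x)] \right\}.$

In practice the supremum is taken over a neural critic family that satisfies the Lipschitz constraint only approximately, so the trained objective approximates $W_1$ rather than computing it exactly.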

The OT-GAN family extends this further:

Mini-batch Sinkhorn GANs leverage regularized OT losses (Sinkhorn divergence) over minibatch-constructed cost matrices in a learned feature space, yielding unbiased gradients and improved sample quality (Salimans et al., 2018).
Informative OT-GANs incorporate additional mutual information terms and structured regularization in the OT plan to enforce disentanglement in latent spaces (Bréchet et al., 2019).
AE-OT eliminates the min-max saddle-point by training a discriminator to solve the Kantorovich dual in the latent space, extracting the Brenier map directly for generative use (Liu et al., 2018).

Statistical guarantees have been developed for these OT-based GANs. For GANs minimizing the Sinkhorn divergence, the rate of convergence to the target distribution can be shown to be independent of ambient data dimension, instead depending only on generator complexity and latent space dimension, provided that both the generator and the latent distribution are learned jointly. This yields improved sample complexity compared to fixing the prior (Luise et al., 2020).

3. Semi-Discrete OT, Voronoi Structure, and Ensemble GANs

Semi-discrete OT provides a powerful mechanism for partitioning complex data distributions. In "k-GANs" (Ambrogioni et al., 2019), deep generative models are structured as an ensemble of $k$ GANs, each assigned to a Voronoi cell induced by a set of learnable Dirac prototypes in data space. Each generator learns to transport the prototype mass to the empirical data restricted to its cell, via a local WGAN loss. The alternation of generator and prototype updates is analogous to k-medoids clustering, and the ensemble structure directly mitigates mode collapse. Each generator only needs to model a localized region (or mode), instead of the full, potentially multimodal distribution.
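The partitioning step can be sketched in a few lines: each data point is assigned to its nearest prototype, and each generator would then see only the points in its own cell. The code covers only this Voronoi assignment under a Euclidean metric; the function names and toy data are assumptions, and the generator training and prototype updates of the actual method are omitted.

```python
import numpy as np


def voronoi_assign(data, prototypes):
    """Index of the nearest prototype (Euclidean) for every data point."""
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def split_by_cell(data, labels, k):
    """Per-cell training sets; generator j would only be trained on cells[j]."""
    return [data[labels == j] for j in range(k)]


prototypes = np.array([[-2.0, 0.0], [2.0, 0.0]])
data = np.array([[-2.1, 0.3], [-1.5, -0.2], [1.8, 0.1], [2.5, -0.4], [0.2, 1.0]])

labels = voronoi_assign(data, prototypes)
cells = split_by_cell(data, labels, k=len(prototypes))
```

Because the cells are disjoint and cover the data, no mode is left without a responsible generator, which is the mechanism behind the reduced mode collapse described above.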

This Voronoi-ensemble GAN approach yields near-perfect mode coverage in multimodal toy examples and maintains distinct clusters in real-world data (MNIST, Fashion-MNIST), outperforming standard single-GAN baselines, which are prone to mode collapse.

4. Direct Estimation and Consistency of OT Maps via Neural Networks

Recent work addresses learning the OT maps themselves, rather than only matching distributions:

Direct GAN estimation of OT maps is made possible using strongly regularized neural networks with explicit Lipschitz constraints, e.g., GroupSort networks (González-Sanz et al., 2022). Here, the generator is directly regularized to approximate the unique Monge map between source and target, with uniform convergence guarantees under smoothness and regularity assumptions. This is crucial for tasks needing explicit, certified transfer between distributions, such as counterfactual reasoning or structured data imputation.
Explicit saddle-point optimization and min-max constructions in the ambient space (not just latent space) allow recovering the optimal transport map under the quadratic cost, with theoretical error bounds and strong empirical performance for both image generation and image restoration tasks (Rout et al., 2021).
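To see what an OT map looks like in the simplest case, note that in one dimension the optimal map for the quadratic cost is the monotone rearrangement, obtained by composing the source CDF with the target quantile function. The sketch below estimates it from samples; it is a closed-form 1D stand-in, not the neural estimators of the works cited above, and `monotone_map_1d` is a hypothetical helper name.

```python
import numpy as np


def monotone_map_1d(source, target):
    """Empirical monotone rearrangement T = F_target^{-1} o F_source."""
    src_sorted = np.sort(source)
    tgt_sorted = np.sort(target)

    def transport(x):
        # Empirical source CDF at x, then the matching target quantile.
        ranks = np.searchsorted(src_sorted, x, side="right") / len(src_sorted)
        return np.quantile(tgt_sorted, np.clip(ranks, 0.0, 1.0))

    return transport


rng = np.random.default_rng(0)
source = rng.normal(size=2000)
target = 3.0 + 2.0 * rng.normal(size=2000)

T = monotone_map_1d(source, target)
pushed = T(source)                    # source samples moved onto the target
grid_image = T(np.linspace(-3, 3, 50))
```

The map is nondecreasing by construction, and pushing the source samples through it approximately reproduces the target's mean and spread.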

Research on these methods has established the first statistically consistent estimators for neural-network-based OT maps, with practical scaling to high dimensions. These approaches bypass some of the limitations of WGAN, which only enforces Lipschitzness on the critic and does not directly compute OT maps.

5. Regularization, Statistical Stability, and Robustness

OT regularization is critical to computational stability and robust generalization in GANs. Entropic or quadratic regularization of the OT objective leads to smooth generator losses, allowing the use of first-order optimizers and automatic differentiation. Regularized OT GANs have provable convergence to stationary points and are more robust to hyperparameter selection and to the stochastic geometry of finite samples (Sanjabi et al., 2018).

OT relaxation frameworks enable controlled bias-variance tradeoffs and explicit control over the number of generator vs. critic updates, with performance comparable to unregularized methods but improved computational efficiency (Mahdian et al., 2019). Similarly, robust OT formulations that relax marginal constraints via $f$-divergences or allow sample reweighting (e.g., to downweight outliers) yield GANs stable under considerable contamination, improving performance on corrupted datasets without loss of quality on clean data (Balaji et al., 2020). These robust GANs introduce adaptive per-sample weighting mechanisms during adversarial training.
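One common way to relax the marginal constraints, stated here as a generic sketch rather than the exact objective of the cited works, penalizes the deviation of the coupling's marginals $\pi_1,\pi_2$ from $\mu,\nu$ with an $f$-divergence $D_f$ and a weight $\tau>0$ (both symbols are assumptions introduced for illustration):

$W_{c,\tau}(\mu,\nu) = \inf_{\pi\geq 0} \left\{ \int c(x,y)\, d\pi(x,y) + \tau\, D_f(\pi_1\,\|\,\mu) + \tau\, D_f(\pi_2\,\|\,\nu) \right\}.$

Here $\pi$ ranges over nonnegative measures instead of $\Pi(\mu,\nu)$. Letting $\tau\to\infty$ forces $\pi_1=\mu$ and $\pi_2=\nu$, which recovers the balanced cost $W_c$ from Section 1, while small $\tau$ lets the plan discard mass at outliers.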

6. Geometry, Disentanglement, and Structure via OT

OT theory offers a geometric perspective that clarifies the internal structure of GANs:

The Alexandrov–Brenier convex geometry view leads to generative models where only the Kantorovich dual (discriminator) is trained; the generator/transport map is then extracted via analytical formulas, eliminating adversarial competition and improving mode coverage in low-dimensional settings (Lei et al., 2017).
Structured regularization of the transport plan, particularly focusing on mutual information between prescribed latent subspaces and data, leads to generative models with improved disentanglement, yielding more interpretable and robust latent representations (Bréchet et al., 2019).
In latent space, OT theory allows for mathematically principled corrections to standard vector arithmetic, interpolation, and sampling operations, ensuring that latent traversals preserve the model’s prior, thereby enhancing sample fidelity and robustness of downstream synthetic data (Agustsson et al., 2017).
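The latent-space correction is easiest to see for a standard Gaussian prior. A linear interpolation of two independent $N(0,I)$ samples is again Gaussian but with shrunken variance $(1-t)^2+t^2$, and the OT map back to the prior under the quadratic cost is a simple rescaling. The sketch below illustrates this special case only; the function names are assumptions, and other priors or operations need different maps.

```python
import numpy as np


def linear_interp(z1, z2, t):
    """Plain linear interpolation; its output is NOT distributed as the prior."""
    return (1.0 - t) * z1 + t * z2


def matched_interp(z1, z2, t):
    """Linear interpolation rescaled so that, for z1, z2 iid N(0, I),
    the result is again N(0, I)."""
    return linear_interp(z1, z2, t) / np.sqrt((1.0 - t) ** 2 + t ** 2)


rng = np.random.default_rng(0)
z1 = rng.normal(size=(20000, 8))
z2 = rng.normal(size=(20000, 8))

naive = linear_interp(z1, z2, 0.5)
matched = matched_interp(z1, z2, 0.5)
```

At the midpoint the naive interpolant has roughly half the prior variance, so the generator is queried in a region it rarely saw during training; the rescaled version restores unit variance.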

These geometric and structural insights underpin both the stability and interpretability of modern OT-based GAN frameworks.

7. Broader Implications, Extensions, and Applications

The synergy between OT and GANs has motivated novel applications beyond classic image generation:

Domain adaptation and population modeling: Unbalanced OT GANs, which allow variable mass transport, are applied to problems with population changes or missing/misaligned support, such as cell lineage tracing or domain adaptation between distributions with changing mass or classes (Yang et al., 2018; Prasad et al., 2020).
Mean-field control, games, and PDEs: Optimal transport and GAN frameworks have been unified with mean-field control and game-theoretic methods, leading to adversarial networks that solve coupled PDE systems (e.g., Hamilton–Jacobi–Bellman and Fokker–Planck) through gradient-based, adversarial training (Cao et al., 2020).
Empirical evaluation and benchmarking: Systematic studies quantify the consistency and reliability of WGAN-like OT-discriminator solvers in high dimensions, showing that while no practical solver is a perfect OT ground-truth estimator, gradient-penalty variants most reliably deliver usable gradients for generator updates. The Wasserstein-1 objective should not be interpreted as a true OT cost in high dimension, but its gradients can nonetheless yield effective generative models (Korotin et al., 2022).
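For reference, the penalty usually meant by "gradient-penalty variants" adds a soft Lipschitz constraint to the critic loss. In the form below, which the critic $f$ minimizes, $\mu$ is the generated distribution, $\nu$ the data distribution, $\lambda>0$ a penalty weight, and $\hat{x}$ a point drawn uniformly along segments between real and generated samples; this notation is assumed here and is not taken from the cited study:

$\mathcal{L}_{\mathrm{critic}}(f) = \int f\,d\nu - \int f\,d\mu + \lambda\, \mathbb{E}_{\hat{x}}\left[ \left( \|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1 \right)^2 \right].$

The first two terms are the negated dual objective of $W_1$ from Section 1, and the last term pushes the critic's gradient norm toward 1 instead of enforcing the Lipschitz constraint exactly.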

Overall, OT provides a powerful framework for building and analyzing deep generative models, with a formal theory that can be leveraged for robust, geometrically principled, and statistically justified training of GANs and their variants. This connection has led to new algorithmic paradigms, practical improvements, and theoretical understanding essential to the development of next-generation probabilistic models.