
Adapted Wasserstein Distances

Updated 1 July 2025
  • Adapted Wasserstein distances extend standard optimal transport to stochastic processes by requiring couplings to respect temporal or causal information structures.
  • These distances are crucial in dynamic applications like stochastic optimization, finance, and machine learning, ensuring robustness and Lipschitz continuity of value functions and metrics under model changes.
  • Empirical estimation is challenging with standard methods but is achieved through adapted empirical measures, kernel smoothing, and leveraging sharp bounds that relate AW distances to classical Wasserstein.

Adapted Wasserstein distances extend the classical Wasserstein (optimal transport) framework to probability laws on stochastic processes, incorporating the time structure or filtration naturally arising in applications such as stochastic optimization, mathematical finance, and machine learning. Unlike traditional Wasserstein distances, which compare static probability measures and ignore data’s temporal evolution, adapted Wasserstein distances enforce coupling constraints that respect the filtration, enabling robust analysis of dynamic, information-driven systems.

1. Fundamentals and Motivation

The adapted Wasserstein distance (often denoted $\mathcal{AW}_p$ for order $p$) generalizes optimal transport to settings where time or causality structures matter. For the standard Wasserstein distance $W_p(\mu,\nu)$, all couplings (joint distributions) between $\mu$ and $\nu$ are admissible; in contrast, $\mathcal{AW}_p$ admits only bicausal couplings, i.e., those that are non-anticipative (adapted) in both coordinates. For two discrete-time stochastic processes (or path measures) $\mu, \nu$ on $(\mathbb{R}^d)^T$, the adapted Wasserstein distance of order $p$ is defined as

$$\mathcal{AW}_p(\mu, \nu) = \inf_{\pi \in \mathrm{Cpl}^{bc}(\mu, \nu)} \left( \int \sum_{t=1}^{T} \|x_t - y_t\|^p \, d\pi(x, y) \right)^{1/p},$$

where the infimum runs over all bicausal couplings $\pi$: couplings under which, at each time $t$, the law of each process's future values is conditionally independent of the other process's future, given the past.

This structure is fundamental in applications where decisions/actions can only be based on information available up to the present time, as in multi-stage stochastic optimization, mathematical finance, and sequential learning.

2. Mathematical Properties and Comparison to Standard Wasserstein Distance

  • Causality/Adaptedness: $\mathcal{AW}_p$ is strictly stronger than $W_p$, imposing non-anticipativity on couplings. It is thus a larger (or equal) distance and induces a finer topology.
  • Principal Recursion: For $T=2$,

$$\mathcal{AW}_p^p(\mu, \nu) = \inf_{\gamma \in \mathrm{Cpl}(\mu_1, \nu_1)} \int \|x_1 - y_1\|^p + W_p^p(\mu_{x_1}, \nu_{y_1}) \; d\gamma(x_1, y_1),$$

with $\mu_{x_1}$, $\nu_{y_1}$ the conditional laws at time $2$ given $x_1, y_1$.

  • Optimal Couplings: In the classical case the optimal coupling may be deterministic (a Monge map); in the adapted case this rarely occurs: optimal bicausal couplings may be non-unique or non-deterministic, and may fail to be Gaussian even when $\mu$ and $\nu$ are Gaussian.
  • Continuity Properties: $\mathcal{AW}_p$ induces a stronger topology than $W_p$; certain operations (optimal stopping, dynamic programming, semimartingale decomposition) are continuous in $\mathcal{AW}_p$ but not in $W_p$.
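For finitely supported two-period processes, the $T=2$ recursion can be evaluated directly: first solve classical OT problems between the conditional time-2 laws, then an outer OT problem over the time-1 marginals. A minimal sketch in Python (toy processes of our own choosing; exact discrete OT solved by linear programming via SciPy):

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(C, mu, nu):
    """Exact discrete optimal transport cost min <pi, C> via linear programming."""
    m, n = C.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row marginals of pi equal mu
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # column marginals of pi equal nu
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None))
    return res.fun

def laws(proc):
    """First marginal and conditional time-2 laws of {(x1, x2): prob}."""
    first, cond = {}, {}
    for (x1, x2), p in proc.items():
        first[x1] = first.get(x1, 0.0) + p
        d = cond.setdefault(x1, {})
        d[x2] = d.get(x2, 0.0) + p
    for x1, d in cond.items():
        for x2 in d:
            d[x2] /= first[x1]
    return first, cond

def w1(a, b):
    """Classical W1 between discrete 1-D laws given as {value: prob}."""
    xs, ys = sorted(a), sorted(b)
    C = np.array([[abs(x - y) for y in ys] for x in xs])
    return ot_cost(C, np.array([a[x] for x in xs]), np.array([b[y] for y in ys]))

def adapted_w1(mu, nu):
    """AW1 for two-period processes via the T=2 backward recursion."""
    mu1, muc = laws(mu)
    nu1, nuc = laws(nu)
    xs, ys = sorted(mu1), sorted(nu1)
    C = np.array([[abs(x - y) + w1(muc[x], nuc[y]) for y in ys] for x in xs])
    return ot_cost(C, np.array([mu1[x] for x in xs]), np.array([nu1[y] for y in ys]))

# Toy pair: nu's first step reveals its second step; mu's does not.
mu = {(0.0,  1.0): 0.5, (0.0, -1.0): 0.5}
nu = {(0.1,  1.0): 0.5, (-0.1, -1.0): 0.5}
aw = adapted_w1(mu, nu)   # 1.1 here, versus classical W1 = 0.1
```

The gap between the two values illustrates the finer topology: the processes are close as static path measures but far apart once the flow of information is enforced.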

3. Estimation and Empirical Convergence

  • Empirical Challenges: Standard empirical measures do not, in general, converge under the adapted Wasserstein metric, unlike the classical setting.
  • Adapted Empirical Measures: To overcome this, adapted empirical measures (A-Emp) and adapted smoothed empirical measures (AS-Emp) are constructed using quantization, smoothing, and adapted projection. These measures achieve convergence in mean, in deviation, and almost surely under $\mathcal{AW}_p$ for broad classes of processes. For example, smoothing empirical samples with Gaussian kernels and projecting onto adaptive grids (possibly with random shifts) achieves both discreteness and statistical convergence, even in high dimensions or with multi-step dependence.
  • Convergence Rates: Under regularity assumptions, rates nearly match those for classical Wasserstein (e.g., $n^{-1/(dT)}$ in expectation for $dT$-dimensional processes, and $n^{-1/2}$ in highly regular or one-dimensional cases). Recent work gives fast convergence via kernel smoothing and sharp modulus-of-continuity bounds.
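The quantization step behind adapted empirical measures can be sketched as follows: each sample path is projected coordinate-wise onto a grid of spacing $h$ before empirical weights are aggregated. This is an illustration of the projection step only; the kernel smoothing and random grid shifts are omitted, and the function name is hypothetical.

```python
import numpy as np

def adapted_empirical(paths, h):
    """Project each sample path onto a grid of spacing h (coordinate-wise
    rounding) and aggregate the empirical weights of the resulting cells."""
    paths = np.asarray(paths, dtype=float)
    weights = {}
    for path in paths:
        cell = tuple(float(v) for v in np.round(path / h) * h)
        weights[cell] = weights.get(cell, 0.0) + 1.0 / len(paths)
    return weights

# Three noisy two-step sample paths collapse onto two grid cells.
samples = [[0.01, 0.98], [0.03, 1.02], [0.52, -0.97]]
m = adapted_empirical(samples, h=0.5)
```

Rounding nearby sample paths onto common cells is what restores convergence under $\mathcal{AW}_p$: conditional laws are then estimated from several samples per cell rather than from a single path.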

4. Explicit Bounds and Approximation via Non-causal Measures

  • Sharp Upper Bounds: A key technical advance is that $\mathcal{AW}_p$ can be sharply upper-bounded in terms of $W_p$ and certain regularity quantities (e.g., smoothness of the process's transition kernels and tail behavior), enabling practical estimation via classical tools. For instance, for measures with Lipschitz conditional kernels, $\mathcal{AW}_1(\mu, \nu) \leq C \sqrt{W_1(\mu,\nu)}$.
  • Bi-Lipschitz Estimates: The adapted total variation distance (weighted appropriately) is shown to be linearly controlled by the classical total variation distance, with constants scaling only polynomially (not exponentially) with process length.
  • Practical Estimation: This allows leveraging computationally efficient classical Wasserstein solvers and empirical process theory for statistical estimation of $\mathcal{AW}_p$, circumventing the need to solve bicausal transport problems explicitly in many settings.

5. Applications in Stochastic Optimization, Finance, and Machine Learning

  • Stochastic Programming and Control: The AW distance is now a standard metric in robust multi-stage stochastic optimization. Value functions for dynamic programming, optimal stopping, or sequential utility maximization are Lipschitz continuous in $\mathcal{AW}_p$; that is,

$$|V(\mu) - V(\nu)| \leq C\, \mathcal{AW}_p(\mu, \nu),$$

for a constant $C$ independent of the models $\mu, \nu$.

  • Mathematical Finance: In pricing and hedging problems, the AW metric ensures that hedging errors, superhedging prices, and risk measures (such as AVaR) are controlled robustly, via Lipschitz continuity, under changes of model, however pronounced the model's temporal structure.
  • Causal Inference and Time Series Learning: AW distances (and their causal generalizations) are used to measure differences between dynamic generative models, establish the continuity of policy performance, or robustify reinforcement learning algorithms.
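Why the adapted distance, rather than the classical one, must appear in such Lipschitz estimates can be seen on a toy optimal stopping problem (hypothetical two-period processes; for this pair one can check $W_1 = 0.1$ while $\mathcal{AW}_1 = 1.1$):

```python
def stopping_value(proc):
    """Optimal stopping value E[max(X1, E[X2 | X1])] of a two-period
    process given as {(x1, x2): prob}, with payoff equal to the state."""
    first, wsum = {}, {}
    for (x1, x2), p in proc.items():
        first[x1] = first.get(x1, 0.0) + p
        wsum[x1] = wsum.get(x1, 0.0) + p * x2   # unnormalized E[X2 | X1 = x1]
    return sum(p * max(x1, wsum[x1] / p) for x1, p in first.items())

mu = {(0.0,  1.0): 0.5, (0.0, -1.0): 0.5}   # first step reveals nothing
nu = {(0.1,  1.0): 0.5, (-0.1, -1.0): 0.5}  # first step reveals the second
gap = abs(stopping_value(mu) - stopping_value(nu))   # 0.45
```

The value gap of $0.45$ far exceeds $W_1 = 0.1$, so no small Lipschitz constant in $W_1$ can hold; it is, however, bounded by $\mathcal{AW}_1 = 1.1$, consistent with the estimate above.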

6. Algorithmic and Statistical Developments

  • Hybrid and Approximate Metrics: Hybrid (adapted) Wasserstein distances have been developed for scalable clustering and shrinkage estimation, decomposing shape, location, and scale effects to enable efficient high-dimensional computation.
  • Smoothing and Interpolation: Entropic regularization yields well-posed, efficiently solvable adapted transport problems, with explicit formulas for multidimensional Gaussian processes—including the entropic adapted Wasserstein distance.
  • Explainability and Attribution: Recent work leverages neuralization and layer-wise relevance propagation to attribute Wasserstein distances (or adapted analogues) to features, samples, or subspaces, making these metrics interpretable in practical machine learning diagnostics and shift analysis.
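In the classical (non-adapted) case, the entropically regularized transport problem is solved by Sinkhorn iterations; a minimal sketch is given below (the adapted/bicausal variant additionally enforces causality constraints and is not shown):

```python
import numpy as np

def sinkhorn_cost(C, mu, nu, eps=0.05, iters=1000):
    """Entropic OT: minimize <pi, C> + eps * KL(pi | mu x nu) by
    alternately rescaling the Gibbs kernel K = exp(-C / eps) so that
    the plan's row and column marginals match mu and nu."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    pi = u[:, None] * K * v[None, :]     # the regularized optimal plan
    return float((pi * C).sum())

# One source point, two targets: every feasible plan has transport cost 0.5.
C = np.array([[0.0, 1.0]])
cost = sinkhorn_cost(C, np.array([1.0]), np.array([0.5, 0.5]))  # 0.5
```

The same alternating-scaling idea underlies the entropic adapted problems mentioned above, with the scaling applied stage by stage along the filtration.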

7. Theoretical Foundations and Future Directions

  • Causal Optimal Transport and Graphical Extensions: Generalizations of adapted Wasserstein distances respect not just time but arbitrary causal structures (as in DAGs), enabling robust analysis of domain adaptation, causal inference, and intervention.
  • Probabilistic and Geometric Structures: The AW space is a geodesic space (though incomplete), with explicit geodesics in the space of filtered processes and adapted versions of the Bures–Wasserstein distance for Gaussian laws. Skorokhod-type representation theorems show that convergence in AW is equivalent to $L^p$ convergence of process representatives on a common probability basis.
  • Algorithmic Advances: Ongoing work concerns robust and efficient algorithms for adapted optimal transport, scalable empirical estimation, and robust learning for time series under adapted constraints.

| Feature | Classical Wasserstein ($W_p$) | Adapted Wasserstein ($\mathcal{AW}_p$) |
| --- | --- | --- |
| Information structure | None (static) | Bicausal/adapted (temporal/causal) |
| Empirical convergence | Yes | Only with smoothing or adapted empirical measures |
| Application to dynamics/control | Not robust | Lipschitz robustness for value functions |
| Topological strength | Coarser | Finer, sensitive to information flow |
| Computational complexity | Lower (standard OT solvers exist) | Higher; bounds via $W_p$ now available |
| Statistical estimation | Classical theory | Recent sharp bounds via regularity/smoothing |

Adapted Wasserstein distances serve as essential metrics for comparing and analyzing the laws of stochastic processes in modern statistical, optimization, and machine learning applications. By enforcing couplings that respect temporal or graphical causality, they ensure meaningful, robust, and interpretable results in dynamic and information-driven contexts.