
Adapted Wasserstein Distances

Updated 1 July 2025
  • Adapted Wasserstein distances extend standard optimal transport to stochastic processes by requiring couplings to respect temporal or causal information structures.
  • These distances are crucial in dynamic applications like stochastic optimization, finance, and machine learning, ensuring robustness and Lipschitz continuity of value functions and metrics under model changes.
  • Empirical estimation is challenging with standard methods but is achieved through adapted empirical measures, kernel smoothing, and leveraging sharp bounds that relate AW distances to classical Wasserstein.

Adapted Wasserstein distances extend the classical Wasserstein (optimal transport) framework to probability laws on stochastic processes, incorporating the time structure or filtration naturally arising in applications such as stochastic optimization, mathematical finance, and machine learning. Unlike traditional Wasserstein distances, which compare static probability measures and ignore data’s temporal evolution, adapted Wasserstein distances enforce coupling constraints that respect the filtration, enabling robust analysis of dynamic, information-driven systems.

1. Fundamentals and Motivation

The adapted Wasserstein distance (often denoted $\mathcal{AW}_p$ for order $p$) generalizes optimal transport to settings where time or causality structures matter. For the standard Wasserstein distance $W_p(\mu,\nu)$, all couplings (joint distributions) between $\mu$ and $\nu$ are admissible; in contrast, $\mathcal{AW}_p$ admits only bicausal couplings, i.e., those that are non-anticipative (adapted) in both coordinates. For two discrete-time stochastic processes (or path measures) $\mu, \nu$ on $(\mathbb{R}^d)^T$, the adapted Wasserstein distance of order $p$ is defined as

$$\mathcal{AW}_p(\mu, \nu) = \inf_{\pi \in \mathrm{Cpl}^{bc}(\mu, \nu)} \left( \int \sum_{t=1}^{T} \|x_t - y_t\|^p \, d\pi(x, y) \right)^{1/p},$$

where the infimum runs over all bicausal couplings $\pi$: couplings under which, at each time $t$, the law of each process's future values is conditionally independent of the other process's future, given the past.

This structure is fundamental in applications where decisions/actions can only be based on information available up to the present time, as in multi-stage stochastic optimization, mathematical finance, and sequential learning.

2. Mathematical Properties and Comparison to Standard Wasserstein Distance

  • Causality/Adaptedness: $\mathcal{AW}_p$ is strictly stronger than $W_p$, imposing non-anticipativity on couplings. It is thus a larger (or equal) distance and induces a finer topology.
  • Principal Recursion: For $T=2$,

$$\mathcal{AW}_p^p(\mu, \nu) = \inf_{\gamma \in \mathrm{Cpl}(\mu_1, \nu_1)} \int \|x_1 - y_1\|^p + W_p^p(\mu_{x_1}, \nu_{y_1}) \; d\gamma(x_1, y_1),$$

with $\mu_{x_1}$, $\nu_{y_1}$ the conditional laws at time $2$ given $x_1, y_1$.

  • Optimal Couplings: In the classical case the optimal coupling may be deterministic (a Monge map); in the adapted case this rarely occurs: optimal bicausal couplings may be non-unique or non-deterministic, and may fail to be Gaussian even when $\mu$ and $\nu$ are Gaussian.
  • Continuity Properties: $\mathcal{AW}_p$ induces a stronger topology than $W_p$; certain operations (optimal stopping, dynamic programming, semimartingale decomposition) are continuous in $\mathcal{AW}_p$ but not in $W_p$.
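For finitely supported two-period processes, the $T=2$ recursion can be evaluated directly: first solve classical OT problems between the conditional time-2 laws, then an outer OT problem over the time-1 marginals. A minimal sketch in Python (toy processes of our own choosing; exact discrete OT solved by linear programming via SciPy):

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(C, mu, nu):
    """Exact discrete optimal transport cost min <pi, C> via linear programming."""
    m, n = C.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row marginals of pi equal mu
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # column marginals of pi equal nu
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None))
    return res.fun

def laws(proc):
    """First marginal and conditional time-2 laws of {(x1, x2): prob}."""
    first, cond = {}, {}
    for (x1, x2), p in proc.items():
        first[x1] = first.get(x1, 0.0) + p
        d = cond.setdefault(x1, {})
        d[x2] = d.get(x2, 0.0) + p
    for x1, d in cond.items():
        for x2 in d:
            d[x2] /= first[x1]
    return first, cond

def w1(a, b):
    """Classical W1 between discrete 1-D laws given as {value: prob}."""
    xs, ys = sorted(a), sorted(b)
    C = np.array([[abs(x - y) for y in ys] for x in xs])
    return ot_cost(C, np.array([a[x] for x in xs]), np.array([b[y] for y in ys]))

def adapted_w1(mu, nu):
    """AW1 for two-period processes via the T=2 backward recursion."""
    mu1, muc = laws(mu)
    nu1, nuc = laws(nu)
    xs, ys = sorted(mu1), sorted(nu1)
    C = np.array([[abs(x - y) + w1(muc[x], nuc[y]) for y in ys] for x in xs])
    return ot_cost(C, np.array([mu1[x] for x in xs]), np.array([nu1[y] for y in ys]))

# Toy pair: nu's first step reveals its second step; mu's does not.
mu = {(0.0,  1.0): 0.5, (0.0, -1.0): 0.5}
nu = {(0.1,  1.0): 0.5, (-0.1, -1.0): 0.5}
aw = adapted_w1(mu, nu)   # 1.1 here, versus classical W1 = 0.1
```

The gap between the two values illustrates the finer topology: the processes are close as static path measures but far apart once the flow of information is enforced.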

3. Estimation and Empirical Convergence

  • Empirical Challenges: Standard empirical measures do not, in general, converge under the adapted Wasserstein metric, unlike the classical setting.
  • Adapted Empirical Measures: To overcome this, adapted empirical measures (A-Emp) and adapted smoothed empirical measures (AS-Emp) are constructed using quantization, smoothing, and adapted projection. These measures achieve convergence in mean, in deviation, and almost surely under $\mathcal{AW}_p$ for broad classes of processes. For example, smoothing empirical samples with Gaussian kernels and projecting onto adaptive grids (possibly with random shifts) achieves both discreteness and statistical convergence, even in high dimensions or with multi-step dependence.
  • Convergence Rates: Under regularity assumptions, rates nearly match those for classical Wasserstein (e.g., $n^{-1/(dT)}$ in expectation for $dT$-dimensional processes, and $n^{-1/2}$ in highly regular or one-dimensional cases). Recent work gives fast convergence via kernel smoothing and sharp modulus-of-continuity bounds.
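The quantization step behind adapted empirical measures can be sketched as follows: each sample path is projected coordinate-wise onto a grid of spacing $h$ before empirical weights are aggregated. This is an illustration of the projection step only; the kernel smoothing and random grid shifts are omitted, and the function name is hypothetical.

```python
import numpy as np

def adapted_empirical(paths, h):
    """Project each sample path onto a grid of spacing h (coordinate-wise
    rounding) and aggregate the empirical weights of the resulting cells."""
    paths = np.asarray(paths, dtype=float)
    weights = {}
    for path in paths:
        cell = tuple(float(v) for v in np.round(path / h) * h)
        weights[cell] = weights.get(cell, 0.0) + 1.0 / len(paths)
    return weights

# Three noisy two-step sample paths collapse onto two grid cells.
samples = [[0.01, 0.98], [0.03, 1.02], [0.52, -0.97]]
m = adapted_empirical(samples, h=0.5)
```

Rounding nearby sample paths onto common cells is what restores convergence under $\mathcal{AW}_p$: conditional laws are then estimated from several samples per cell rather than from a single path.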

4. Explicit Bounds and Approximation via Non-causal Measures

  • Sharp Upper Bounds: A key technical advance is that $\mathcal{AW}_p$ can be sharply upper-bounded in terms of $W_p$ and certain regularity quantities (e.g., smoothness of the process's transition kernels and tail behavior), enabling practical estimation via classical tools. For instance, for measures with Lipschitz conditional kernels, $\mathcal{AW}_1(\mu, \nu) \leq C \sqrt{W_1(\mu,\nu)}$.
  • Bi-Lipschitz Estimates: The adapted total variation distance (weighted appropriately) is shown to be linearly controlled by the classical total variation distance, with constants scaling only polynomially (not exponentially) with process length.
  • Practical Estimation: This allows leveraging computationally efficient classical Wasserstein solvers and empirical process theory for statistical estimation of $\mathcal{AW}_p$, circumventing the need to solve bicausal transport problems explicitly in many settings.

5. Applications in Stochastic Optimization, Finance, and Machine Learning

  • Stochastic Programming and Control: The AW distance is now a standard metric in robust multi-stage stochastic optimization. Value functions for dynamic programming, optimal stopping, or sequential utility maximization are Lipschitz continuous in $\mathcal{AW}_p$; that is,

$$|V(\mu) - V(\nu)| \leq C\, \mathcal{AW}_p(\mu, \nu),$$

for a constant $C$ independent of the models $\mu, \nu$.

  • Mathematical Finance: In pricing and hedging problems, the AW metric ensures that hedging errors, superhedging prices, and risk measures (such as AVaR) are controlled robustly, via Lipschitz continuity, under changes of model, however pronounced the model's temporal structure.
  • Causal Inference and Time Series Learning: AW distances (and their causal generalizations) are used to measure differences between dynamic generative models, establish the continuity of policy performance, or robustify reinforcement learning algorithms.
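Why the adapted distance, rather than the classical one, must appear in such Lipschitz estimates can be seen on a toy optimal stopping problem (hypothetical two-period processes; for this pair one can check $W_1 = 0.1$ while $\mathcal{AW}_1 = 1.1$):

```python
def stopping_value(proc):
    """Optimal stopping value E[max(X1, E[X2 | X1])] of a two-period
    process given as {(x1, x2): prob}, with payoff equal to the state."""
    first, wsum = {}, {}
    for (x1, x2), p in proc.items():
        first[x1] = first.get(x1, 0.0) + p
        wsum[x1] = wsum.get(x1, 0.0) + p * x2   # unnormalized E[X2 | X1 = x1]
    return sum(p * max(x1, wsum[x1] / p) for x1, p in first.items())

mu = {(0.0,  1.0): 0.5, (0.0, -1.0): 0.5}   # first step reveals nothing
nu = {(0.1,  1.0): 0.5, (-0.1, -1.0): 0.5}  # first step reveals the second
gap = abs(stopping_value(mu) - stopping_value(nu))   # 0.45
```

The value gap of $0.45$ far exceeds $W_1 = 0.1$, so no small Lipschitz constant in $W_1$ can hold; it is, however, bounded by $\mathcal{AW}_1 = 1.1$, consistent with the estimate above.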

6. Algorithmic and Statistical Developments

  • Hybrid and Approximate Metrics: Hybrid (adapted) Wasserstein distances have been developed for scalable clustering and shrinkage estimation, decomposing shape, location, and scale effects to enable efficient high-dimensional computation.
  • Smoothing and Interpolation: Entropic regularization yields well-posed, efficiently solvable adapted transport problems, with explicit formulas for multidimensional Gaussian processes—including the entropic adapted Wasserstein distance.
  • Explainability and Attribution: Recent work leverages neuralization and layer-wise relevance propagation to attribute Wasserstein distances (or adapted analogues) to features, samples, or subspaces, making these metrics interpretable in practical machine learning diagnostics and shift analysis.
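In the classical (non-adapted) case, the entropically regularized transport problem is solved by Sinkhorn iterations; a minimal sketch is given below (the adapted/bicausal variant additionally enforces causality constraints and is not shown):

```python
import numpy as np

def sinkhorn_cost(C, mu, nu, eps=0.05, iters=1000):
    """Entropic OT: minimize <pi, C> + eps * KL(pi | mu x nu) by
    alternately rescaling the Gibbs kernel K = exp(-C / eps) so that
    the plan's row and column marginals match mu and nu."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    pi = u[:, None] * K * v[None, :]     # the regularized optimal plan
    return float((pi * C).sum())

# One source point, two targets: every feasible plan has transport cost 0.5.
C = np.array([[0.0, 1.0]])
cost = sinkhorn_cost(C, np.array([1.0]), np.array([0.5, 0.5]))  # 0.5
```

The same alternating-scaling idea underlies the entropic adapted problems mentioned above, with the scaling applied stage by stage along the filtration.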

7. Theoretical Foundations and Future Directions

  • Causal Optimal Transport and Graphical Extensions: Generalizations of adapted Wasserstein distances respect not just time but arbitrary causal structures (as in DAGs), enabling robust analysis of domain adaptation, causal inference, and intervention.
  • Probabilistic and Geometric Structures: The AW space is a geodesic space (though incomplete), with explicit geodesics in the space of filtered processes and adapted versions of the Bures–Wasserstein distance for Gaussian laws. Skorokhod-type representation theorems show that convergence in AW is equivalent to $L^p$ convergence of process representatives on a common probability basis.
  • Algorithmic Advances: Ongoing work concerns robust and efficient algorithms for adapted optimal transport, scalable empirical estimation, and robust learning for time series under adapted constraints.

| Feature | Classical Wasserstein ($W_p$) | Adapted Wasserstein ($\mathcal{AW}_p$) |
| --- | --- | --- |
| Information structure | None (static) | Bicausal/adapted (temporal/causal) |
| Empirical convergence | Yes | Only with smoothing or adapted empirical measures |
| Application to dynamics/control | Not robust | Lipschitz robustness for value functions |
| Topological strength | Coarser | Finer, sensitive to information flow |
| Computational complexity | Lower (standard OT solvers exist) | Higher; bounds via $W_p$ now available |
| Statistical estimation | Classical theory | Recent sharp bounds via regularity/smoothing |

Adapted Wasserstein distances serve as essential metrics for comparing and analyzing the laws of stochastic processes in modern statistical, optimization, and machine learning applications. By enforcing couplings that respect temporal or graphical causality, they ensure meaningful, robust, and interpretable results in dynamic and information-driven contexts.