Cumulative Wasserstein Drift

Updated 2 January 2026
  • Cumulative Wasserstein drift is the sum of Wasserstein distances between successive probability measures, quantifying total distributional change over time.
  • It underpins finite-sample guarantees, dynamic regret bounds, and concentration inequalities in stochastic processes and online optimization.
  • Optimal weighting schemes leveraging cumulative drift balance bias and variance, guiding parameter choices in nonstationary and robust methods.

Cumulative Wasserstein drift quantifies the total “distance” traversed by a time-evolving sequence of probability measures, typically under the Wasserstein metric, over a given time horizon. It is a central nonstationarity measure in stochastic processes, online optimization, Markov dynamics, and empirical process theory, capturing both instantaneous and aggregated changes in distributions. Rigorous frameworks for cumulative Wasserstein drift underpin concentration inequalities, dynamic regret bounds, convergence rates for flows in the space of measures, and finite-sample guarantees in distributionally robust optimization.

1. Definition and Foundational Concepts

Let $\{P_t\}_{t=1}^T$ be a sequence of probability measures on a Polish space $\Xi$. The $p$-Wasserstein distance between $P_t$ and $P_{t+1}$ at each time $t$ is given by

$$\Delta_t := W_p(P_t, P_{t+1})$$

The unweighted cumulative Wasserstein drift over $T$ periods is

$$D_T := \sum_{t=1}^{T-1} \Delta_t$$

This sum captures the total geometric “movement” of the underlying data-generating law as measured in the Wasserstein space. In settings with weighted empirical estimators or time-decayed observations, the natural generalization is the $L_p$-norm-type drift

$$D_p(w) := \left( \sum_{t=1}^T w_t (T-t+1)^p \right)^{1/p} \rho$$

where $w \in \Delta_T$ is a vector of nonnegative weights summing to one (here $\Delta_T$ denotes the probability simplex, not a drift increment $\Delta_t$), and $\rho$ is a uniform bound on $\Delta_t$ (Keehan et al., 21 Oct 2025).
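As a minimal illustration of the unweighted drift $D_T$, the sketch below computes it for one-dimensional empirical measures. It assumes every period has the same number of equally weighted atoms, in which case $W_p$ reduces to matching sorted samples. The function names `wasserstein_1d` and `cumulative_drift` are illustrative and do not come from the cited work.

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between two 1D empirical measures with equally many, equally weighted atoms."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(xs), sorted(ys)
    # In one dimension the optimal coupling matches order statistics.
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


def cumulative_drift(samples, p=1):
    """D_T = sum_{t=1}^{T-1} W_p(P_t, P_{t+1}) for a list of per-period samples."""
    return sum(wasserstein_1d(a, b, p) for a, b in zip(samples, samples[1:]))


# Toy sequence: the same point cloud translated by 0.5 per period (T = 5).
base = [0.0, 1.0, 2.0, 3.0]
shifted = [[x + 0.5 * t for x in base] for t in range(5)]
print(cumulative_drift(shifted, p=1))
print(cumulative_drift(shifted, p=2))
```

For a pure translation, each step contributes exactly the shift size for every $p$, so the four steps here accumulate a drift of 2.0.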

2. Weighted Empirical Measures and Effective Sample Size

In nonstationary environments, weighted empirical measures are used to balance effective sample size against the impact of distributional drift: $\hat P_w := \sum_{t=1}^T w_t \delta_{\xi_t}$ with $\xi_t \sim P_t$. A key metric is the effective sample size

$$n_{\mathrm{eff}}(w) := \frac{1}{\sum_{t=1}^T w_t^2}$$

which quantifies the statistical reliability of $\hat P_w$ under the weighting scheme $w$. The interplay between $D_p(w)$ and $n_{\mathrm{eff}}(w)$ is critical for controlling estimation error and variance in time-evolving data (Keehan et al., 21 Oct 2025).
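The tradeoff between $n_{\mathrm{eff}}(w)$ and $D_p(w)$ can be seen numerically. The following sketch evaluates both quantities for three common weighting schemes. The helper names and the exponential parametrization $w_t \propto (1-\alpha)^{T-t}$ are assumptions made for illustration, as are the values of `T` and `rho`.

```python
def effective_sample_size(w):
    """n_eff(w) = 1 / sum_t w_t^2 for weights on the simplex."""
    return 1.0 / sum(wt * wt for wt in w)


def weighted_drift(w, rho, p=1):
    """D_p(w) = (sum_t w_t (T - t + 1)^p)^(1/p) * rho; w[0] is the oldest period (t = 1)."""
    T = len(w)
    total = sum(wt * (T - t + 1) ** p for t, wt in enumerate(w, start=1))
    return total ** (1.0 / p) * rho


def window_weights(T, s):
    """Uniform weights on the s most recent of T observations."""
    return [0.0] * (T - s) + [1.0 / s] * s


def exponential_weights(T, alpha):
    """Weights proportional to (1 - alpha)^(T - t), normalized to sum to one (assumed form)."""
    raw = [(1.0 - alpha) ** (T - t) for t in range(1, T + 1)]
    z = sum(raw)
    return [r / z for r in raw]


T, rho = 100, 0.01  # hypothetical horizon and per-step drift bound
schemes = {
    "uniform": window_weights(T, T),
    "window s=20": window_weights(T, 20),
    "exponential alpha=0.1": exponential_weights(T, 0.1),
}
for name, w in schemes.items():
    print(name, effective_sample_size(w), weighted_drift(w, rho, p=1))
```

Uniform weights maximize the effective sample size but also carry the largest drift term; concentrating weight on recent observations lowers both.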

3. Finite-Sample Concentration and Nonstationary Robustness

A central technical result is a concentration inequality for Wasserstein distances in the nonstationary, weighted setting:

$$\Pr\bigl[ W_p(\hat P_w, P_{T+1}) \geq \epsilon \bigr] \leq \exp\left( -c_1 n_{\mathrm{eff}}(w) \left( (\epsilon - D_p(w))_+^p - c_2 n_{\mathrm{eff}}(w)^{-q} \right)_+^2 \right)$$

where $c_1, c_2 > 0$ and $q \in (0, 1/2)$ depend on the geometry of $\Xi$ and on $p$ (Keehan et al., 21 Oct 2025). For sufficiently large $\epsilon$,

$$\Pr\bigl[ W_p(\hat P_w, P_{T+1}) \geq \epsilon \bigr] \leq \exp\left( -\frac{c_1}{4} n_{\mathrm{eff}}(w) (\epsilon - D_p(w))^{2p} \right)$$

This quantifies deviations of the empirical process in the presence of cumulative nonstationary drift, explicitly balancing sample variance and drift-induced bias.
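To make the shape of these bounds concrete, the sketch below evaluates both expressions as plain functions. The constants $c_1$, $c_2$, and $q$ are not specified in the text above, so the default values used here are placeholders chosen only for illustration, and the code does not check whether $\epsilon$ is large enough for the simplified form to apply.

```python
import math


def tail_bound(eps, n_eff, drift, p=1, c1=1.0, c2=1.0, q=0.25):
    """Full bound: exp(-c1 * n_eff * ((eps - D)_+^p - c2 * n_eff^(-q))_+^2).

    c1, c2, q are placeholder constants (assumptions, not values from the source).
    """
    gap = max(eps - drift, 0.0) ** p
    inner = max(gap - c2 * n_eff ** (-q), 0.0)
    return math.exp(-c1 * n_eff * inner ** 2)


def simplified_tail_bound(eps, n_eff, drift, p=1, c1=1.0):
    """Simplified bound: exp(-(c1 / 4) * n_eff * (eps - D)^(2p)); vacuous if eps <= D."""
    gap = eps - drift
    if gap <= 0:
        return 1.0
    return math.exp(-(c1 / 4.0) * n_eff * gap ** (2 * p))


# Sliding windows of length s with p = 1: n_eff = s and D_1 = rho * (s + 1) / 2.
rho, eps = 0.01, 0.5  # hypothetical drift bound and accuracy target
for s in (5, 20, 50, 90):
    drift = rho * (s + 1) / 2.0
    print(s, simplified_tail_bound(eps, n_eff=s, drift=drift))
```

Under these placeholder constants the bound first improves as the window grows (more samples) and then deteriorates once the drift term approaches $\epsilon$, which is the bias and variance balance described above.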

4. Optimal Weighting: Variance–Drift Tradeoff

Optimal weights $w^*$ simultaneously control bias due to drift and estimation variance, solving

$$\max_{w \in \Delta_T} n_{\mathrm{eff}}(w) (\epsilon - D_p(w))_+^{2p}$$

The solution has the structure

$$w_t = (c_1' - c_2' (T - t + 1)^p)_+$$

with scalars $c_1', c_2' \geq 0$ determined by simplex constraints. As $p$ grows, the optimal scheme cuts off past (older) data more sharply, reducing to pure sliding-window or exponential-decay weighting depending on parameter choices. Explicit calibrations,

$$s \approx \left\lfloor (2\epsilon/\rho - 1)/3 \right\rfloor,\quad \alpha \approx 3/(\epsilon/\rho+1)$$

arise for windowing (window length $s$) and exponential smoothing (decay rate $\alpha$) in the $p=1$ case, providing optimal parameter choices in terms of the desired accuracy $\epsilon$, drift bound $\rho$, and Wasserstein order $p$ (Keehan et al., 21 Oct 2025).
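The thresholded form of the weights and the two calibrations can be sketched as follows. This is not the optimization procedure of the cited work: here $c_2'$ is treated as a free input and $c_1'$ is found by bisection so that the weights sum to one, which only illustrates the shape of the solution. All function names are hypothetical.

```python
import math


def truncated_power_weights(T, c2, p=1, iters=200):
    """Weights w_t = (c1 - c2 * (T - t + 1)^p)_+ with c1 chosen so that sum_t w_t = 1."""
    ages = [(T - t + 1) ** p for t in range(1, T + 1)]

    def total(c1):
        return sum(max(c1 - c2 * a, 0.0) for a in ages)

    # total(c1) is continuous and nondecreasing, with total(0) = 0 and total(hi) >= 1.
    lo, hi = 0.0, 1.0 + c2 * max(ages)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    w = [max(hi - c2 * a, 0.0) for a in ages]
    z = sum(w)
    return [wt / z for wt in w]  # renormalize to remove bisection error


def window_size(eps, rho):
    """Sliding-window calibration for p = 1: floor((2 * eps / rho - 1) / 3)."""
    return math.floor((2.0 * eps / rho - 1.0) / 3.0)


def decay_rate(eps, rho):
    """Exponential-smoothing calibration for p = 1: 3 / (eps / rho + 1)."""
    return 3.0 / (eps / rho + 1.0)


w = truncated_power_weights(T=10, c2=0.05, p=1)
print([round(wt, 4) for wt in w])  # oldest observations receive zero weight
print(window_size(eps=1.0, rho=0.125), decay_rate(eps=1.0, rho=0.125))
```

Setting `c2=0` recovers uniform weights, while larger values of `c2` or `p` discard more of the older observations.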

5. Cumulative Drift in Dynamic Optimization and Learning

In online convex optimization where the objective distributions $\{\mathbb{P}_t\}$ evolve, the cumulative Wasserstein drift

$$D_T := \sum_{t=1}^{T-1} W_p(\mathbb{P}_t, \mathbb{P}_{t+1})$$

enters directly into dynamic regret bounds, and the sequence of minimizers $x_t^*$ satisfies

$$\sum_{t=1}^{T-1} \| x_{t+1}^* - x_t^* \| \leq C \, D_T$$

for a problem-dependent constant $C$. The corresponding dynamic regret is lower-bounded by an $O(D_T)$ term: cumulative drift sets the intrinsic limit on performance in adapting to distributional changes, while all other regret contributions (noise, initialization) can be controlled through algorithmic parameters (Shames et al., 2020).
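A simple special case shows how the minimizer path length is tied to $D_T$. For the quadratic loss $f(x, \xi) = \tfrac12 (x - \xi)^2$ on the real line, the minimizer of the expected loss is the mean of the distribution, and a difference of means never exceeds $W_1 \le W_p$, so the inequality holds with $C = 1$. The sketch below checks this on synthetic samples; the loss, the Gaussian sampling model, and all names are assumptions chosen for illustration rather than the setting of the cited paper.

```python
import random


def wasserstein_1d(xs, ys, p=1):
    """W_p between 1D empirical measures with equally many, equally weighted atoms."""
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


def path_length_and_drift(T=20, n=200, p=1, seed=0):
    """Return (sum_t |x*_{t+1} - x*_t|, D_T) for drifting 1D empirical measures."""
    rng = random.Random(seed)
    # Hypothetical drift model: mean and spread both change slowly over time.
    samples = [
        [rng.gauss(0.1 * t, 1.0 + 0.05 * t) for _ in range(n)] for t in range(T)
    ]
    # For the quadratic loss, the minimizer at time t is the sample mean.
    minimizers = [sum(s) / n for s in samples]
    path = sum(abs(b - a) for a, b in zip(minimizers, minimizers[1:]))
    drift = sum(wasserstein_1d(a, b, p) for a, b in zip(samples, samples[1:]))
    return path, drift


path, drift = path_length_and_drift()
print(path, drift)
```

The path length can be much smaller than the drift, since changes in spread move the distribution in Wasserstein space without moving the minimizer.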

6. Wasserstein Drift in PDEs, Stochastic Flows, and Markov Chains

The notion of cumulative Wasserstein drift generalizes to continuous-time measure-valued flows:

  • For gradient flows in $P_2(\mathbb{R}^d)$, the analogue of cumulative drift is the path length

$$L(0, T) := \int_0^T \left( \int \|v_t(x)\|^2 \rho_t(x) \, dx \right)^{1/2} dt$$

where $v_t$ is the instantaneous velocity field from the continuity equation and $\rho_t$ is the density at time $t$. The total path length controls convergence rates and is uniformly bounded in terms of the initial suboptimality $F(\rho_0) - \inf F$ of the functional $F$ (Chizat et al., 16 Jul 2025).

  • In measure-valued SPDEs and diffusions, the time integral of instantaneous drift or squared gradient quantifies both cumulative displacement in Wasserstein space and the action or Fisher information over time (Delarue et al., 2024).
  • In discrete-time Markov chains, geometric contractivity plus one-step non-contractive “drift” yields cumulative bounds:

$$W(\mu P^n, \nu P^n) \leq \kappa^n W(\mu, \nu) + \frac{1 - \kappa^n}{1 - \kappa} \delta$$

where $\delta$ is the per-step drift and $\kappa < 1$ is the contraction rate. The second term encodes the aggregated perturbation, that is, the “cumulative drift” of the Markov process (Madras et al., 2011).
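The Markov chain bound above is what one obtains by unrolling a one-step inequality of the form $d_{n+1} \leq \kappa d_n + \delta$, where $d_n$ stands for the Wasserstein distance after $n$ steps. The sketch below, which assumes that one-step inequality holds with equality at every step (the worst case), confirms that the recursion matches the closed form and saturates at $\delta / (1 - \kappa)$. The function names and numerical values are illustrative.

```python
def closed_form_bound(d0, kappa, delta, n):
    """kappa^n * d0 + (1 - kappa^n) / (1 - kappa) * delta, for 0 <= kappa < 1."""
    return kappa ** n * d0 + (1.0 - kappa ** n) / (1.0 - kappa) * delta


def unrolled_bound(d0, kappa, delta, n):
    """Iterate the worst-case one-step recursion d <- kappa * d + delta, n times."""
    d = d0
    for _ in range(n):
        d = kappa * d + delta
    return d


d0, kappa, delta = 2.0, 0.9, 0.05  # hypothetical initial distance, contraction, drift
for n in (1, 10, 50, 200):
    print(n, unrolled_bound(d0, kappa, delta, n), closed_form_bound(d0, kappa, delta, n))
print("limit:", delta / (1.0 - kappa))
```

The initial discrepancy decays geometrically, while the accumulated drift term grows toward its limit, so the long-run bound depends only on $\delta$ and $\kappa$.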

7. Applications and Broader Significance

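Cumulative Wasserstein drift is a central concept in the settings surveyed above: weighted estimation and distributionally robust optimization under nonstationarity, dynamic regret analysis in online optimization, path-length analysis of gradient flows and measure-valued diffusions, and perturbation bounds for Markov chains.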

These frameworks provide precise nonasymptotic characterizations, parameter choices for weighting schemes, and convergence rates that systematically account for nonstationarity and time-varying complexity in modern stochastic systems.
