
Causal Sinkhorn Discrepancy: Theory & Applications

Updated 23 January 2026
  • Causal Sinkhorn Discrepancy (CSD) is an entropy-regularized causal optimal transport metric that enforces non-anticipative constraints between stochastic processes.
  • It employs entropic regularization to yield continuous transport plans, ensuring computational tractability and robust performance in contextual DRO.
  • CSD supports robust optimization through strong dual formulations and continuous Gibbs kernel mixtures, enhancing interpretability and stability in worst-case analysis.

The Causal Sinkhorn Discrepancy (CSD) is an entropy-regularized variant of the causal optimal transport (OT) metric, tailored to enforce both causal consistency and continuous transport plans when comparing stochastic processes or joint probability distributions. Designed to address challenges in contextual distributionally robust optimization (DRO), CSD provides a tractable and interpretable means for quantifying divergence between probability measures while preserving non-anticipative structures essential for modeling causal systems (Zhang et al., 16 Jan 2026).

1. Formal Definition and Mathematical Structure

Let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ be Polish (or normed) spaces. Given two probability measures $\mathbb{P}, \mathbb{Q} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, reference measures $\mu \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$, $\nu_{\mathcal{X}} \in \mathcal{M}(\mathcal{X})$, $\nu_{\mathcal{Y}} \in \mathcal{M}(\mathcal{Y})$, a regularization parameter $\epsilon \ge 0$, and $p \in [1, \infty)$, define the cost:

$$c_p\big((\hat x, \hat y), (x, y)\big) := \|x - \hat x\|^p + \|y - \hat y\|^p.$$

A coupling $\gamma \in \Gamma(\mathbb{P}, \mathbb{Q})$ is causal if, for $((\hat X, \hat Y), (X, Y)) \sim \gamma$, it satisfies $X \perp\!\!\!\perp \hat Y \mid \hat X$. The collection of all such couplings is denoted $\Gamma_c(\mathbb{P}, \mathbb{Q})$.
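As a concrete illustration, the conditional-independence condition $X \perp\!\!\!\perp \hat Y \mid \hat X$ can be checked numerically for discrete couplings. The sketch below (all names and distributions are illustrative, not from the paper) represents a coupling's $(\hat X, \hat Y, X)$-marginal as an array and tests whether $P(X \mid \hat X, \hat Y)$ depends on $\hat Y$; the $Y$ dimension is omitted because the causality condition involves only these three variables:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_causal(gamma, tol=1e-8):
    """gamma: array of shape (n_xh, n_yh, n_x) with gamma.sum() == 1,
    the (X-hat, Y-hat, X)-marginal of a discrete coupling.  Causality
    requires P(X = x | X-hat = xh, Y-hat = yh) to not depend on yh."""
    joint_xh_yh = gamma.sum(axis=2, keepdims=True)       # P(X-hat, Y-hat)
    cond = gamma / np.clip(joint_xh_yh, 1e-300, None)    # P(X | X-hat, Y-hat)
    ref = cond.mean(axis=1, keepdims=True)               # yh-averaged conditional
    return bool(np.max(np.abs(cond - ref)) < tol)

# A causal coupling: X depends on X-hat only
p_xh_yh = rng.dirichlet(np.ones(6)).reshape(3, 2)        # P(X-hat, Y-hat)
p_x_given_xh = rng.dirichlet(np.ones(4), size=3)         # P(X | X-hat)
gamma_causal = p_xh_yh[:, :, None] * p_x_given_xh[:, None, :]

# A non-causal coupling: X depends on both X-hat and Y-hat
p_x_given_xh_yh = rng.dirichlet(np.ones(4), size=(3, 2))
gamma_noncausal = p_xh_yh[:, :, None] * p_x_given_xh_yh

print(is_causal(gamma_causal))      # True
print(is_causal(gamma_noncausal))   # False
```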

The $p$-Causal Sinkhorn Discrepancy $R_p(\mathbb{P}, \mathbb{Q})$ is

$$R_p(\mathbb{P}, \mathbb{Q}) := \Big( \inf_{\gamma \in \Gamma_c(\mathbb{P}, \mathbb{Q})} \Big\{ \mathbb{E}_{((\hat x, \hat y), (x, y)) \sim \gamma}\big[ c_p((\hat x, \hat y), (x, y)) \big] + \epsilon\, H(\gamma \mid \mu \otimes \nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}}) \Big\} \Big)^{1/p},$$

where the relative entropy is

$$H(\gamma \mid \mu \otimes \nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}}) = \mathbb{E}_{\gamma}\Big[ \log \frac{d\gamma}{d(\mu \otimes \nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}})} \Big].$$

For $\epsilon = 0$, this reduces to the causal $p$-Wasserstein distance; relaxing $\Gamma_c$ to $\Gamma$ (dropping the causal restriction) recovers the non-causal Sinkhorn distance.
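In the non-causal limit $\Gamma_c \to \Gamma$, the entropic objective can be approximated on empirical samples with standard Sinkhorn iterations. The sketch below is a toy illustration with assumed Gaussian samples and uniform weights; it omits the causal constraint, so it computes the ordinary entropic transport plan rather than the CSD itself:

```python
import numpy as np

rng = np.random.default_rng(1)
p, eps = 2, 1.0

# empirical (x, y) samples from two distributions (illustrative data)
xh, yh = rng.normal(size=(20, 3)), rng.normal(size=(20, 1))
x, y = rng.normal(size=(25, 3)), rng.normal(size=(25, 1))

# pairwise cost c_p((xh, yh), (x, y)) = ||x - xh||^p + ||y - yh||^p
C = (np.linalg.norm(xh[:, None] - x[None], axis=2) ** p
     + np.linalg.norm(yh[:, None] - y[None], axis=2) ** p)

a = np.full(20, 1 / 20)            # uniform weights on the samples
b = np.full(25, 1 / 25)
K = np.exp(-C / eps)               # Gibbs kernel

u, v = np.ones(20), np.ones(25)
for _ in range(1000):              # Sinkhorn fixed-point iterations
    u = a / (K @ v)
    v = b / (K.T @ u)

gamma = u[:, None] * K * v[None, :]          # entropic transport plan
print("transport cost:", float((gamma * C).sum()))
```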

2. Causal Consistency and Entropic Regularization

The CSD distinguishes itself from the standard Sinkhorn discrepancy (a purely entropic OT metric) by enforcing causal consistency. The standard Sinkhorn distance over $\Gamma(\mathbb{P}, \mathbb{Q})$ allows couplings to “anticipate” future outcomes (i.e., $X$ may depend on $\hat Y$), potentially violating natural temporal or informational orderings. In contrast, CSD restricts attention to $\Gamma_c(\mathbb{P}, \mathbb{Q})$, where, conditional on the past covariates $\hat X$, the future covariates $X$ must not depend on the unseen outcomes $\hat Y$. This ensures the resulting transport plans do not introduce feedback loops or causally implausible information flows. In the context of DRO, this yields ambiguity sets that exclude alternatives in which future decisions would be made with foreknowledge of future uncertainties [(Zhang et al., 16 Jan 2026), Definition 2.2 and Figure 1.2].

Entropic regularization ($\epsilon > 0$) compels the solution to be a continuous, rather than purely discrete, coupling, enhancing the smoothness and computational tractability of the resulting discrepancy metric.

3. Duality, Strong Dual Formulation, and Optimization

Given an empirical distribution $\hat{\mathbb{P}} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, the robust optimization problem

$$v_P := \max_{\substack{\mathbb{P} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) \\ R_p(\hat{\mathbb{P}}, \mathbb{P})^p \le \rho^p}} \mathbb{E}_{(x, y) \sim \mathbb{P}}\big[\Psi(f(x), y)\big]$$

admits a strong dual formulation:

$$v_D = \inf_{\lambda \ge 0} \Bigg\{ \lambda \rho^p + \mathbb{E}_{\hat x \sim \hat{\mathbb{P}}_x}\bigg[ \lambda \epsilon \log \int_{\mathcal{X}} \exp\Big(\frac{g(\hat x, x; \lambda)}{\lambda \epsilon}\Big)\, d\nu_{\mathcal{X}}(x) \bigg] \Bigg\},$$

where

$$g(\hat x, x; \lambda) = \mathbb{E}_{\hat y \sim \hat{\mathbb{P}}_{y \mid x = \hat x}}\bigg[ \lambda \epsilon \log \int_{\mathcal{Y}} \exp\Big(\frac{\Psi(f(x), y) - \lambda c_p((\hat x, \hat y), (x, y))}{\lambda \epsilon}\Big)\, d\nu_{\mathcal{Y}}(y) \bigg].$$

Under mild conditions, strong duality holds ($v_P = v_D$) and the infimum is attained at a unique $\lambda^* > 0$, by strict convexity in $\lambda$ [(Zhang et al., 16 Jan 2026), Theorem 3.1].
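The nested log-integral structure of the dual can be estimated by plain Monte Carlo for a fixed $\lambda$. The sketch below makes purely illustrative assumptions not fixed by the paper: scalar variables, $\nu_{\mathcal{X}} = \nu_{\mathcal{Y}} = \mathrm{Uniform}[0,1]$, squared loss $\Psi(f(x), y) = (f(x) - y)^2$ with $f(x) = x$, and $p = 2$. With paired empirical samples $(\hat x_i, \hat y_i)$, the conditional $\hat y \mid \hat x = \hat x_i$ is a point mass at $\hat y_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
eps, p, rho = 0.5, 2, 0.1

def log_mean_exp(a, axis):
    # numerically stable log of the mean of exp(a) along an axis
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def dual_objective(lam, xh, yh, n_mc=200):
    """MC estimate of lam*rho^p + E[lam*eps*log int exp(g/(lam*eps)) dnu_X]."""
    le = lam * eps
    xs = rng.uniform(size=n_mc)               # MC draws from nu_X
    ys = rng.uniform(size=n_mc)               # MC draws from nu_Y
    # array shapes: (sample index i, x-draw, y-draw)
    psi = (xs[None, :, None] - ys[None, None, :]) ** 2        # Psi(f(x), y), f(x)=x
    cost = (np.abs(xs[None, :, None] - xh[:, None, None]) ** p
            + np.abs(ys[None, None, :] - yh[:, None, None]) ** p)
    g = le * log_mean_exp((psi - lam * cost) / le, axis=2)    # g(xh_i, x; lam)
    per_xh = le * log_mean_exp(g / le, axis=1)                # outer log-integral
    return lam * rho ** p + per_xh.mean()

xh = rng.uniform(size=10)
yh = xh + 0.1 * rng.normal(size=10)
for lam in (0.5, 1.0, 2.0, 5.0):
    print(f"lambda={lam:.1f}  dual estimate={dual_objective(lam, xh, yh):.4f}")
```

A solver for $v_D$ would then minimize this estimate over $\lambda > 0$; the estimate at any fixed $\lambda$ upper-bounds $v_P$ by weak duality, up to Monte Carlo error.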

4. Structure of the Worst-Case Distribution

If $\lambda^* > 0$ solves the dual problem, the worst-case distribution $\mathbb{P}^*$ achieving $v_P$ has density

$$\frac{d\mathbb{P}^*(x, y)}{d(\nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}})} = \mathbb{E}_{(\hat x, \hat y) \sim \hat{\mathbb{P}}}\Big[ \alpha(\hat x)\, \beta(\hat x, \hat y; x)\, \exp\big\{ r(\hat x, x) + s(\hat x, \hat y; x, y) \big\} \Big],$$

with
$$\begin{aligned}
r(\hat x, x) &:= \mathbb{E}_{\hat y \mid \hat x}\Big[ \log \int_{\mathcal{Y}} \exp\big(s(\hat x, \hat y; x, y)\big)\, d\nu_{\mathcal{Y}}(y) \Big],\\
s(\hat x, \hat y; x, y) &:= \frac{\Psi(f(x), y) - \lambda^* c_p((\hat x, \hat y), (x, y))}{\lambda^* \epsilon},\\
\alpha(\hat x) &:= \Big[ \int_{\mathcal{X}} \exp\big(r(\hat x, x')\big)\, d\nu_{\mathcal{X}}(x') \Big]^{-1},\\
\beta(\hat x, \hat y; x) &:= \Big[ \int_{\mathcal{Y}} \exp\big(s(\hat x, \hat y; x, y')\big)\, d\nu_{\mathcal{Y}}(y') \Big]^{-1}.
\end{aligned}$$
This density is a continuous mixture of Gibbs kernels, in contrast to the discrete support typically arising in non-regularized or non-causal OT settings [(Zhang et al., 16 Jan 2026), Theorem 3.2].
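A quick numerical sanity check: since $\alpha$ and $\beta$ are normalizing constants, the Gibbs-mixture density should integrate to 1 against $\nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}}$. The sketch below verifies this by Monte Carlo under illustrative assumptions (scalar variables, $\nu_{\mathcal{X}} = \nu_{\mathcal{Y}} = \mathrm{Uniform}[0,1]$, squared loss with $f(x) = x$, $p = 2$, and an arbitrary fixed $\lambda^* = 2$ rather than the true dual optimizer). With paired empirical samples, $\hat y \mid \hat x = \hat x_i$ is a point mass at $\hat y_i$, so $\beta$ exactly cancels $\exp(r)$; both are kept explicit to mirror the formulas:

```python
import numpy as np

rng = np.random.default_rng(4)
eps, p, lam = 0.5, 2, 2.0

xh = rng.uniform(size=8)                  # empirical x-hat samples
yh = xh + 0.1 * rng.normal(size=8)        # paired y-hat samples
xmc = rng.uniform(size=400)               # MC draws from nu_X
ymc = rng.uniform(size=400)               # MC draws from nu_Y

def s(xh_i, yh_i, x, y):                  # s(xh, yh; x, y) from above
    psi = (x - y) ** 2                    # Psi(f(x), y) with f(x) = x
    cost = np.abs(x - xh_i) ** p + np.abs(y - yh_i) ** p
    return (psi - lam * cost) / (lam * eps)

# alpha(xh_i) = [int exp(r(xh_i, x')) dnu_X(x')]^{-1}, r estimated by MC
alphas = []
for xh_i, yh_i in zip(xh, yh):
    r_all = np.log(np.exp(s(xh_i, yh_i, xmc[:, None], ymc[None, :])).mean(axis=1))
    alphas.append(1.0 / np.exp(r_all).mean())

def density(x, y):                        # dP*/d(nu_X x nu_Y) at (x, y)
    out = 0.0
    for a_i, xh_i, yh_i in zip(alphas, xh, yh):
        r_x = np.log(np.exp(s(xh_i, yh_i, x, ymc)).mean())   # r(xh_i, x)
        beta = 1.0 / np.exp(r_x)          # beta = exp(-r) for point-mass y-hat
        out += a_i * beta * np.exp(r_x + s(xh_i, yh_i, x, y))
    return out / len(xh)

pts = rng.uniform(size=(500, 2))          # MC check of total mass
mass = float(np.mean([density(px, py) for px, py in pts]))
print("total mass:", round(mass, 3))      # close to 1 up to MC error
```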

5. Theoretical Properties: Metricity, Regularity, and Optimization

  • Metric properties and continuity: $R_p(\cdot, \cdot)$ is a true metric on $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ under causal couplings. The entropic regularization ($\epsilon > 0$) ensures strong convexity and smoothness in the dual formulation, providing stability and numerical tractability.
  • Tractability: The dual reformulation reduces the CSD evaluation to a single-level minimization over $\lambda$ involving expectations of log-sum-exp functions. For parametric decision rules $f_\theta$, the outer min-max problem can be expressed as a multi-level stochastic compositional optimization.
  • Optimization methods: Sample-Average Approximation (SAA) achieves $\delta$-optimality with $O(\delta^{-2})$ Monte Carlo samples, and a Stochastically Corrected Stochastic Compositional (SCSC) gradient method reaches an $\varepsilon$-stationary point in $O(\varepsilon^{-4})$ gradient iterations, matching lower complexity bounds for nonconvex stochastic optimization [(Zhang et al., 16 Jan 2026), Propositions 4.2, 4.3; Theorems 5.1, 5.2].
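The single-level structure of the dual minimization can be illustrated with a schematic SAA: draw a fixed sample of inner scores, form the sample-average log-sum-exp objective, and minimize over $\lambda$. This is a deliberate simplification, not the paper's SCSC method: the synthetic scores $S$ stand in for $\Psi - \lambda c_p$ but are held fixed in $\lambda$, and grid search stands in for a proper solver:

```python
import numpy as np

rng = np.random.default_rng(3)
eps, rho, p = 0.5, 0.2, 2
S = rng.normal(size=(200, 100))        # SAA sample of synthetic inner scores

def h(lam):
    """Sample-average of lam*rho^p + lam*eps*log-mean-exp(S/(lam*eps))."""
    le = lam * eps
    a = S / le
    m = a.max(axis=1, keepdims=True)                      # stabilized log-mean-exp
    lme = m.squeeze(1) + np.log(np.exp(a - m).mean(axis=1))
    return lam * rho ** p + (le * lme).mean()

lams = np.linspace(0.1, 10, 200)       # grid search over lambda
vals = np.array([h(l) for l in lams])
lam_star = float(lams[int(vals.argmin())])
print("approximate minimizer lambda:", round(lam_star, 2))
```

The objective trades off the radius term $\lambda \rho^p$, which grows with $\lambda$, against the soft-max term, which shrinks toward a sample mean as $\lambda$ grows, so the minimizer sits in the interior of the grid.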

6. Causal Sinkhorn Discrepancy in Contextual DRO

CSD enables the formulation of ambiguity sets in contextual DRO that respect both entropy-induced continuity and causal transport restrictions. Given the empirical joint law $\hat{\mathbb{P}}$, the ambiguity set is

$$\mathcal{A}_\rho := \big\{ \mathbb{P} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) : R_p(\hat{\mathbb{P}}, \mathbb{P})^p \le \rho^p \big\}.$$

This set excludes (i) discrete couplings when $\epsilon > 0$ (thus enforcing continuity), and (ii) causally implausible transports (enforcing $X \perp\!\!\!\perp \hat Y \mid \hat X$). The resulting contextual DRO problem is:

$$\inf_{f \in \mathcal{F}} \sup_{\mathbb{P} \in \mathcal{A}_\rho} \mathbb{E}_{(x, y) \sim \mathbb{P}}\big[\Psi(f(x), y)\big].$$

This structure ensures that worst-case analysis never incorporates information unavailable under natural causal orderings and that predictions/policies are robust to plausible distributional perturbations consistent with the observed data-generating process [(Zhang et al., 16 Jan 2026), Section 6].
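The inf-sup structure above can be made concrete with a deliberately crude finite approximation: replace $\mathcal{A}_\rho$ by a handful of perturbed candidate laws and optimize a linear rule $f_\theta(x) = \theta x$ against the worst case. All names and numbers below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(size=n)
base = 1.5 * x + 0.1 * rng.normal(size=n)
ys = [base + shift for shift in (0.0, 0.2, -0.2)]    # candidate outcome laws

def worst_case_risk(theta):                          # sup over the finite family
    return max(float(np.mean((theta * x - y) ** 2)) for y in ys)

thetas = np.linspace(0.0, 3.0, 301)                  # inf via grid search
risks = [worst_case_risk(t) for t in thetas]
theta_star = float(thetas[int(np.argmin(risks))])
print("robust theta:", round(theta_star, 2))
```

The robust slope lands between the best responses to the individual candidates, illustrating how the worst-case layer pulls the decision toward a compromise across plausible perturbations.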

7. Summary Table: Key Distinctions

| Property | Standard Sinkhorn | Causal Sinkhorn Discrepancy (CSD) |
| --- | --- | --- |
| Coupling constraint | All couplings ($\Gamma$) | Causal couplings ($\Gamma_c$) |
| Feedback allowed | Yes | No (enforces $X \perp\!\!\!\perp \hat Y \mid \hat X$) |
| Entropic regularization | Optional | Required for continuity and tractability |
| Metricity | Non-causal OT metric | Causal metric on $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ |
| Worst-case distribution | Discrete or continuous | Continuous mixture of Gibbs kernels |

The Causal Sinkhorn Discrepancy is a foundational tool for robust, interpretable, and causally consistent optimization in modern machine learning and statistics (Zhang et al., 16 Jan 2026).
