
Causal Sinkhorn Discrepancy: Theory & Applications

Updated 23 January 2026
  • Causal Sinkhorn Discrepancy (CSD) is an entropy-regularized causal optimal transport metric that enforces non-anticipative constraints between stochastic processes.
  • It employs entropic regularization to yield continuous transport plans, ensuring computational tractability and robust performance in contextual DRO.
  • CSD supports robust optimization through strong dual formulations and continuous Gibbs kernel mixtures, enhancing interpretability and stability in worst-case analysis.

The Causal Sinkhorn Discrepancy (CSD) is an entropy-regularized variant of the causal optimal transport (OT) metric, tailored to enforce both causal consistency and continuous transport plans when comparing stochastic processes or joint probability distributions. Designed to address challenges in contextual distributionally robust optimization (DRO), CSD provides a tractable and interpretable means for quantifying divergence between probability measures while preserving non-anticipative structures essential for modeling causal systems (Zhang et al., 16 Jan 2026).

1. Formal Definition and Mathematical Structure

Let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ be Polish (or normed) spaces. Given two probability measures $\mathbb{P}, \mathbb{Q} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, reference measures $\mu \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$, $\nu_{\mathcal{X}} \in \mathcal{M}(\mathcal{X})$, $\nu_{\mathcal{Y}} \in \mathcal{M}(\mathcal{Y})$, a regularization parameter $\epsilon \ge 0$, and $p \in [1, \infty)$, define the cost:

$$c_p\big((\hat x, \hat y), (x, y)\big) := \|x - \hat x\|^p + \|y - \hat y\|^p.$$

A coupling $\gamma \in \Gamma(\mathbb{P}, \mathbb{Q})$ is causal if, for $((\hat X, \hat Y), (X, Y)) \sim \gamma$, it satisfies $X \perp\!\!\!\perp \hat Y \mid \hat X$. The collection of all such couplings is denoted $\Gamma_c(\mathbb{P}, \mathbb{Q})$.
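As a concrete illustration, the conditional-independence condition $X \perp\!\!\!\perp \hat Y \mid \hat X$ can be checked numerically for discrete couplings. The sketch below (all names and distributions are illustrative, not from the paper) represents a coupling's $(\hat X, \hat Y, X)$-marginal as an array and tests whether $P(X \mid \hat X, \hat Y)$ depends on $\hat Y$; the $Y$ dimension is omitted because the causality condition involves only these three variables:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_causal(gamma, tol=1e-8):
    """gamma: array of shape (n_xh, n_yh, n_x) with gamma.sum() == 1,
    the (X-hat, Y-hat, X)-marginal of a discrete coupling.  Causality
    requires P(X = x | X-hat = xh, Y-hat = yh) to not depend on yh."""
    joint_xh_yh = gamma.sum(axis=2, keepdims=True)       # P(X-hat, Y-hat)
    cond = gamma / np.clip(joint_xh_yh, 1e-300, None)    # P(X | X-hat, Y-hat)
    ref = cond.mean(axis=1, keepdims=True)               # yh-averaged conditional
    return bool(np.max(np.abs(cond - ref)) < tol)

# A causal coupling: X depends on X-hat only
p_xh_yh = rng.dirichlet(np.ones(6)).reshape(3, 2)        # P(X-hat, Y-hat)
p_x_given_xh = rng.dirichlet(np.ones(4), size=3)         # P(X | X-hat)
gamma_causal = p_xh_yh[:, :, None] * p_x_given_xh[:, None, :]

# A non-causal coupling: X depends on both X-hat and Y-hat
p_x_given_xh_yh = rng.dirichlet(np.ones(4), size=(3, 2))
gamma_noncausal = p_xh_yh[:, :, None] * p_x_given_xh_yh

print(is_causal(gamma_causal))      # True
print(is_causal(gamma_noncausal))   # False
```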

The $p$-Causal Sinkhorn Discrepancy $R_p(\mathbb{P}, \mathbb{Q})$ is

$$R_p(\mathbb{P}, \mathbb{Q}) := \Big( \inf_{\gamma \in \Gamma_c(\mathbb{P}, \mathbb{Q})} \Big\{ \mathbb{E}_{((\hat x, \hat y), (x, y)) \sim \gamma}\big[ c_p((\hat x, \hat y), (x, y)) \big] + \epsilon\, H(\gamma \mid \mu \otimes \nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}}) \Big\} \Big)^{1/p},$$

where the relative entropy is

$$H(\gamma \mid \mu \otimes \nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}}) = \mathbb{E}_{\gamma}\Big[ \log \frac{d\gamma}{d(\mu \otimes \nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}})} \Big].$$

For $\epsilon = 0$, this reduces to the causal $p$-Wasserstein distance; relaxing $\Gamma_c$ to $\Gamma$ (dropping the causal restriction) recovers the non-causal Sinkhorn distance.
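In the non-causal limit $\Gamma_c \to \Gamma$, the entropic objective can be approximated on empirical samples with standard Sinkhorn iterations. The sketch below is a toy illustration with assumed Gaussian samples and uniform weights; it omits the causal constraint, so it computes the ordinary entropic transport plan rather than the CSD itself:

```python
import numpy as np

rng = np.random.default_rng(1)
p, eps = 2, 1.0

# empirical (x, y) samples from two distributions (illustrative data)
xh, yh = rng.normal(size=(20, 3)), rng.normal(size=(20, 1))
x, y = rng.normal(size=(25, 3)), rng.normal(size=(25, 1))

# pairwise cost c_p((xh, yh), (x, y)) = ||x - xh||^p + ||y - yh||^p
C = (np.linalg.norm(xh[:, None] - x[None], axis=2) ** p
     + np.linalg.norm(yh[:, None] - y[None], axis=2) ** p)

a = np.full(20, 1 / 20)            # uniform weights on the samples
b = np.full(25, 1 / 25)
K = np.exp(-C / eps)               # Gibbs kernel

u, v = np.ones(20), np.ones(25)
for _ in range(1000):              # Sinkhorn fixed-point iterations
    u = a / (K @ v)
    v = b / (K.T @ u)

gamma = u[:, None] * K * v[None, :]          # entropic transport plan
print("transport cost:", float((gamma * C).sum()))
```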

2. Causal Consistency and Entropic Regularization

The CSD distinguishes itself from the standard Sinkhorn discrepancy (a purely entropic OT metric) by enforcing causal consistency. The standard Sinkhorn distance over $\Gamma(\mathbb{P}, \mathbb{Q})$ allows couplings to “anticipate” future outcomes (i.e., $X$ may depend on $\hat Y$), potentially violating natural temporal or informational orderings. In contrast, CSD restricts attention to $\Gamma_c(\mathbb{P}, \mathbb{Q})$, where, conditional on the past covariates $\hat X$, the future covariates $X$ must not depend on the unseen outcomes $\hat Y$. This ensures the resulting transport plans do not introduce feedback loops or causally implausible information flows. In the context of DRO, this yields ambiguity sets that exclude alternatives in which future decisions would be made with foreknowledge of future uncertainties [(Zhang et al., 16 Jan 2026), Definition 2.2 and Figure 1.2].

Entropic regularization ($\epsilon > 0$) compels the solution to be a continuous, rather than purely discrete, coupling, enhancing the smoothness and computational tractability of the resulting discrepancy metric.

3. Duality, Strong Dual Formulation, and Optimization

Given an empirical distribution $\hat{\mathbb{P}} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, the robust optimization problem

$$v_P := \max_{\substack{\mathbb{P} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) \\ R_p(\hat{\mathbb{P}}, \mathbb{P})^p \le \rho^p}} \mathbb{E}_{(x, y) \sim \mathbb{P}}\big[\Psi(f(x), y)\big]$$

admits a strong dual formulation:

$$v_D = \inf_{\lambda \ge 0} \Bigg\{ \lambda \rho^p + \mathbb{E}_{\hat x \sim \hat{\mathbb{P}}_x}\bigg[ \lambda \epsilon \log \int_{\mathcal{X}} \exp\Big(\frac{g(\hat x, x; \lambda)}{\lambda \epsilon}\Big)\, d\nu_{\mathcal{X}}(x) \bigg] \Bigg\},$$

where

$$g(\hat x, x; \lambda) = \mathbb{E}_{\hat y \sim \hat{\mathbb{P}}_{y \mid x = \hat x}}\bigg[ \lambda \epsilon \log \int_{\mathcal{Y}} \exp\Big(\frac{\Psi(f(x), y) - \lambda c_p((\hat x, \hat y), (x, y))}{\lambda \epsilon}\Big)\, d\nu_{\mathcal{Y}}(y) \bigg].$$

Under mild conditions, strong duality holds ($v_P = v_D$) and the infimum is attained at a unique $\lambda^* > 0$, by strict convexity in $\lambda$ [(Zhang et al., 16 Jan 2026), Theorem 3.1].
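The nested log-integral structure of the dual can be estimated by plain Monte Carlo for a fixed $\lambda$. The sketch below makes purely illustrative assumptions not fixed by the paper: scalar variables, $\nu_{\mathcal{X}} = \nu_{\mathcal{Y}} = \mathrm{Uniform}[0,1]$, squared loss $\Psi(f(x), y) = (f(x) - y)^2$ with $f(x) = x$, and $p = 2$. With paired empirical samples $(\hat x_i, \hat y_i)$, the conditional $\hat y \mid \hat x = \hat x_i$ is a point mass at $\hat y_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
eps, p, rho = 0.5, 2, 0.1

def log_mean_exp(a, axis):
    # numerically stable log of the mean of exp(a) along an axis
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def dual_objective(lam, xh, yh, n_mc=200):
    """MC estimate of lam*rho^p + E[lam*eps*log int exp(g/(lam*eps)) dnu_X]."""
    le = lam * eps
    xs = rng.uniform(size=n_mc)               # MC draws from nu_X
    ys = rng.uniform(size=n_mc)               # MC draws from nu_Y
    # array shapes: (sample index i, x-draw, y-draw)
    psi = (xs[None, :, None] - ys[None, None, :]) ** 2        # Psi(f(x), y), f(x)=x
    cost = (np.abs(xs[None, :, None] - xh[:, None, None]) ** p
            + np.abs(ys[None, None, :] - yh[:, None, None]) ** p)
    g = le * log_mean_exp((psi - lam * cost) / le, axis=2)    # g(xh_i, x; lam)
    per_xh = le * log_mean_exp(g / le, axis=1)                # outer log-integral
    return lam * rho ** p + per_xh.mean()

xh = rng.uniform(size=10)
yh = xh + 0.1 * rng.normal(size=10)
for lam in (0.5, 1.0, 2.0, 5.0):
    print(f"lambda={lam:.1f}  dual estimate={dual_objective(lam, xh, yh):.4f}")
```

A solver for $v_D$ would then minimize this estimate over $\lambda > 0$; the estimate at any fixed $\lambda$ upper-bounds $v_P$ by weak duality, up to Monte Carlo error.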

4. Structure of the Worst-Case Distribution

If $\lambda^* > 0$ solves the dual problem, the worst-case distribution $\mathbb{P}^*$ achieving $v_P$ has density

$$\frac{d\mathbb{P}^*(x, y)}{d(\nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}})} = \mathbb{E}_{(\hat x, \hat y) \sim \hat{\mathbb{P}}}\Big[ \alpha(\hat x)\, \beta(\hat x, \hat y; x)\, \exp\big\{ r(\hat x, x) + s(\hat x, \hat y; x, y) \big\} \Big],$$

with
$$\begin{aligned}
r(\hat x, x) &:= \mathbb{E}_{\hat y \mid \hat x}\Big[ \log \int_{\mathcal{Y}} \exp\big(s(\hat x, \hat y; x, y)\big)\, d\nu_{\mathcal{Y}}(y) \Big],\\
s(\hat x, \hat y; x, y) &:= \frac{\Psi(f(x), y) - \lambda^* c_p((\hat x, \hat y), (x, y))}{\lambda^* \epsilon},\\
\alpha(\hat x) &:= \Big[ \int_{\mathcal{X}} \exp\big(r(\hat x, x')\big)\, d\nu_{\mathcal{X}}(x') \Big]^{-1},\\
\beta(\hat x, \hat y; x) &:= \Big[ \int_{\mathcal{Y}} \exp\big(s(\hat x, \hat y; x, y')\big)\, d\nu_{\mathcal{Y}}(y') \Big]^{-1}.
\end{aligned}$$
This density is a continuous mixture of Gibbs kernels, in contrast to the discrete support typically arising in non-regularized or non-causal OT settings [(Zhang et al., 16 Jan 2026), Theorem 3.2].
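A quick numerical sanity check: since $\alpha$ and $\beta$ are normalizing constants, the Gibbs-mixture density should integrate to 1 against $\nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}}$. The sketch below verifies this by Monte Carlo under illustrative assumptions (scalar variables, $\nu_{\mathcal{X}} = \nu_{\mathcal{Y}} = \mathrm{Uniform}[0,1]$, squared loss with $f(x) = x$, $p = 2$, and an arbitrary fixed $\lambda^* = 2$ rather than the true dual optimizer). With paired empirical samples, $\hat y \mid \hat x = \hat x_i$ is a point mass at $\hat y_i$, so $\beta$ exactly cancels $\exp(r)$; both are kept explicit to mirror the formulas:

```python
import numpy as np

rng = np.random.default_rng(4)
eps, p, lam = 0.5, 2, 2.0

xh = rng.uniform(size=8)                  # empirical x-hat samples
yh = xh + 0.1 * rng.normal(size=8)        # paired y-hat samples
xmc = rng.uniform(size=400)               # MC draws from nu_X
ymc = rng.uniform(size=400)               # MC draws from nu_Y

def s(xh_i, yh_i, x, y):                  # s(xh, yh; x, y) from above
    psi = (x - y) ** 2                    # Psi(f(x), y) with f(x) = x
    cost = np.abs(x - xh_i) ** p + np.abs(y - yh_i) ** p
    return (psi - lam * cost) / (lam * eps)

# alpha(xh_i) = [int exp(r(xh_i, x')) dnu_X(x')]^{-1}, r estimated by MC
alphas = []
for xh_i, yh_i in zip(xh, yh):
    r_all = np.log(np.exp(s(xh_i, yh_i, xmc[:, None], ymc[None, :])).mean(axis=1))
    alphas.append(1.0 / np.exp(r_all).mean())

def density(x, y):                        # dP*/d(nu_X x nu_Y) at (x, y)
    out = 0.0
    for a_i, xh_i, yh_i in zip(alphas, xh, yh):
        r_x = np.log(np.exp(s(xh_i, yh_i, x, ymc)).mean())   # r(xh_i, x)
        beta = 1.0 / np.exp(r_x)          # beta = exp(-r) for point-mass y-hat
        out += a_i * beta * np.exp(r_x + s(xh_i, yh_i, x, y))
    return out / len(xh)

pts = rng.uniform(size=(500, 2))          # MC check of total mass
mass = float(np.mean([density(px, py) for px, py in pts]))
print("total mass:", round(mass, 3))      # close to 1 up to MC error
```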

5. Theoretical Properties: Metricity, Regularity, and Optimization

  • Metric properties and continuity: $R_p(\cdot, \cdot)$ is a true metric on $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ under causal couplings. The entropic regularization ($\epsilon > 0$) ensures strong convexity and smoothness in the dual formulation, providing stability and numerical tractability.
  • Tractability: The dual reformulation reduces the CSD evaluation to a single-level minimization over $\lambda$ involving expectations of log-sum-exp functions. For parametric decision rules $f_\theta$, the outer min-max problem can be expressed as a multi-level stochastic compositional optimization.
  • Optimization methods: Sample-Average Approximation (SAA) achieves $\delta$-optimality with $O(\delta^{-2})$ Monte Carlo samples, and a Stochastically Corrected Stochastic Compositional (SCSC) gradient method reaches an $\varepsilon$-stationary point in $O(\varepsilon^{-4})$ gradient iterations, matching lower complexity bounds for nonconvex stochastic optimization [(Zhang et al., 16 Jan 2026), Propositions 4.2, 4.3; Theorems 5.1, 5.2].
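The single-level structure of the dual minimization can be illustrated with a schematic SAA: draw a fixed sample of inner scores, form the sample-average log-sum-exp objective, and minimize over $\lambda$. This is a deliberate simplification, not the paper's SCSC method: the synthetic scores $S$ stand in for $\Psi - \lambda c_p$ but are held fixed in $\lambda$, and grid search stands in for a proper solver:

```python
import numpy as np

rng = np.random.default_rng(3)
eps, rho, p = 0.5, 0.2, 2
S = rng.normal(size=(200, 100))        # SAA sample of synthetic inner scores

def h(lam):
    """Sample-average of lam*rho^p + lam*eps*log-mean-exp(S/(lam*eps))."""
    le = lam * eps
    a = S / le
    m = a.max(axis=1, keepdims=True)                      # stabilized log-mean-exp
    lme = m.squeeze(1) + np.log(np.exp(a - m).mean(axis=1))
    return lam * rho ** p + (le * lme).mean()

lams = np.linspace(0.1, 10, 200)       # grid search over lambda
vals = np.array([h(l) for l in lams])
lam_star = float(lams[int(vals.argmin())])
print("approximate minimizer lambda:", round(lam_star, 2))
```

The objective trades off the radius term $\lambda \rho^p$, which grows with $\lambda$, against the soft-max term, which shrinks toward a sample mean as $\lambda$ grows, so the minimizer sits in the interior of the grid.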

6. Causal Sinkhorn Discrepancy in Contextual DRO

CSD enables the formulation of ambiguity sets in contextual DRO that respect both entropy-induced continuity and causal transport restrictions. Given the empirical joint law $\hat{\mathbb{P}}$, the ambiguity set is

$$\mathcal{A}_\rho := \big\{ \mathbb{P} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) : R_p(\hat{\mathbb{P}}, \mathbb{P})^p \le \rho^p \big\}.$$

This set excludes (i) discrete couplings when $\epsilon > 0$ (thus enforcing continuity), and (ii) causally implausible transports (enforcing $X \perp\!\!\!\perp \hat Y \mid \hat X$). The resulting contextual DRO problem is:

$$\inf_{f \in \mathcal{F}} \sup_{\mathbb{P} \in \mathcal{A}_\rho} \mathbb{E}_{(x, y) \sim \mathbb{P}}\big[\Psi(f(x), y)\big].$$

This structure ensures that worst-case analysis never incorporates information unavailable under natural causal orderings and that predictions/policies are robust to plausible distributional perturbations consistent with the observed data-generating process [(Zhang et al., 16 Jan 2026), Section 6].
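The inf-sup structure above can be made concrete with a deliberately crude finite approximation: replace $\mathcal{A}_\rho$ by a handful of perturbed candidate laws and optimize a linear rule $f_\theta(x) = \theta x$ against the worst case. All names and numbers below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(size=n)
base = 1.5 * x + 0.1 * rng.normal(size=n)
ys = [base + shift for shift in (0.0, 0.2, -0.2)]    # candidate outcome laws

def worst_case_risk(theta):                          # sup over the finite family
    return max(float(np.mean((theta * x - y) ** 2)) for y in ys)

thetas = np.linspace(0.0, 3.0, 301)                  # inf via grid search
risks = [worst_case_risk(t) for t in thetas]
theta_star = float(thetas[int(np.argmin(risks))])
print("robust theta:", round(theta_star, 2))
```

The robust slope lands between the best responses to the individual candidates, illustrating how the worst-case layer pulls the decision toward a compromise across plausible perturbations.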

7. Summary Table: Key Distinctions

| Property | Standard Sinkhorn | Causal Sinkhorn Discrepancy (CSD) |
| --- | --- | --- |
| Coupling constraint | All couplings ($\Gamma$) | Causal couplings ($\Gamma_c$) |
| Feedback allowed | Yes | No (enforces $X \perp\!\!\!\perp \hat Y \mid \hat X$) |
| Entropic regularization | Optional | Required for continuity and tractability |
| Metricity | Non-causal OT metric | Causal metric on $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ |
| Worst-case distribution | Discrete or continuous | Continuous mixture of Gibbs kernels |

The Causal Sinkhorn Discrepancy is a foundational tool for robust, interpretable, and causally consistent optimization in modern machine learning and statistics (Zhang et al., 16 Jan 2026).
