Causal Sinkhorn Discrepancy: Theory & Applications
- Causal Sinkhorn Discrepancy (CSD) is an entropy-regularized causal optimal transport metric that enforces non-anticipative constraints between stochastic processes.
- It employs entropic regularization to yield continuous transport plans, ensuring computational tractability and robust performance in contextual DRO.
- CSD supports robust optimization through strong dual formulations and continuous Gibbs kernel mixtures, enhancing interpretability and stability in worst-case analysis.
The Causal Sinkhorn Discrepancy (CSD) is an entropy-regularized variant of the causal optimal transport (OT) metric, tailored to enforce both causal consistency and continuous transport plans when comparing stochastic processes or joint probability distributions. Designed to address challenges in contextual distributionally robust optimization (DRO), CSD provides a tractable and interpretable means for quantifying divergence between probability measures while preserving non-anticipative structures essential for modeling causal systems (Zhang et al., 16 Jan 2026).
1. Formal Definition and Mathematical Structure
Let $\mathcal{X}$ and $\mathcal{Y}$ be Polish (or normed) spaces. Given two probability measures $\mu, \nu \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, reference measures $\nu_{\mathcal{X}}$ on $\mathcal{X}$ and $\nu_{\mathcal{Y}}$ on $\mathcal{Y}$, a regularization parameter $\epsilon \geq 0$, and an order $p \geq 1$, define the cost
$$c_p\big((\hat x, \hat y), (x, y)\big) := \|\hat x - x\|^p + \|\hat y - y\|^p.$$
A coupling $\pi \in \Pi(\mu, \nu)$ is causal if, for $\pi$-almost every $(\hat x, \hat y)$, it satisfies $\pi(x \in \cdot \mid \hat x, \hat y) = \pi(x \in \cdot \mid \hat x)$; that is, conditionally on $\hat x$, the component $x$ is independent of $\hat y$. The collection of all such couplings is $\Pi_c(\mu, \nu)$.
The $(p, \epsilon)$-Causal Sinkhorn Discrepancy is
$$\mathrm{CSD}_{p,\epsilon}(\mu, \nu) := \inf_{\pi \in \Pi_c(\mu, \nu)} \left\{ \int c_p \, d\pi \;+\; \epsilon \, H\big(\pi \,\|\, \mu \otimes (\nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}})\big) \right\},$$
where the relative entropy is
$$H(\pi \,\|\, \rho) := \int \log \frac{d\pi}{d\rho} \, d\pi \quad \text{if } \pi \ll \rho, \qquad +\infty \text{ otherwise.}$$
For $\epsilon = 0$, this reduces to the causal $p$-Wasserstein distance; replacing $\Pi_c(\mu, \nu)$ by $\Pi(\mu, \nu)$ (dropping the causal restriction) recovers the non-causal Sinkhorn distance.
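To make the entropic ingredient concrete, the following hedged sketch computes a standard (non-causal) entropy-regularized transport cost between two one-dimensional empirical samples with log-domain Sinkhorn iterations. The function name `sinkhorn_cost`, the uniform marginals, and the additive $p$-cost are illustrative assumptions, and the causal constraint of CSD is deliberately omitted here.

```python
# Sketch: standard (non-causal) entropy-regularized OT via log-domain Sinkhorn.
# Illustrates only the entropic term of the discrepancy; the causal constraint
# on couplings is NOT imposed.
import numpy as np
from scipy.special import logsumexp

def sinkhorn_cost(X, Z, p=2, eps=0.5, iters=500):
    """Entropy-regularized p-cost between uniform empiricals on 1-D samples X, Z."""
    n, m = len(X), len(Z)
    C = np.abs(X[:, None] - Z[None, :]) ** p          # pairwise cost c_p
    loga, logb = np.full(n, -np.log(n)), np.full(m, -np.log(m))
    f, g = np.zeros(n), np.zeros(m)                   # dual potentials
    for _ in range(iters):                            # alternating marginal fits
        f = -eps * logsumexp((g[None, :] - C) / eps + logb[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + loga[:, None], axis=0)
    # optimal coupling: pi_ij = a_i b_j exp((f_i + g_j - C_ij) / eps)
    P = np.exp((f[:, None] + g[None, :] - C) / eps + loga[:, None] + logb[None, :])
    return float((P * C).sum()), P

cost, P = sinkhorn_cost(np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.9, 2.1, 2.4]))
```

The entropic term keeps the coupling `P` strictly positive (continuous in the limit), in contrast to the sparse plans of unregularized OT; the column marginals are matched exactly after each `g` update, while the row marginals converge as the iterations proceed.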
2. Causal Consistency and Entropic Regularization
The CSD distinguishes itself from the standard Sinkhorn discrepancy (a purely entropic OT metric) by enforcing causal consistency. The standard Sinkhorn distance over $\Pi(\mu, \nu)$ allows couplings to "anticipate" future outcomes (i.e., $x$ may depend on $\hat y$), potentially violating natural temporal or informational orderings. In contrast, CSD restricts to $\Pi_c(\mu, \nu)$, where, conditioning on past covariates $\hat x$, future covariates $x$ must not depend on unseen outcomes $\hat y$. This ensures the resulting transport plans do not introduce feedback loops or causally implausible information flows. In the context of DRO, this yields ambiguity sets that exclude alternatives in which future decisions would be made with foreknowledge of future uncertainties [(Zhang et al., 16 Jan 2026), Definition 2.2 and Figure 1.2].
Entropic regularization ($\epsilon > 0$) compels the solution to yield continuous, rather than purely discrete, couplings, enhancing the smoothness and computational tractability of the resulting discrepancy metric.
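The causality condition has a simple finite-dimensional analogue: a discrete coupling tensor `pi[hx, hy, x]` is causal exactly when the conditional law of `x` given `(hx, hy)` is constant in `hy`. The following toy check (all arrays are hypothetical illustrations, not from the paper) makes this concrete.

```python
# Toy numerical check of the causal-coupling condition
# pi(x | hx, hy) = pi(x | hx) on a discrete coupling tensor pi[hx, hy, x].
import numpy as np

def is_causal(pi, tol=1e-10):
    cond = pi / pi.sum(axis=2, keepdims=True)     # conditional pi(x | hx, hy)
    # causal iff, for each hx, the conditional is constant across hy
    return bool(np.allclose(cond, cond[:, :1, :], atol=tol))

rng = np.random.default_rng(0)

# A causal coupling: x is drawn from a kernel depending only on hx.
joint_hist = rng.random((3, 4)); joint_hist /= joint_hist.sum()   # weights over (hx, hy)
kernel = rng.random((3, 5)); kernel /= kernel.sum(axis=1, keepdims=True)
pi_causal = joint_hist[:, :, None] * kernel[:, None, :]
print(is_causal(pi_causal))    # True

# A generic random coupling typically violates the constraint.
pi_any = rng.random((3, 4, 5)); pi_any /= pi_any.sum()
print(is_causal(pi_any))       # False
```

The first construction factors as "(hx, hy)-weight times an x-kernel of hx only", which is exactly the non-anticipative structure CSD enforces on transport plans.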
3. Duality, Strong Dual Formulation, and Optimization
Given an empirical distribution $\hat{\mathbb{P}}_n$, the robust optimization problem
$$\sup_{\mathbb{Q} \,:\, \mathrm{CSD}_{p,\epsilon}(\hat{\mathbb{P}}_n, \mathbb{Q}) \leq \rho} \mathbb{E}_{\mathbb{Q}}\big[\Psi(f(x), y)\big]$$
admits a strong dual formulation:
$$\inf_{\lambda \geq 0} \left\{ \lambda \rho + \lambda \epsilon \, \mathbb{E}_{\hat x} \left[ \log \int_{\mathcal{X}} \exp\big(r_\lambda(\hat x, x)\big) \, d\nu_{\mathcal{X}}(x) \right] \right\},$$
where
$$r_\lambda(\hat x, x) := \mathbb{E}_{\hat y \mid \hat x} \left[ \log \int_{\mathcal{Y}} \exp\big(s_\lambda(\hat x, \hat y; x, y)\big) \, d\nu_{\mathcal{Y}}(y) \right], \qquad s_\lambda(\hat x, \hat y; x, y) := \frac{\Psi(f(x), y) - \lambda \, c_p((\hat x, \hat y), (x, y))}{\lambda \epsilon},$$
a nested log-sum-exp structure that mirrors the causal ordering.
Under mild conditions, strong duality holds and the infimum is attained at a unique $\lambda^* > 0$ (by strict convexity in $\lambda$) [(Zhang et al., 16 Jan 2026), Theorem 3.1].
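A minimal numerical sketch of such a nested log-sum-exp dual objective, under simplifying assumptions: scalar data, uniform reference measures on finite grids, a toy squared loss in place of $\Psi$, $p = 2$ cost, and a point-mass conditional $\hat y \mid \hat x$ for each empirical pair. All names and grids are illustrative, not the paper's implementation.

```python
# Hedged sketch: evaluate the nested log-sum-exp dual objective on a grid and
# locate the minimizing dual variable lambda by 1-D search.
import numpy as np
from scipy.special import logsumexp

def dual_objective(lam, rho, eps, hx, hy, xg, yg):
    """lam*rho + lam*eps * E_hx[ log int exp(r(hx, x)) d nu_X(x) ] (grid version)."""
    psi = (xg[:, None] - yg[None, :]) ** 2                      # toy loss Psi(f(x), y)
    total = 0.0
    for xh, yh in zip(hx, hy):                                  # empirical (hx, hy) pairs
        c2 = (xh - xg[:, None]) ** 2 + (yh - yg[None, :]) ** 2  # p = 2 transport cost
        s = (psi - lam * c2) / (lam * eps)                      # s_lambda(hx, hy; x, y)
        r = logsumexp(s, axis=1) - np.log(len(yg))              # log int e^s d nu_Y (uniform)
        total += logsumexp(r) - np.log(len(xg))                 # log int e^r d nu_X (uniform)
    return lam * rho + lam * eps * total / len(hx)

hx = np.array([0.0, 1.0, 2.0]); hy = np.array([0.1, 0.9, 2.2])  # empirical sample
xg = np.linspace(-1.0, 3.0, 41); yg = np.linspace(-1.0, 3.0, 41)
lams = np.linspace(0.5, 20.0, 40)
vals = [dual_objective(l, 0.5, 0.1, hx, hy, xg, yg) for l in lams]
lam_star = float(lams[int(np.argmin(vals))])                    # dual minimizer on the grid
```

Working in log-sum-exp space keeps the evaluation numerically stable even when $\lambda\epsilon$ is small and the exponents are large.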
4. Structure of the Worst-Case Distribution
If $\lambda^*$ solves the dual problem, the worst-case distribution $\mathbb{Q}^*$ achieving the supremum is the mixture, over empirical points $(\hat x, \hat y) \sim \hat{\mathbb{P}}_n$, of kernels with density
$$q^*(x, y \mid \hat x, \hat y) = \alpha(\hat x)\, e^{r(\hat x, x)}\, \beta(\hat x, \hat y; x)\, e^{s(\hat x, \hat y; x, y)}$$
with respect to $\nu_{\mathcal{X}} \otimes \nu_{\mathcal{Y}}$, where
\begin{align*}
r(\hat x, x) &:= \mathbb{E}_{\hat y \mid \hat x} \left[ \log \int_{\mathcal{Y}} \exp\big(s(\hat x, \hat y; x, y)\big) \, d\nu_{\mathcal{Y}}(y) \right], \\
s(\hat x, \hat y; x, y) &:= \frac{\Psi(f(x), y) - \lambda^* \, c_p((\hat x, \hat y), (x, y))}{\lambda^* \epsilon}, \\
\alpha(\hat x) &:= \left[ \int_{\mathcal{X}} \exp\big(r(\hat x, x')\big) \, d\nu_{\mathcal{X}}(x') \right]^{-1}, \\
\beta(\hat x, \hat y; x) &:= \left[ \int_{\mathcal{Y}} \exp\big(s(\hat x, \hat y; x, y')\big) \, d\nu_{\mathcal{Y}}(y') \right]^{-1}.
\end{align*}
This density is a continuous mixture of Gibbs kernels, in contrast to the discrete support typically arising in non-regularized or non-causal OT settings [(Zhang et al., 16 Jan 2026), Theorem 3.2].
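On a finite grid the normalizers $\alpha$ and $\beta$ above can be computed directly, and one can verify that the resulting Gibbs conditionals integrate to one. This sketch reuses a toy squared loss and uniform reference measures on grids; `lam` stands in for $\lambda^*$, and all names are illustrative.

```python
# Illustrative discretization of the Gibbs-kernel normalizers alpha and beta,
# with uniform reference measures on finite grids and a toy squared loss.
import numpy as np
from scipy.special import logsumexp

def gibbs_kernels(xh, yh, lam, eps, xg, yg):
    psi = (xg[:, None] - yg[None, :]) ** 2                      # toy Psi(f(x), y)
    c2 = (xh - xg[:, None]) ** 2 + (yh - yg[None, :]) ** 2      # p = 2 cost
    s = (psi - lam * c2) / (lam * eps)                          # s(hx, hy; x, y)
    r = logsumexp(s, axis=1) - np.log(len(yg))                  # r(hx, x), point-mass E_{hy|hx}
    log_alpha = -(logsumexp(r) - np.log(len(xg)))               # log alpha(hx)
    log_beta = -(logsumexp(s, axis=1) - np.log(len(yg)))        # log beta(hx, hy; x)
    px = np.exp(log_alpha + r) / len(xg)                        # conditional of x given hx
    py = np.exp(log_beta[:, None] + s) / len(yg)                # conditional of y given (hx, hy, x)
    return px, py

xg = np.linspace(-1.0, 2.0, 31); yg = np.linspace(-1.0, 2.0, 31)
px, py = gibbs_kernels(0.5, 0.6, lam=5.0, eps=0.1, xg=xg, yg=yg)
# By construction of alpha and beta, px sums to 1 and every row of py sums to 1.
```

The two-stage structure (first an x-kernel depending on $\hat x$ only, then a y-kernel given $(\hat x, \hat y, x)$) is exactly the causal factorization of the worst-case transport.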
5. Theoretical Properties: Metricity, Regularity, and Optimization
- Metric properties and continuity: $\mathrm{CSD}_{p,\epsilon}$ is a true metric on $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ under causal couplings. The entropic regularization ($\epsilon > 0$) ensures strong convexity and smoothness in the dual formulation, providing stability and numerical tractability.
- Tractability: The dual reformulation reduces the CSD evaluation to a single-level minimization over $\lambda \geq 0$ involving expectations of log-sum-exp functions. For parametric decision rules $f_\theta$, the outer min-max problem can be expressed as a multi-level stochastic compositional optimization.
- Optimization methods: Sample-Average Approximation (SAA) achieves $\delta$-optimality in $\mathcal{O}(\delta^{-2})$ Monte Carlo samples, and a Stochastically Corrected Stochastic Compositional (SCSC) gradient method achieves a $\delta$-stationary point in $\mathcal{O}(\delta^{-4})$ gradient iterations, which matches lower complexity bounds for nonconvex stochastic optimization [(Zhang et al., 16 Jan 2026), Propositions 4.2, 4.3; Theorems 5.1, 5.2].
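The SCSC correction can be illustrated on a generic two-level compositional problem $\min_x f(\mathbb{E}[g(x;\xi)])$. The toy choices $f(u) = u^2$, $g(x;\xi) = x + \xi$ (zero-mean noise, so the true objective is $x^2$), and the step sizes below are hypothetical stand-ins for the multi-level structure of the CSD objective, not the paper's algorithm.

```python
# Toy sketch of the SCSC correction: track the inner expectation E[g(x; xi)]
# with a corrected moving average while running gradient steps on the outer f.
import numpy as np

rng = np.random.default_rng(1)

def g(x, xi):
    return x + xi                        # inner stochastic map, E[g(x; xi)] = x

x, u = 3.0, 3.0                          # iterate and tracked inner estimate
alpha, beta = 0.05, 0.3                  # hypothetical step size / correction weight
for _ in range(2000):
    grad = 2.0 * u                       # chain rule: f'(u) * dg/dx, with dg/dx = 1
    x_new = x - alpha * grad
    xi = rng.normal(scale=0.5)           # one fresh sample reused at both points
    # SCSC update: moving average of g plus a correction for the shift x -> x_new,
    # which removes the bias of plain stochastic compositional tracking.
    u = (1 - beta) * (u + g(x_new, xi) - g(x, xi)) + beta * g(x_new, xi)
    x = x_new
print(round(x, 3))                       # drifts toward the minimizer x* = 0
```

The correction term `g(x_new, xi) - g(x, xi)` is what lets the tracker keep up with a moving iterate using a single sample per step.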
6. Causal Sinkhorn Discrepancy in Contextual DRO
CSD enables the formulation of ambiguity sets in contextual DRO that respect both entropy-induced continuity and causal transport restrictions. Given the empirical joint law $\hat{\mathbb{P}}_n$, the ambiguity set is
$$\mathcal{B}_\rho(\hat{\mathbb{P}}_n) := \big\{ \mathbb{Q} : \mathrm{CSD}_{p,\epsilon}(\hat{\mathbb{P}}_n, \mathbb{Q}) \leq \rho \big\}.$$
This set excludes (i) discrete couplings when $\epsilon > 0$ (thus enforcing continuity), and (ii) causally implausible transports (enforcing $\pi \in \Pi_c$). The resulting contextual DRO problem is:
$$\min_{\theta} \ \sup_{\mathbb{Q} \in \mathcal{B}_\rho(\hat{\mathbb{P}}_n)} \mathbb{E}_{\mathbb{Q}}\big[\Psi(f_\theta(x), y)\big].$$
This structure ensures that worst-case analysis never incorporates information unavailable under natural causal orderings and that predictions/policies are robust to plausible distributional perturbations consistent with the observed data-generating process [(Zhang et al., 16 Jan 2026), Section 6].
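An end-to-end toy sketch of the resulting min-max problem, combining a nested log-sum-exp dual evaluation with a grid search over a scalar decision parameter $\theta$ and the dual variable $\lambda$, with $\Psi(f_\theta(x), y) = (\theta x - y)^2$. The data, grids, and constants are illustrative placeholders, not the paper's experiments.

```python
# Toy contextual DRO via the dual form: for each candidate slope theta, minimize
# the dual bound over lambda, then pick the theta with the smallest robust value.
import numpy as np
from scipy.special import logsumexp

hx = np.array([0.0, 1.0, 2.0]); hy = np.array([0.0, 1.1, 1.9])   # empirical data
xg = np.linspace(-1.0, 3.0, 31); yg = np.linspace(-1.0, 3.0, 31) # reference grids
rho, eps = 0.3, 0.1                                              # radius / regularization

def robust_value(theta, lam):
    """Dual upper bound on the worst-case expected loss for decision theta."""
    psi = (theta * xg[:, None] - yg[None, :]) ** 2
    total = 0.0
    for xh, yh in zip(hx, hy):
        c2 = (xh - xg[:, None]) ** 2 + (yh - yg[None, :]) ** 2
        s = (psi - lam * c2) / (lam * eps)
        r = logsumexp(s, axis=1) - np.log(len(yg))
        total += logsumexp(r) - np.log(len(xg))
    return lam * rho + lam * eps * total / len(hx)

thetas = np.linspace(0.0, 2.0, 21); lams = np.linspace(1.0, 30.0, 30)
vals = np.array([[robust_value(t, l) for l in lams] for t in thetas])
theta_rob = float(thetas[int(np.argmin(vals.min(axis=1)))])      # robust slope
```

In practice the grid search over $(\theta, \lambda)$ would be replaced by the stochastic compositional methods of Section 5; the sketch only exhibits the min-max structure.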
7. Summary Table: Key Distinctions
| Property | Standard Sinkhorn | Causal Sinkhorn Discrepancy (CSD) |
|---|---|---|
| Coupling Constraint | All couplings ($\Pi(\mu, \nu)$) | Causal couplings ($\Pi_c(\mu, \nu)$) |
| Feedback Allowed | Yes | No (enforces $x \perp \hat y \mid \hat x$) |
| Entropic Regularization | Optional | Required for continuity, tractability |
| Metricity | Non-causal OT metric | Causal metric on $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ |
| Worst-case Distribution | Discrete/Continuous | Continuous mixture of Gibbs kernels |
The Causal Sinkhorn Discrepancy is a foundational tool for robust, interpretable, and causally consistent optimization in modern machine learning and statistics (Zhang et al., 16 Jan 2026).