Invariant Feature Extraction Through Conditional Independence and the Optimal Transport Barycenter Problem: the Gaussian case

Published 24 Dec 2025 in math.ST, stat.AP, and stat.ML | (2512.20914v1)

Abstract: A methodology is developed to extract $d$ invariant features $W=f(X)$ that predict a response variable $Y$ without being confounded by variables $Z$ that may influence both $X$ and $Y$. The methodology's main ingredient is the penalization of any statistical dependence between $W$ and $Z$ conditioned on $Y$, replaced by the more readily implementable plain independence between $W$ and the random variable $Z_Y = T(Z,Y)$ that solves the [Monge] Optimal Transport Barycenter Problem for $Z\mid Y$. In the Gaussian case considered in this article, the two statements are equivalent. When the true confounders $Z$ are unknown, other measurable contextual variables $S$ can be used as surrogates, a replacement that involves no relaxation in the Gaussian case if the covariance matrix $\Sigma_{ZS}$ has full range. The resulting linear feature extractor adopts a closed form in terms of the first $d$ eigenvectors of a known matrix. The procedure extends with little change to more general, non-Gaussian / non-linear cases.


Authors (4)

Summary

The paper introduces a novel framework for invariant feature extraction using conditional independence to enhance out-of-distribution generalization.
Methodologically, it reduces the problem to an eigen-decomposition task by leveraging the optimal transport barycenter in Gaussian cases.
Empirical evaluations show that the barycentric method outperforms anchor regression in mitigating confounding and lowering target MSE.

Invariant Feature Extraction via Conditional Independence and Optimal Transport Barycenters: The Gaussian Case

Problem Formulation and Theoretical Foundations

The paper "Invariant Feature Extraction Through Conditional Independence and the Optimal Transport Barycenter Problem: the Gaussian case" (2512.20914) develops a methodology for out-of-distribution (OOD) generalization by extracting invariant features from data, emphasizing cases where confounding variables complicate transfer learning. The central goal is to construct features $W = f(X)$ that are maximally predictive for a response variable $Y$ in new environments, while being unconfounded by latent variables $Z$ potentially entangled with both $X$ and $Y$. The methodology formalizes this robustness by penalizing any residual conditional dependence between $W$ and $Z$ given $Y$.

The theoretical core involves leveraging conditional independence in conjunction with the Monge Optimal Transport Barycenter Problem (OTBP). In the Gaussian regime, the requirement $W \perp Z\,|\,Y$ is shown to be equivalent to independence between $W$ and the barycentric residual $T(Z, Y)$, which solves an OTBP over the conditional distributions $Z\,|\,Y$. When $Z$ is not directly observed, the framework admits observable surrogates $S$ under regularity conditions, with full-rank $\Sigma_{ZS}$ ensuring that independence from $S$ given $Y$ implies independence from $Z$ given $Y$.
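The Gaussian equivalence can be made explicit with a short covariance computation. What follows is a sketch, assuming $(X, Y, Z)$ jointly Gaussian, a linear feature $W = AX$, and $T(Z, Y)$ taken to be the regression residual of $Z$ on $Y$ (as this summary states for the Gaussian case); the symbols $B$ and $\Sigma_{WZ \mid Y}$ are introduced here for illustration and are not taken from the paper.

$B = \Sigma_{ZY}\Sigma_{YY}^{-1}, \qquad T(Z, Y) = Z - B\,(Y - \mathbb{E}[Y]),$

$\mathrm{Cov}(W, T(Z, Y)) = \Sigma_{WZ} - \Sigma_{WY}\Sigma_{YY}^{-1}\Sigma_{YZ} = \Sigma_{WZ \mid Y}.$

The right-hand side is the conditional covariance of $W$ and $Z$ given $Y$, which in the Gaussian case does not depend on the value of $Y$. Since uncorrelated jointly Gaussian variables are independent, $W \perp Z \mid Y$ holds exactly when $\mathrm{Cov}(W, T(Z, Y)) = 0$, that is, when $W \perp T(Z, Y)$.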

This conditional invariance objective is paired with a predictive sufficiency criterion: $W$ should preserve maximal information relevant to $Y$. In the Gaussian case the joint optimization admits a closed-form solution, reducing to an eigen-decomposition of a matrix that combines the predictive term and the invariance penalty.

Optimal Transport Barycenter Construction and Feature Extraction

The approach operationalizes conditional invariance using the optimal transport barycenter mapping $T(Z, Y)$, defined as the solution to

$T(Z, Y) = \arg \min_{U: U \perp Y} \mathbb{E}[c(Z, U)],$

with $c$ the quadratic cost. This barycentric residual encapsulates the portion of $Z$ unexplainable by $Y$, and requiring $W$ to be independent of $T(Z, Y)$ effectively removes spurious correlations due to environment-specific confounding. The Gaussian case yields a practical implementation: $T(Z, Y)$ coincides with the regression residual of $Z$ on $Y$.
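The Gaussian statement above can be checked numerically. The following is a minimal sketch, not the paper's code: it simulates a hypothetical jointly Gaussian pair $(Z, Y)$, forms the linear regression residual of $Z$ on $Y$ from sample covariances, and verifies that the residual is uncorrelated with $Y$ (which, for jointly Gaussian variables, amounts to independence). The dimensions, the coefficients `B_true`, and the noise covariance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical jointly Gaussian (Z, Y): Z is 2-dimensional, Y is scalar.
Y = rng.normal(size=(n, 1))
B_true = np.array([[1.5], [-0.7]])            # illustrative effect of Y on Z
noise = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 0.5]], size=n)
Z = Y @ B_true.T + noise

# Regression coefficient from sample covariances: B = Sigma_ZY Sigma_YY^{-1}.
Zc, Yc = Z - Z.mean(axis=0), Y - Y.mean(axis=0)
Sigma_ZY = Zc.T @ Yc / n
Sigma_YY = Yc.T @ Yc / n
B = Sigma_ZY @ np.linalg.inv(Sigma_YY)

# Regression residual of Z on Y; it keeps the overall mean of Z.
Z_res = Z - Yc @ B.T

# Sample cross-covariance between the residual and Y (zero up to round-off).
cross_cov = (Z_res - Z_res.mean(axis=0)).T @ Yc / n
print("estimated B:", B.ravel())
print("max |Cov(Z_res, Y)|:", np.abs(cross_cov).max())
```

The residual is uncorrelated with $Y$ by construction of the least-squares coefficient; the Gaussian assumption is what upgrades this to full independence.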

For feature extraction, a single objective quantifies the trade-off between predictive sufficiency and conditional invariance:

$L_\lambda(a) = (1 - \lambda) (a^\top C)^2 - \lambda (a^\top D)^2,$

where $C = \Sigma_{X Y}$, $D = \Sigma_{X \tilde{Z}}$ with $\tilde{Z} = T(Z, Y)$ the barycentric residual, and $\lambda \in [0, 1]$ weights invariance against prediction. Because $L_\lambda(a) = a^\top H a$ with $H = (1-\lambda) CC^\top - \lambda DD^\top$, the optimal normalized direction $a^\ast$, which maximizes $L_\lambda$, is the leading eigenvector of $H$. For $d$-dimensional features, the optimal feature matrix $A$ is built from the top $d$ eigenvectors of the correspondingly generalized $H$.
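The eigenvector construction can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming a plain unit-norm constraint on each direction (the paper may normalize differently, for example with respect to $\Sigma_{XX}$); the function name `invariant_directions` and the toy matrices are hypothetical.

```python
import numpy as np

def invariant_directions(Sigma_XY, Sigma_XZt, lam, d=1):
    """Top-d eigenvectors of H = (1 - lam) C C^T - lam D D^T.

    Sigma_XY  : (p, q) covariance between X and Y        (C in the text)
    Sigma_XZt : (p, m) covariance between X and T(Z, Y)  (D in the text)
    Assumes unit-norm directions; this normalization is an assumption.
    """
    C = np.asarray(Sigma_XY, dtype=float)
    D = np.asarray(Sigma_XZt, dtype=float)
    H = (1.0 - lam) * (C @ C.T) - lam * (D @ D.T)
    # H is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(H)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order], eigvals[order]

# Toy covariances: the first coordinate of X is both predictive and confounded.
C = np.array([[1.0], [1.0], [0.0]])
D = np.array([[1.0], [0.0], [0.0]])

A0, _ = invariant_directions(C, D, lam=0.0)   # no invariance penalty
A1, _ = invariant_directions(C, D, lam=0.5)   # balanced trade-off

print("overlap with D at lam=0.0:", abs(A0[:, 0] @ D[:, 0]))
print("overlap with D at lam=0.5:", abs(A1[:, 0] @ D[:, 0]))
```

In this toy setting, raising $\lambda$ rotates the leading direction away from the confounded coordinate while keeping weight on the purely predictive one.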

Figure 1: Scatterplot comparing the best target MSE of the barycentric and anchor regression methods across distribution shifts; the barycentric method frequently outperforms anchor regression.

In settings where $Z$ is unobserved, an analogous procedure is possible using contextual variables $S$, provided the cross-covariance $\Sigma_{ZS}$ guarantees their informativeness. The paper justifies this surrogate-based replacement via conditional independence lemmas; in the jointly Gaussian case with full-rank $\Sigma_{ZS}$ it holds exactly, with no relaxation.

Empirical Evaluation and Comparative Results

The paper conducts systematic population-level experiments under structured covariance shifts to compare the barycentric method, anchor regression, and OLS. Across varied source-target environment pairs, barycentric reduction yields consistently lower target MSE compared to both anchor regression (using $S$ as anchor) and OLS, especially as the Frobenius distance between covariances increases:

Figure 2: Frobenius distance between source and target covariances versus the method achieving the lowest target error; the barycentric method excels under substantial distributional shift.

Conditional invariance improves systematically as the regularization parameter $\lambda$ increases: the conditional correlation curves decay across independently generated environments. In the finite-sample regime, tuning $\lambda$ is critical, and the optimal $\lambda$ follows a bimodal distribution: OLS ($\lambda=0$) prevails when shifts are negligible, while maximal invariance ($\lambda=1$) is essential under pronounced shifts. This supports the importance of penalizing confounding in OOD generalization.
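The effect of $\lambda$ on conditional dependence can be illustrated at the population level. The sketch below is not the paper's experiment: it draws a hypothetical random joint covariance for $(X, Y, Z)$, uses a unit-norm leading eigenvector (an assumed normalization), and tracks the unnormalized conditional covariance $\|a_\lambda^\top \Sigma_{XZ \mid Y}\|$ rather than the conditional correlation discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 4, 2          # dimensions of X and Z; Y is scalar (all hypothetical)
k = p + 1 + m

# Random positive-definite joint covariance of (X, Y, Z), for illustration only.
M = rng.normal(size=(k, k))
Sigma = M @ M.T + np.eye(k)
ix, iy, iz = slice(0, p), slice(p, p + 1), slice(p + 1, k)

C = Sigma[ix, iy]    # Sigma_XY, shape (p, 1)
# Covariance between X and the regression residual of Z on Y,
# which equals the conditional covariance Sigma_{XZ|Y} in the Gaussian case.
D = Sigma[ix, iz] - C @ np.linalg.solve(Sigma[iy, iy], Sigma[iy, iz])

def leading_direction(lam):
    """Unit-norm leading eigenvector of H = (1 - lam) C C^T - lam D D^T."""
    H = (1.0 - lam) * (C @ C.T) - lam * (D @ D.T)
    _, vecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    return vecs[:, -1]

lams = np.linspace(0.0, 1.0, 11)
cond_cov = np.array([np.linalg.norm(leading_direction(lam) @ D) for lam in lams])

for lam, value in zip(lams, cond_cov):
    print(f"lambda = {lam:.1f}   ||Cov(W, Z | Y)|| = {value:.4f}")
```

Under these assumptions the printed norm is non-increasing in $\lambda$ and vanishes at $\lambda = 1$, where the direction falls in the null space of $D^\top$ (available here because $p > m$).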

Figure 3: Conditional correlation curves $\lambda \mapsto \|\,\text{Corr}(W_\lambda, Z | Y)\|_F$ showing systematic reduction in conditional correlation as invariance penalty increases.

Figure 4: Distribution of optimal $\lambda$, highlighting bimodal behavior with modes at OLS ($\lambda=0$) and maximal invariance ($\lambda=1$).

Implications and Extensions

This barycentric methodology introduces a principled framework for extracting invariant features with theoretical guarantees tied to optimal transport. It unifies several threads in causal representation learning, invariant risk minimization (IRM), and domain generalization by making conditional independence operational through OTBP machinery.

Practical implications include:

Improved OOD generalization via principled removal of spurious correlations.
Applicability in scenarios with latent confounders and only surrogate variables.
Explicit optimization procedures amenable to closed-form solutions in linear Gaussian settings, and extensible to nonlinear/non-Gaussian cases via computational optimal transport algorithms.

Theoretically, this approach bypasses the need for explicit causal models or full observability of confounders, relying instead on independence structure and covariance geometry. Extensions to the framework are naturally envisioned: handling categorical contextual variables through simplex embeddings, incorporating multiple source environments for improved cross-validation and confounding mitigation, and moving beyond linear models via RKHS strategies and off-line computation of barycenters in general distributions.

The use of barycentric reduction offers a consistent generalization of Fisher-LDA in classification and a robust alternative to anchor regression for regression under environment shifts.

Conclusion

The presented methodology offers a mathematically rigorous and practically effective principle for invariant feature extraction in the presence of confounding factors. Through optimal transport barycenter theory and conditional independence, it enables robust transfer learning and predictive modeling across environments, with closed-form solutions in Gaussian cases and extensibility to more general regimes. The empirical and theoretical contributions lay foundations for future work in invariant conditional density estimation, nonlinear generalization, and domain-adaptive learning in complex data regimes.
