
Invariant Feature Extraction Through Conditional Independence and the Optimal Transport Barycenter Problem: the Gaussian case

Published 24 Dec 2025 in math.ST, stat.AP, and stat.ML | (2512.20914v1)

Abstract: A methodology is developed to extract $d$ invariant features $W=f(X)$ that predict a response variable $Y$ without being confounded by variables $Z$ that may influence both $X$ and $Y$. The methodology's main ingredient is the penalization of any statistical dependence between $W$ and $Z$ conditioned on $Y$, replaced by the more readily implementable plain independence between $W$ and the random variable $Z_Y = T(Z,Y)$ that solves the [Monge] Optimal Transport Barycenter Problem for $Z\mid Y$. In the Gaussian case considered in this article, the two statements are equivalent. When the true confounders $Z$ are unknown, other measurable contextual variables $S$ can be used as surrogates, a replacement that involves no relaxation in the Gaussian case if the covariance matrix $Σ_{ZS}$ has full range. The resulting linear feature extractor adopts a closed form in terms of the first $d$ eigenvectors of a known matrix. The procedure extends with little change to more general, non-Gaussian / non-linear cases.

Summary

  • The paper introduces a novel framework for invariant feature extraction using conditional independence to enhance out-of-distribution generalization.
  • Methodologically, it reduces the problem to an eigen-decomposition task by leveraging the optimal transport barycenter in Gaussian cases.
  • Empirical evaluations show that the barycentric method outperforms anchor regression in mitigating confounding and lowering target MSE.

Invariant Feature Extraction via Conditional Independence and Optimal Transport Barycenters: The Gaussian Case

Problem Formulation and Theoretical Foundations

The paper "Invariant Feature Extraction Through Conditional Independence and the Optimal Transport Barycenter Problem: the Gaussian case" (2512.20914) develops a methodology for out-of-distribution (OOD) generalization by extracting invariant features from data, emphasizing cases where confounding variables complicate transfer learning. The central goal is to construct features $W = f(X)$ that are maximally predictive for a response variable $Y$ in new environments, while being unconfounded by latent variables $Z$ potentially entangled with both $X$ and $Y$. The methodology formalizes this robustness by penalizing any residual conditional dependence between $W$ and $Z$ given $Y$.

The theoretical core involves leveraging conditional independence in conjunction with the Monge Optimal Transport Barycenter Problem (OTBP). In the Gaussian regime, the requirement $W \perp Z \mid Y$ is shown to be equivalent to independence between $W$ and the barycentric residual $T(Z, Y)$, which solves an OTBP over the conditional distributions $Z \mid Y$. When $Z$ is not directly observed, the framework admits observable surrogates $S$ under regularity conditions, with full-rank $\Sigma_{ZS}$ ensuring that independence from $S$ given $Y$ implies independence from $Z$ given $Y$.
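The equivalence can be made concrete with a short covariance computation. The sketch below assumes that $(W, Z, Y)$ are jointly Gaussian (for instance, $W$ linear in a Gaussian $X$) and writes $\tilde{Z}$ for the residual of the linear regression of $Z$ on $Y$; the symbols $\tilde{Z}$ and $\mu_Y$ are notation introduced here for illustration.

```latex
\begin{align*}
\tilde{Z} &= Z - \Sigma_{ZY}\,\Sigma_{YY}^{-1}\,(Y - \mu_Y), \\
\operatorname{Cov}(W, \tilde{Z})
  &= \Sigma_{WZ} - \Sigma_{WY}\,\Sigma_{YY}^{-1}\,\Sigma_{YZ}
   = \operatorname{Cov}(W, Z \mid Y).
\end{align*}
```

For jointly Gaussian variables, zero covariance is equivalent to independence, so $W \perp \tilde{Z}$ holds exactly when the conditional covariance of $W$ and $Z$ given $Y$ vanishes, that is, when $W \perp Z \mid Y$.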

This conditional invariance objective is paired with a predictive sufficiency criterion: $W$ should preserve maximal information relevant to $Y$. The joint optimization is obtained via a closed-form solution in the Gaussian case, where the problem reduces to an eigen-decomposition over a matrix combining predictive and invariance penalties.

Optimal Transport Barycenter Construction and Feature Extraction

The approach operationalizes conditional invariance using the optimal transport barycenter mapping $T(Z, Y)$, defined as the solution to

$$T(Z, Y) = \arg\min_{U:\, U \perp Y} \mathbb{E}[c(Z, U)],$$

with $c$ the quadratic cost. This barycentric residual encapsulates the portion of $Z$ unexplainable by $Y$, and requiring $W$ to be independent of $T(Z, Y)$ effectively removes spurious correlations due to environment-specific confounding. The Gaussian case yields a practical implementation: $T(Z, Y)$ coincides with the regression residual of $Z$ on $Y$.
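As a minimal sketch of the Gaussian statement above, the following NumPy snippet computes the centered regression residual of $Z$ on $Y$ from samples and checks that it is empirically uncorrelated with $Y$. The synthetic data, the dimensions, and the function name `regression_residual` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def regression_residual(Z, Y):
    """Residual of the least-squares linear regression of Z on Y (with intercept).

    Z: (n, p) array, Y: (n, q) array. Returns an (n, p) array of centered residuals.
    """
    Zc = Z - Z.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Solve Yc @ B ~= Zc in the least-squares sense; B has shape (q, p).
    B, *_ = np.linalg.lstsq(Yc, Zc, rcond=None)
    return Zc - Yc @ B

rng = np.random.default_rng(0)
n = 5000
Y = rng.normal(size=(n, 1))
# Hypothetical confounder: linear in Y plus independent Gaussian noise.
Z = Y @ np.array([[1.5, -0.7]]) + rng.normal(size=(n, 2))

Z_res = regression_residual(Z, Y)
# Empirical cross-covariance between the residual and Y, shape (2, 1).
cross_cov = Z_res.T @ (Y - Y.mean(axis=0)) / n
max_abs_cross_cov = np.abs(cross_cov).max()
print(max_abs_cross_cov)
```

The residual is orthogonal to the centered $Y$ by construction of least squares, so the printed value is zero up to floating-point error. In the jointly Gaussian setting this lack of correlation amounts to independence from $Y$.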

For feature extraction, a primary objective function quantifies the trade-off between predictive sufficiency and conditional invariance:

$$L_\lambda(a) = (1 - \lambda)\,(a^\top C)^2 - \lambda\,(a^\top D)^2,$$

where $C = \Sigma_{XY}$ and $D = \Sigma_{X\tilde{Z}}$. Because $L_\lambda(a) = a^\top H a$ with $H = (1-\lambda)\,CC^\top - \lambda\,DD^\top$, the optimal unit-norm direction $a^\ast$, which maximizes $L_\lambda$, is the leading eigenvector of $H$. For multidimensional scenarios, the optimal feature matrix $A$ is constructed from the top $d$ eigenvectors of the generalized $H$.
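The eigenvector construction can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's implementation: the function name `barycentric_directions` and the toy cross-covariances are hypothetical, and for $d > 1$ the same matrix $H$ is reused, which may differ from the generalized $H$ the paper has in mind.

```python
import numpy as np

def barycentric_directions(C, D, lam, d):
    """Top-d eigenvectors of H = (1 - lam) C C^T - lam D D^T.

    C: (p, q) cross-covariance of X with Y.
    D: (p, r) cross-covariance of X with the residual confounder.
    Returns (A, eigvals): A has shape (p, d) with orthonormal columns.
    """
    H = (1.0 - lam) * C @ C.T - lam * D @ D.T
    eigvals, eigvecs = np.linalg.eigh(H)      # H is symmetric; eigenvalues ascend
    order = np.argsort(eigvals)[::-1][:d]     # indices of the d largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# Toy example with hypothetical cross-covariances in p = 3 dimensions.
C = np.array([[1.0], [1.0], [0.0]])   # X1 and X2 both covary with Y
D = np.array([[0.0], [1.0], [0.0]])   # only X2 covaries with the residual confounder

A0, _ = barycentric_directions(C, D, lam=0.0, d=1)   # pure prediction
A1, _ = barycentric_directions(C, D, lam=0.5, d=1)   # balanced trade-off

# Relative weight placed on the confounded coordinate X2 versus X1.
ratio0 = abs(A0[1, 0] / A0[0, 0])
ratio1 = abs(A1[1, 0] / A1[0, 0])
print(ratio0, ratio1)
```

In this toy setup, increasing $\lambda$ from 0 to 0.5 shrinks the weight on the confounded coordinate relative to the unconfounded one, which is the qualitative behavior the invariance penalty is meant to produce. Eigenvectors are defined only up to sign, hence the absolute values.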

Figure 1: Scatterplot comparing the best target MSE for the barycentric and anchor regression methods across distribution shifts, with the barycentric method frequently outperforming anchor regression.

In settings where $Z$ is unobserved, an analogous procedure is possible using contextual variables $S$, provided the correlation structure $\Sigma_{ZS}$ guarantees their informativeness. This surrogate-based relaxation is theoretically justified in the paper via conditional independence lemmas and holds exactly in the joint Gaussian case with full-rank covariance.

Empirical Evaluation and Comparative Results

The paper conducts systematic population-level experiments under structured covariance shifts to compare the barycentric method, anchor regression, and OLS. Across varied source-target environment pairs, barycentric reduction yields consistently lower target MSE compared to both anchor regression (using $S$ as anchor) and OLS, especially as the Frobenius distance between covariances increases.

Figure 2: Frobenius distance between source and target covariances versus the method achieving the lowest target error; the barycentric method excels under substantial distributional shift.
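To illustrate what a population-level evaluation of this kind involves, the sketch below computes the exact MSE of a linear predictor from second moments and the Frobenius distance between two covariances. It covers only the OLS baseline, not the barycentric method or anchor regression; the two covariance matrices are hypothetical, variables are assumed zero-mean, and measuring the distance on the joint covariance of $(X, Y)$ is an assumption about the setup.

```python
import numpy as np

def population_mse(b, Sxx, Sxy, Syy):
    """E[(Y - b^T X)^2] for zero-mean (X, Y) with the given second moments."""
    return float(Syy - 2.0 * b @ Sxy + b @ Sxx @ b)

def ols_coefficients(Sxx, Sxy):
    """Population OLS coefficients, solving Sxx b = Sxy."""
    return np.linalg.solve(Sxx, Sxy)

def split(S):
    """Split a joint covariance of (X1, X2, Y) into (Sxx, Sxy, Syy)."""
    return S[:2, :2], S[:2, 2], S[2, 2]

# Hypothetical joint covariances of (X1, X2, Y); only Cov(X2, Y) shifts.
source = np.array([[1.0, 0.5, 0.8],
                   [0.5, 1.0, 0.6],
                   [0.8, 0.6, 1.0]])
target = np.array([[1.0, 0.5, 0.8],
                   [0.5, 1.0, 0.1],
                   [0.8, 0.1, 1.0]])

b_source = ols_coefficients(*split(source)[:2])
b_target = ols_coefficients(*split(target)[:2])

mse_source = population_mse(b_source, *split(source))
mse_transfer = population_mse(b_source, *split(target))   # source OLS, target data
mse_oracle = population_mse(b_target, *split(target))     # OLS refit on target
frob = np.linalg.norm(source - target)                    # Frobenius distance
print(mse_source, mse_transfer, mse_oracle, frob)
```

The gap between `mse_transfer` and `mse_oracle` is the cost of carrying source-fitted coefficients into a shifted environment, which is the quantity that invariance-seeking methods aim to reduce.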

Conditional invariance is systematically enhanced by increasing the regularization parameter $\lambda$, with observed decay in conditional correlation curves across independently generated environments. The finite-sample regime demonstrates that tuning $\lambda$ is critical, with a bimodal distribution for the optimal $\lambda$: OLS prevails when shifts are negligible ($\lambda = 0$), while maximal invariance is essential under pronounced shifts ($\lambda = 1$), substantiating the importance of penalizing confounding in OOD generalization.

Figure 3: Conditional correlation curves $\lambda \mapsto \|\mathrm{Corr}(W_\lambda, Z \mid Y)\|_F$ showing systematic reduction in conditional correlation as the invariance penalty increases.
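A curve of this kind can be reproduced at the population level from a joint covariance alone. The sketch below assumes a small hypothetical linear-Gaussian model for $(X_1, X_2, Z, Y)$, takes $D$ to be the covariance of $X$ with the regression residual of $Z$ on $Y$, and uses a single feature ($d = 1$); none of these choices come from the paper's experiments.

```python
import numpy as np

# Hypothetical linear-Gaussian model: rows of L express (X1, X2, Z, Y)
# as combinations of four independent standard normal variables.
L = np.array([[1.0, 0.0, 0.5, 1.0],   # X1 = Y + noise
              [0.0, 1.0, 1.0, 0.0],   # X2 = Z + noise
              [0.0, 0.0, 1.0, 0.0],   # Z
              [0.0, 0.0, 0.5, 1.0]])  # Y = 0.5 Z + noise
Sigma = L @ L.T
ix, iz, iy = [0, 1], [2], [3]

def block(rows, cols):
    return Sigma[np.ix_(rows, cols)]

def cond_cov(a, b):
    """Cov(a, b | Y) for jointly Gaussian blocks (Schur complement)."""
    return block(a, b) - block(a, iy) @ np.linalg.solve(block(iy, iy), block(iy, b))

C = block(ix, iy)      # Sigma_{XY}
D = cond_cov(ix, iz)   # Cov(X, Z | Y), i.e. Cov(X, residual of Z on Y)

def conditional_corr_norm(lam, d=1):
    """Frobenius norm of Corr(W_lam, Z | Y) for W_lam = A^T X."""
    H = (1.0 - lam) * C @ C.T - lam * D @ D.T
    eigvals, eigvecs = np.linalg.eigh(H)
    A = eigvecs[:, np.argsort(eigvals)[::-1][:d]]   # top-d eigenvectors, (p, d)
    cov_wz = A.T @ D                                # Cov(W, Z | Y)
    var_w = np.diag(A.T @ cond_cov(ix, ix) @ A)     # Var(W | Y)
    var_z = np.diag(cond_cov(iz, iz))               # Var(Z | Y)
    corr = cov_wz / np.sqrt(np.outer(var_w, var_z))
    return float(np.linalg.norm(corr))              # Frobenius norm for 2-D input

lams = np.linspace(0.0, 1.0, 11)
curve = np.array([conditional_corr_norm(lam) for lam in lams])
print(np.round(curve, 3))
```

In this toy model the curve decreases with $\lambda$ and reaches zero at $\lambda = 1$, where the selected direction is orthogonal to $D$.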

Figure 4: Distribution of optimal $\lambda$, highlighting bimodal behavior with modes at OLS ($\lambda = 0$) and maximal invariance ($\lambda = 1$).

Implications and Extensions

This barycentric methodology introduces a principled framework for extracting invariant features with theoretical guarantees tied to optimal transport. It unifies several threads in causal representation learning, invariant risk minimization (IRM), and domain generalization by making conditional independence both rigorous and operational through the OTBP machinery.

Practical implications include:

  • Improved OOD generalization via principled removal of spurious correlations.
  • Applicability in scenarios with latent confounders and only surrogate variables.
  • Explicit optimization procedures amenable to closed-form solutions in linear Gaussian settings, and extensible to nonlinear/non-Gaussian cases via computational optimal transport algorithms.

Theoretically, this approach bypasses the need for explicit causal models or full observability of confounders, relying instead on independence structure and covariance geometry. Extensions to the framework are naturally envisioned: handling categorical contextual variables through simplex embeddings, incorporating multiple source environments for improved cross-validation and confounding mitigation, and moving beyond linear models via RKHS strategies and off-line computation of barycenters in general distributions.

The use of barycentric reduction offers a consistent generalization of Fisher-LDA in classification and a robust alternative to anchor regression for regression under environment shifts.

Conclusion

The presented methodology offers a mathematically rigorous and practically effective principle for invariant feature extraction in the presence of confounding factors. Through optimal transport barycenter theory and conditional independence, it enables robust transfer learning and predictive modeling across environments, with closed-form solutions in Gaussian cases and extensibility to more general regimes. The empirical and theoretical contributions lay foundations for future work in invariant conditional density estimation, nonlinear generalization, and domain-adaptive learning in complex data regimes.
