Beyond identifiability: Learning causal representations with few environments and finite samples

Published 26 Mar 2026 in stat.ML, cs.AI, cs.LG, and math.ST | (2603.25796v1)

Abstract: We provide explicit, finite-sample guarantees for learning causal representations from data with a sublinear number of environments. Causal representation learning seeks to provide a rigorous foundation for the general representation learning problem by bridging causal models with latent factor models in order to learn interpretable representations with causal semantics. Despite a blossoming theory of identifiability in causal representation learning, estimation and finite-sample bounds are less well understood. We show that causal representations can be learned with only a logarithmic number of unknown, multi-node interventions, and that the intervention targets need not be carefully designed in advance. Through a careful perturbation analysis, we provide a new analysis of this problem that guarantees consistent recovery of (a) the latent causal graph, (b) the mixing matrix and representations, and (c) unknown intervention targets.


Authors (3)

Summary

The paper introduces finite-sample guarantees enabling exact recovery of latent causal structures using only logarithmically many interventions and covariance data.
It adopts a likelihood-free, projection-based eigen-counting methodology that robustly estimates intervention targets, mixing matrices, and the latent graph in high dimensions.
Finite-sample statistical bounds and perturbation analysis ensure recovery accuracy, paving the way for practical applications in computational biology and out-of-distribution generalization.

Explicit Finite-Sample Guarantees in Causal Representation Learning with Few Environments

Introduction

The paper "Beyond identifiability: Learning causal representations with few environments and finite samples" (2603.25796) presents a rigorous statistical analysis of causal representation learning (CRL) in high-dimensional linear factor models. The central contribution is the development of non-asymptotic, finite-sample guarantees for the estimation of latent causal structure from observational data and a sublinear number of interventional environments, explicitly logarithmic in the latent dimension. The work is situated at the intersection of causal inference, latent variable modeling, and statistical machine learning, addressing the principal challenge of reliably learning interpretable, causally meaningful representations in the presence of finite sample sizes and unknown multi-target interventions.

Background and Problem Setting

CRL aims to uncover structured latent representations that align with an underlying causal model, supporting interpretability, interventional robustness, and out-of-distribution generalization. The paper considers the standard linear structural equation model (SEM) for latent variables $Z$ and observed variables $X = BZ$, with a mixing/decoder matrix $B$. Identification of the mixing matrix $B$, the latent graph $G$, and intervention targets $I(k)$ is nontrivial due to the ambiguities inherent in linear latent models and the compounded challenges arising from hidden structure and multi-environment data.
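To make the setting concrete, the following is a minimal sketch of the population covariances this model produces, assuming the latent SEM is written as $Z = AZ + \varepsilon$ with a strictly lower-triangular weight matrix $A$ and that an intervention inflates the noise variance of its targets. The dimensions, the weight range, the variance shift, and the names `population_covariance`, `A`, `base_var`, and `targets` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def population_covariance(B, A, noise_var):
    """Covariance of X = B Z when Z = A Z + eps and Var(eps) = diag(noise_var)."""
    d = A.shape[0]
    # Z = (I - A)^{-1} eps, so X = B (I - A)^{-1} eps.
    M = B @ np.linalg.inv(np.eye(d) - A)
    return M @ np.diag(noise_var) @ M.T

rng = np.random.default_rng(0)
d, p = 4, 7  # hypothetical latent and observed dimensions

# Strictly lower-triangular weights give an acyclic latent graph.
A = np.tril(rng.uniform(0.5, 1.0, size=(d, d)), k=-1)
B = rng.standard_normal((p, d))  # mixing matrix (full column rank almost surely)

base_var = np.ones(d)
targets = [1, 3]          # multi-node intervention, unknown to the learner
int_var = base_var.copy()
int_var[targets] += 2.0   # assumed intervention: inflate the targets' noise variance

Sigma_obs = population_covariance(B, A, base_var)
Sigma_int = population_covariance(B, A, int_var)

# Under these assumptions the covariance shift has rank equal to the number of targets.
rank_diff = np.linalg.matrix_rank(Sigma_int - Sigma_obs, tol=1e-8)
print(rank_diff, len(targets))
```

In this toy setup the covariance shift between environments carries information about which latent nodes were perturbed, even though neither $Z$ nor the targets are observed.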

Previous identifiability results have established thresholds (e.g., $\Omega(d)$ environments for single-target interventions) but have not addressed the finite-sample consistency of estimation in practical settings, particularly when the number of environments is sublinear and the intervention designs are unknown and multi-targeted. Notably, prior work often focuses on information-theoretic identifiability, not on constructively achieving estimation guarantees with realistic sample sizes and high ambient dimension.

Methodology

The authors propose a likelihood-free estimation procedure exploiting only second-order statistics (sample covariances), which bypasses restrictive distributional assumptions (e.g., Gaussianity, independence) and restrictive structural assumptions (e.g., sparsity, pure child conditions) common in traditional factor analysis and CRL. The core steps are:

Recovery of Unknown Intervention Targets: The intervention design matrix is reconstructed using intersections of column spaces of environment-specific covariance matrices, characterized by dimension functions $g(T)$ computed via projection-based eigen-counting. This enables the identification of the set of intervened nodes in each environment from only covariance data.
Estimation of Mixing Matrix $B$: Having identified the intervention targets, the column space intersections further allow the recovery of each column of $B$ explicitly (up to permutation and scale), exploiting the structure induced by the interventions.
Learning the Latent Causal Graph $G$: Given estimators for $B$ and the intervention structure, the latent graph is recovered via analysis of generalized eigenvalue problems constructed from pairs of sample covariances associated with noise interventions. The zero pattern of the solution encodes the edges of the estimated graph.
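The first two steps rest on measuring intersections of column spaces. A minimal sketch of one way to do this by eigen-counting is given below. It averages the orthogonal projectors onto each column space and counts eigenvalues close to 1, using the fact that a unit vector has eigenvalue 1 under the averaged projector exactly when it lies in every subspace. This averaging variant is a swapped-in illustration, not necessarily the paper's exact construction, and the tolerance `tau`, the helper names, and the toy matrix `M` are assumptions.

```python
import numpy as np

def column_space_projector(A, rel_tol=1e-10):
    """Orthogonal projector onto the column space of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > rel_tol * s[0]))  # numerical rank
    return U[:, :r] @ U[:, :r].T

def intersection_basis(mats, tau=1e-6):
    """Orthonormal basis of the intersection of the column spaces of `mats`.

    Eigenvectors of the averaged projector with eigenvalue above 1 - tau
    span the intersection; the number of columns is its dimension.
    """
    P_avg = sum(column_space_projector(A) for A in mats) / len(mats)
    evals, evecs = np.linalg.eigh(P_avg)
    return evecs[:, evals > 1.0 - tau]

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 5))  # toy matrix with linearly independent columns

# Hypothetical target sets for three environments.
S1, S2, S3 = [0, 1, 2], [1, 2, 4], [2, 3]

# Dimension of the pairwise intersection equals |S1 ∩ S2| here.
g_12 = intersection_basis([M[:, S1], M[:, S2]]).shape[1]

# The three-way intersection is one-dimensional and spanned by column 2.
basis_123 = intersection_basis([M[:, S1], M[:, S2], M[:, S3]])
g_123 = basis_123.shape[1]
direction = basis_123[:, 0]
cosine = abs(direction @ M[:, 2]) / np.linalg.norm(M[:, 2])
print(g_12, g_123, cosine)
```

The recovered `direction` matches column 2 of `M` only up to sign and scale, which mirrors the "up to permutation and scale" qualifier in the mixing-matrix step. With noisy sample covariances the fixed `tau` would have to be replaced by a threshold calibrated to the estimation error.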

Critically, the authors provide explicit computational recipes for each stage, requiring only routine linear algebra operations. The approach is robust to ill-conditioning and noise, with thresholds for projection-based eigen-counting and graph recovery that are calibrated based on finite-sample perturbation results.

Theoretical Results

Identifiability

The paper provides a formal identifiability theorem showing that causal representations, decoder, and intervention targets are recoverable up to scale and permutation under only logarithmically many (in the latent dimension) unknown, multi-node interventions. This matches information-theoretic lower bounds and requires no pre-specification of intervention design.
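For intuition on why a logarithmic number of multi-node environments can carry enough information, consider the purely combinatorial sketch below. It assumes that what matters is that every latent node has a distinct, nonempty pattern of environment memberships. This separating-design condition is an illustrative assumption, not the paper's stated identifiability condition, and `binary_design` is a hypothetical helper.

```python
import numpy as np

def binary_design(d):
    """Boolean design of shape (K, d): entry (k, j) is True when
    environment k intervenes on node j, using the binary code of j + 1."""
    K = int(d).bit_length()          # number of bits needed for codes 1..d
    codes = np.arange(1, d + 1)      # nonzero codes, so every node is hit at least once
    bits = (codes[None, :] >> np.arange(K)[:, None]) & 1
    return bits.astype(bool)

d = 10
design = binary_design(d)
num_envs = design.shape[0]
patterns = {tuple(col) for col in design.T}
print(num_envs, len(patterns))
```

Each environment here targets roughly half of the nodes at once, so single-target designs are not needed to tell the nodes apart.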

Finite-Sample Statistical Guarantees

The authors establish uniform high-probability bounds for all stages of the estimation pipeline:

Recovery of Intervention Targets: The plug-in estimator for intervention indices achieves exact recovery with high probability, under a mild regularity condition (Condition (A3)) controlling the ill-conditioning of $B$. The threshold for projection-based eigen-counting is derived from explicit perturbation bounds on the intersection of estimated column spaces.
Mixing Matrix Estimation: The estimator of $B$ attains an explicit Frobenius-norm error rate, up to diagonal scaling, that depends on the maximum environment support size and on an eigengap parameter. This rate is uniform over all columns of $B$.
Latent Causal Graph Recovery: The graph estimator achieves exact edge recovery with high probability, as long as the minimum nonzero edge weight exceeds an explicit, sample-dependent threshold. Under bounded eigenvalues of the latent covariance, the edge-recovery threshold and the associated lower bound on the minimum edge weight scale at the same explicit rate.

These results are derived via a perturbation analysis leveraging the spectral properties of products of projection matrices, which control the stability of column space intersections and generalized eigenvector analysis under noisy, finite-sample covariance estimates.
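The role of the minimum edge-weight condition in the graph-recovery guarantee can be illustrated with a generic thresholding argument: if every entry of an estimated weight matrix is within `eps` of the truth, the threshold `tau` exceeds `eps`, and every true nonzero weight exceeds `tau + eps` in magnitude, then thresholding recovers the edge set exactly. The sketch below is a toy instance of that argument, not the paper's estimator; `W_true`, `eps`, `tau`, and `threshold_graph` are hypothetical.

```python
import numpy as np

def threshold_graph(W_hat, tau):
    """Boolean adjacency matrix: an edge is declared where |W_hat| > tau."""
    return np.abs(W_hat) > tau

# Hypothetical true latent weights (strictly lower triangular, hence acyclic).
W_true = np.array([
    [0.0,  0.0, 0.0],
    [0.8,  0.0, 0.0],
    [0.0, -0.6, 0.0],
])

rng = np.random.default_rng(1)
eps = 0.1   # assumed uniform bound on the entrywise estimation error
tau = 0.2   # threshold chosen so that eps < tau < min|w| - eps
W_hat = W_true + rng.uniform(-eps, eps, size=W_true.shape)

A_hat = threshold_graph(W_hat, tau)
exact = np.array_equal(A_hat, W_true != 0)
print(exact)
```

As the sample size grows and `eps` shrinks, smaller edge weights become recoverable, which is the sense in which the threshold is sample-dependent.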

Notably, the paper demonstrates that exact recovery of both latent structure and mixing matrix is possible with only a logarithmic number of environments and a polynomially small sample size per environment. This stands in contrast to prior results requiring linearly many environments and/or asymptotic analysis.

Implications and Connections

The presented methodology and theory advance the state of the art in statistical CRL by providing, for the first time, uniform finite-sample guarantees in the high-dimensional, sublinear-environment regime with unknown intervention design. This directly supports applications in computational biology (e.g., combinatorial CRISPR screening), computer vision (concept interventions), and any domain where simultaneous perturbation of latent factors is plausible but the intervention mapping is not fully annotated.

The approach also relaxes several classic assumptions:

No distributional restrictions on noise or latent structure.
The decoder $B$ is allowed to be ill-conditioned, subject to explicit, sample-size dependent lower bounds.
No sparsity or pure-child structural constraint is required for identifiability.

The authors' projection-based eigen-counting analysis also provides a template technical device for future work on finite-sample properties of other latent variable models with combinatorial intervention structures.

Limitations and Future Directions

The results are currently confined to the linear SEM setting. Extending this framework to nonlinear generative models is a substantial open problem, as the algebraic structure enabling covariance-based recovery does not persist in the nonlinear case. Further, while logarithmically many environments suffice, the method's practical sample requirements and sensitivity to various forms of ill-conditioning and noise heterogeneity warrant further empirical and theoretical investigation. There are also natural questions about robustness to model misspecification, violations of independence, and more general forms of intervention, including continuous interventions and soft interventions.

Broader integration with contemporary disentangled representation learning, as well as connections to invariant risk minimization and out-of-distribution generalization, are promising avenues for continued research based on the statistical guarantees established here.

Conclusion

This paper makes a comprehensive and technically substantial contribution to the statistical foundation of causal representation learning by establishing finite-sample guarantees for the recovery of latent causal structure, mixing matrices, and intervention designs with only a logarithmic number of environments. The projection-based eigen-counting approach enables consistent estimation in regimes previously out of reach, bridging the gap between information-theoretic identifiability and practical estimation. These insights pave the way for statistically grounded CRL in high-dimensional settings and pose sharp questions for nonlinear extensions and new theoretical developments.
