Linear Causal Representation Learning by Topological Ordering, Pruning, and Disentanglement

Published 26 Sep 2025 in stat.ML and cs.LG | (2509.22553v1)

Abstract: Causal representation learning (CRL) has garnered increasing interest from the causal inference and artificial intelligence communities, due to its capability of disentangling potentially complex data-generating mechanisms into causally interpretable latent features, by leveraging the heterogeneity of modern datasets. In this paper, we further contribute to the CRL literature, by focusing on the stylized linear structural causal model over the latent features and assuming a linear mixing function that maps latent features to the observed data or measurements. Existing linear CRL methods often rely on stringent assumptions, such as accessibility to single-node interventional data or restrictive distributional constraints on latent features and exogenous measurement noise. However, these prerequisites can be challenging to satisfy in certain scenarios. In this work, we propose a novel linear CRL algorithm that, unlike most existing linear CRL methods, operates under weaker assumptions about environment heterogeneity and data-generating distributions while still recovering latent causal features up to an equivalence class. We further validate our new algorithm via synthetic experiments and an interpretability analysis of LLMs, demonstrating both its superiority over competing methods in finite samples and its potential in integrating causality into AI.

Summary

  • The paper introduces CREATOR, a novel algorithm that recovers latent causal features in linear models under weaker assumptions.
  • It employs a three-step approach—topological ordering, pruning, and feature disentanglement—to ensure accurate causal structure recovery.
  • Experimental results demonstrate superior performance in latent feature and DAG recovery across synthetic datasets and LLM-generated data.

Introduction

The paper proposes a novel algorithm, CREATOR, designed to address the challenge of causal representation learning (CRL) in linear models. It aims to disentangle complex data-generating mechanisms into causally interpretable latent features. This algorithm stands out by operating under weaker assumptions compared to existing CRL approaches, which often impose stringent constraints on interventional data and noise distributions. Through synthetic experiments and analysis of LLMs, CREATOR demonstrates effective recovery of latent causal features.

Problem Formulation and Assumptions

The CRL problem is defined within the context of a linear structural causal model (SCM), assuming access to multi-environment data with shared latent causal structures. The model is linear, described as $y^{(k)} = W^{(k)\top} y^{(k)} + \Omega^{(k)} z^{(k)}$ and $x^{(k)} = H y^{(k)}$, where $k$ indexes the environment. Key assumptions include independent non-Gaussian noise and full-rank mixing matrices, which facilitate identifiability of latent features and causal DAGs up to equivalence classes.
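To make the data-generating model concrete, the following is a minimal simulation sketch and not the paper's code. It assumes that the latent nodes are indexed in a topological order (so W is strictly upper triangular), that Ω is diagonal, and that the noise is uniform on [-1, 1] as one arbitrary non-Gaussian choice. The function name `sample_environment` and all numeric values are hypothetical.

```python
import random

def sample_environment(W, omega, H, n_samples, rng):
    """Draw samples from y = W^T y + Omega z and x = H y for one environment.

    Assumptions (not taken from the paper): W[j][i] != 0 encodes an edge
    j -> i, nodes are indexed in topological order, and `omega` holds the
    diagonal of a diagonal Omega.
    """
    d = len(W)
    xs, ys = [], []
    for _ in range(n_samples):
        # Independent non-Gaussian exogenous noise (uniform is an arbitrary choice).
        z = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        y = [0.0] * d
        for i in range(d):
            # (W^T y)_i = sum_j W[j][i] * y[j]; only earlier nodes can be parents.
            y[i] = sum(W[j][i] * y[j] for j in range(i)) + omega[i] * z[i]
        # Linear mixing shared by all environments: x = H y.
        x = [sum(H[r][c] * y[c] for c in range(d)) for r in range(len(H))]
        ys.append(y)
        xs.append(x)
    return xs, ys

# Shared mixing matrix with full column rank (4 observed, 3 latent dimensions).
H = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]]

# Two environments over the same chain 0 -> 1 -> 2; weights and noise scales differ.
environments = {
    1: {"W": [[0.0, 0.8, 0.0], [0.0, 0.0, -0.5], [0.0, 0.0, 0.0]],
        "omega": [1.0, 0.5, 2.0]},
    2: {"W": [[0.0, -1.2, 0.0], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]],
        "omega": [0.3, 1.5, 1.0]},
}

rng = random.Random(0)
data = {k: sample_environment(env["W"], env["omega"], H, 1000, rng)
        for k, env in environments.items()}
```

Only the observations `xs` would be available to a CRL method; the latent samples `ys` are returned here so that recovery can be checked against ground truth.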

CREATOR Algorithm

CREATOR consists of three main subroutines:

  1. Topological Ordering and Feature Recovery: This involves inferring a causal ordering and recovering latent features up to a particular equivalence. It uses independent component analysis (ICA) to identify root nodes and iteratively estimates latent variables.
  2. Pruning: This step refines the initially dense DAG by evaluating rank differences, thereby identifying and removing spurious edges.
  3. Feature Disentanglement: This final step refines latent features further, utilizing the pruned DAG to ensure features align with the true causal structure.

    Figure 1: An illustration of subroutine 1. Dashed nodes and edges are eliminated.
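The relationship between the first two subroutines can be sketched as simple bookkeeping on edge sets. This is only an illustration under stated assumptions: the summary does not specify the rank-difference test, so it is replaced here by a caller-supplied placeholder predicate. The names `dense_dag`, `prune`, and `is_spurious` are hypothetical.

```python
def dense_dag(order):
    """Return every edge (parent, child) allowed by a topological order.

    This is the fully connected DAG consistent with the order, which is the
    starting point that a pruning step would then thin out.
    """
    return {(order[a], order[b])
            for a in range(len(order))
            for b in range(a + 1, len(order))}

def prune(edges, is_spurious):
    """Drop edges flagged by `is_spurious`.

    `is_spurious` is a placeholder for a statistical test (the summary
    mentions rank differences, whose details are not reproduced here).
    """
    return {edge for edge in edges if not is_spurious(edge)}

# Hypothetical example: order 0, 1, 2 with the edge 0 -> 2 flagged as spurious.
order = [0, 1, 2]
initial_edges = dense_dag(order)
pruned_edges = prune(initial_edges, lambda edge: edge == (0, 2))
```

Because every candidate edge points from an earlier node to a later node in the order, removing edges can never introduce a cycle, so the pruned graph remains a DAG.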

Experimental Validation

Synthetic Experiments

CREATOR is evaluated against LiNGCReL on synthetic datasets with varying latent dimensions and numbers of environments. Performance is measured using the LocR$^2$ and SHD metrics. Results indicate CREATOR's superior performance in both latent feature and DAG recovery across different settings.

Figure 2: LocR$^2$ and SHD metrics for different data generation setups.
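SHD usually refers to the structural Hamming distance between a true and an estimated graph. The sketch below uses one common convention, in which a missing edge, an extra edge, and a reversed edge each count as one error. The paper's exact convention is not stated in this summary and may differ, and the function name `shd` is a hypothetical choice.

```python
def shd(true_adj, est_adj):
    """Structural Hamming distance between two directed graphs.

    adj[i][j] truthy means an edge i -> j. Each unordered node pair whose
    edge status differs (missing, extra, or reversed) adds one to the
    distance. This is one common convention, assumed here.
    """
    d = len(true_adj)
    distance = 0
    for i in range(d):
        for j in range(i + 1, d):
            true_pair = (bool(true_adj[i][j]), bool(true_adj[j][i]))
            est_pair = (bool(est_adj[i][j]), bool(est_adj[j][i]))
            if true_pair != est_pair:
                distance += 1
    return distance

# True graph: chain 0 -> 1 -> 2.
true_graph = [[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]]

# Estimate: keeps 0 -> 1, reverses 1 -> 2, and adds a spurious 0 -> 2.
estimated_graph = [[0, 1, 1],
                   [0, 0, 0],
                   [0, 1, 0]]

errors = shd(true_graph, estimated_graph)  # one reversal plus one extra edge
```

Lower values indicate better DAG recovery, with zero meaning the two graphs are identical.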

Real-World Application

The algorithm is additionally applied to study the latent causal mechanisms of LLMs. By leveraging CRL under a linear representation hypothesis, CREATOR successfully identifies causal structures in generated story datasets. This serves as a promising exploration for enhancing LLM interpretability.

Conclusion

CREATOR offers a robust framework for linear CRL with fewer assumptions, extending its applicability across domains requiring causal interpretability. The algorithm outperforms existing methods in synthetic and real-world settings, hinting at potential advancements in AI model transparency and understanding. Further research may explore extensions to nonlinear models and applications in real-world scenarios involving diverse data modalities.

Figure 3: The impact of topological ordering inference on the performance of CREATOR.

Authors (3)
