- The paper introduces CREATOR, a novel algorithm that recovers latent causal features in linear models under weaker assumptions.
- It employs a three-step approach (topological ordering, pruning, and feature disentanglement) to recover both the latent features and the causal graph.
- Experimental results demonstrate superior performance in latent feature and DAG recovery across synthetic datasets and LLM-generated data.
Linear Causal Representation Learning by Topological Ordering, Pruning, and Disentanglement
Introduction
The paper proposes CREATOR, an algorithm for causal representation learning (CRL) in linear models. Its goal is to disentangle a complex data-generating mechanism into causally interpretable latent features. Existing CRL approaches often impose stringent constraints on interventional data and noise distributions; CREATOR operates under weaker assumptions. In synthetic experiments and in an analysis of LLMs, CREATOR recovers the latent causal features effectively.
The CRL problem is defined for a linear structural causal model (SCM), assuming access to multi-environment data that share a latent causal structure. In environment $k$, the latent variables satisfy $y^{(k)} = W^{(k)\top} y^{(k)} + \Omega^{(k)} z^{(k)}$, and the observations are $x^{(k)} = H y^{(k)}$. Key assumptions include independent non-Gaussian noise and full-rank mixing matrices, which make the latent features and the causal DAG identifiable up to equivalence classes.
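The generative model above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's code: it assumes that $\Omega^{(k)}$ is diagonal (passed as a list of scales), that $W^{(k)}$ is strictly upper triangular so the latent nodes are already in topological order, and that the noise is uniform on $[-1, 1]$ as one example of a non-Gaussian distribution. The function name `sample_environment` and all numeric values are hypothetical.

```python
import random

def sample_environment(W, omega, H, n_samples, rng):
    """Draw (x, y) pairs from y = W^T y + Omega z and x = H y.

    W[j][i] != 0 encodes an edge j -> i. W is assumed strictly upper
    triangular, so y can be filled in by forward substitution.
    omega holds the diagonal of Omega (an assumption of this sketch).
    """
    d = len(W)
    samples = []
    for _ in range(n_samples):
        # Independent non-Gaussian noise: uniform on [-1, 1].
        z = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        y = [0.0] * d
        for i in range(d):
            # (W^T y)_i only involves nodes j < i, which are already set.
            parent_part = sum(W[j][i] * y[j] for j in range(i))
            y[i] = parent_part + omega[i] * z[i]
        # Linear mixing into the observed space; H is shared across environments.
        x = [sum(H[r][c] * y[c] for c in range(d)) for r in range(len(H))]
        samples.append((x, y))
    return samples

# Two environments with the same graph 0 -> 1 -> 2 but different weights and noise scales.
rng = random.Random(0)
H = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0], [1.0, 1.0, 1.0]]
W_envs = [
    [[0.0, 0.8, 0.0], [0.0, 0.0, -0.5], [0.0, 0.0, 0.0]],
    [[0.0, 1.2, 0.0], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]],
]
omega_envs = [[1.0, 1.0, 1.0], [0.5, 2.0, 1.5]]
data = [sample_environment(W, om, H, 5, rng) for W, om in zip(W_envs, omega_envs)]
```

Only the observations `x` would be available to a CRL method; the latent `y` is returned here so that recovered features can be compared against the ground truth.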
CREATOR Algorithm
CREATOR consists of three main subroutines:
- Topological Ordering and Feature Recovery: This involves inferring a causal ordering and recovering latent features up to a particular equivalence. It uses independent component analysis (ICA) to identify root nodes and iteratively estimates latent variables.
- Pruning: This step refines the initially dense DAG by evaluating rank differences, thereby identifying and removing spurious edges.
- Feature Disentanglement: This final step refines latent features further, utilizing the pruned DAG to ensure features align with the true causal structure.
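The relationship between the first two subroutines can be illustrated on graphs alone. The sketch below assumes only what the list states: a topological ordering yields an initially dense DAG, and pruning removes spurious edges from it. The paper's rank-difference test is not reproduced; `is_spurious` is a hypothetical stand-in, implemented here as an oracle that knows the true edges.

```python
from itertools import combinations

def dense_dag(order):
    """Return every edge from an earlier node to a later node in `order`.

    This is the densest DAG compatible with the topological ordering.
    """
    return {(order[i], order[j]) for i, j in combinations(range(len(order)), 2)}

def prune(edges, is_spurious):
    """Keep only the edges that the supplied test does not flag as spurious."""
    return {edge for edge in edges if not is_spurious(edge)}

# Hypothetical example: true graph 2 -> 0 -> 1, ordering [2, 0, 1].
true_edges = {(2, 0), (0, 1)}
order = [2, 0, 1]
dense = dense_dag(order)
# Oracle stand-in for the rank-based test (an assumption of this sketch).
pruned = prune(dense, lambda edge: edge not in true_edges)
```

The dense graph contains the extra edge `(2, 1)`, which is consistent with the ordering but absent from the true graph; pruning removes it.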
Figure 1: An illustration of subroutine 1. Dashed nodes and edges are eliminated.
Experimental Validation
Synthetic Experiments
CREATOR is evaluated against LiNGCReL on synthetic datasets with varying latent dimensions and numbers of environments. Performance is measured with the LocR2 metric for latent feature recovery and the structural Hamming distance (SHD) for DAG recovery. CREATOR outperforms LiNGCReL on both metrics across the settings tested.
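For readers unfamiliar with SHD, the sketch below computes it for two DAGs given as 0/1 adjacency matrices. It assumes one common convention, in which a missing, extra, or reversed edge each counts as a single error; the paper may use a different variant, and the function name `shd` is hypothetical.

```python
def shd(true_adj, est_adj):
    """Structural Hamming distance between two directed graphs.

    adj[i][j] == 1 means an edge i -> j. Each unordered node pair whose
    edge status differs (missing, extra, or reversed) adds one to the distance.
    """
    d = len(true_adj)
    distance = 0
    for i in range(d):
        for j in range(i + 1, d):
            true_pair = (true_adj[i][j], true_adj[j][i])
            est_pair = (est_adj[i][j], est_adj[j][i])
            if true_pair != est_pair:
                distance += 1
    return distance

# True graph: 0 -> 1 -> 2. Estimate: 0 -> 1, 2 -> 1 (reversed), 0 -> 2 (extra).
true_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
est_adj = [[0, 1, 1], [0, 0, 0], [0, 1, 0]]
print(shd(true_adj, est_adj))  # prints 2
```

Lower values are better, and a distance of zero means the estimated DAG matches the true one exactly.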
Figure 2: LocR2 and SHD metrics for different data-generation setups.
Real-World Application
The algorithm is also applied to study the latent causal mechanisms of LLMs. Using CRL under a linear representation hypothesis, CREATOR identifies causal structures in generated story datasets. This is an early, promising step toward better LLM interpretability.
Conclusion
CREATOR offers a framework for linear CRL that requires fewer assumptions than existing approaches, which broadens its applicability to domains that need causal interpretability. It outperforms LiNGCReL in the synthetic experiments, and the LLM study suggests it can help make AI models more transparent. Further research may explore extensions to nonlinear models and to diverse data modalities.
Figure 3: The impact of topological ordering inference on the performance of CREATOR.