- The paper introduces algorithmic causality, showing that causal structures can emerge by selecting models based on data compression instead of traditional identifiability.
- It proposes a framework using Conditional Feature-Mechanism Programs (CFMPs) that combine probabilistic and feature mechanisms to represent causal factorizations and capture sparse mechanism shifts.
- Experiments reveal that minimizing Finite Codebook Complexity favors sparser, invariant models across environments, offering insights applicable to causal discovery and large-scale language models.
This paper explores the relationship between causality, symmetry, and compression, proposing a framework where causal structures emerge from compressing data across multiple environments, even when traditional causal identifiability assumptions are not met. The authors introduce "algorithmic causality" as an alternative definition of causality applicable in such scenarios.
The core idea is that learning and compression are deeply linked. While traditional causal discovery relies on identifiability (the ability to uniquely determine the causal graph from data), this often requires strong, sometimes unrealistic, assumptions. This paper investigates what can be said about causality when these assumptions fail, particularly in multi-environment settings where mechanisms might be shared or shift sparsely.
1. Identifiability, Compression, and Their Limitations
The paper first reviews the connection between identifiability in causal discovery and compression. Models under which the data attains maximal likelihood are exactly those whose distributions have minimal cross-entropy with the data. By Shannon's source coding theorem, minimal cross-entropy implies the shortest average code length for i.i.d. data. Identifiability research can thus be read as justifying compression (minimum cross-entropy) as the correct model selection method.
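To make the link explicit (a standard decomposition, stated here for completeness): the expected code length of data from the true distribution $P$ under a model $Q$ satisfies

$$\mathbb{E}_{x \sim P}[-\log Q(x)] = H(P) + D_{\mathrm{KL}}(P \,\|\, Q),$$

so minimizing cross-entropy is equivalent to minimizing the KL divergence to the data distribution, with the optimum $Q = P$ attaining the entropy bound $H(P)$.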
However, this framework has limitations:
- Hard Priors: Identifiability often requires strong assumptions (hard priors) about distribution classes or knowledge of intervention targets, restricting the model space.
- Intervention Knowledge: Many identifiability results depend on knowing intervention types or targets. Without this, multi-environment data is essentially correlational, limiting identifiability.
- Model Complexity: Cross-entropy only accounts for the data-to-model coding length, neglecting the "codebook length" or the complexity of the model itself. Two models computing the same distribution might have vastly different internal complexities.
2. Algorithmic Causality
To address these limitations, the paper introduces algorithmic causality. The intuition is that even if two models (e.g., Turing machines) produce the same probability distribution, their internal structure, and thus their description length (complexity), can differ. Compression that accounts for the total description length can then prefer one over the other.
A key concept is the Conditional Feature-Mechanism Program (CFMP), a class of Turing machines designed to compute probability distributions. A CFMP $\alpha$ operates in three steps:
- Generation: $\alpha$ generates a set of probabilistic mechanisms $\mathcal{P}_\alpha$ (Turing machines computing conditional probability maps, e.g., $(\text{value}, \text{condition}) \mapsto f(\text{value} \mid \text{condition})$) and feature mechanisms $\Phi_\alpha$ (Turing machines computing feature maps, e.g., $x \mapsto \phi(x)$).
- Featurization: $\alpha$ combines probabilistic mechanisms with feature mechanisms to create "featurized mechanisms." For example, a probabilistic mechanism $f$ and feature mechanisms $\phi, \psi$ can form $x \mapsto f(\phi(x) \mid \psi(x))$. This step allows for creating reusable components.
- Computation: For a given input $x$, $\alpha$ selects a sequence of featurized mechanisms and multiplies their outputs to compute $\mathbb{P}(x)$. If hidden variables are involved, it marginalizes over them.
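A minimal Python sketch of these three steps (the function names and the toy binary distribution are illustrative assumptions, not the paper's construction):

```python
# Illustrative sketch of a CFMP-style computation. A probabilistic mechanism
# maps (value, condition) -> probability; feature mechanisms preprocess inputs.

def feature_phi(x):           # feature mechanism: extract the target coordinate
    return x[1]

def feature_psi(x):           # feature mechanism: extract the conditioning coordinate
    return x[0]

def mechanism_f(value, cond): # probabilistic mechanism: P(value | cond) as a table
    table = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}
    return table[(value, cond)]

def featurized_mechanism(x):  # featurization: x -> f(phi(x) | psi(x))
    return mechanism_f(feature_phi(x), feature_psi(x))

def prob_x0(x):               # marginal mechanism for the first coordinate
    return {0: 0.5, 1: 0.5}[x[0]]

def cfmp_probability(x):      # computation: multiply featurized mechanism outputs
    return featurized_mechanism(x) * prob_x0(x)

print(cfmp_probability((0, 1)))  # P(X0=0) * P(X1=1 | X0=0) = 0.5 * 0.1 = 0.05
```

The point of the decomposition is that $f$, $\phi$, and $\psi$ are separate programs, so the same mechanism can be reused across factors or environments.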
Examples of models that can be represented as CFMPs include:
- Causal Bayesian Networks (CBNs)
- Context-Specific Bayesian Networks
- Causal Representation Learning models (involving hidden variables)
- G-invariant and G-equivariant learning models
- Statistical density estimators (where the entire joint distribution is a single mechanism)
Algorithmic Causality (Definition 12, informal): $X_i$ algorithmically causes $X_j$ if a model selection method (like compression) selects a Turing machine (e.g., a CFMP) that, for a given input, uses a subprogram of the form "If $X_i = \ldots$ then $X_j = \ldots$" (i.e., a featurized mechanism $f(\ldots X_j \ldots \mid \ldots X_i \ldots)$) and not the reverse. This is a property of the selected computational model, not just of the data distribution.
3. Learning Algorithmic Causality by Compression
The paper proposes selecting models based on the principle of minimizing the total bits needed to reconstruct the dataset (Principle 13). This leads to using Kolmogorov Complexity (KC), $C(x)$, the length of the shortest program for a universal Turing machine (UTM) to output $x$.
KC can be expressed as a two-part code (Lemma 16):

$$C_U(x) \approx \min_T \bigl( l_U(T) + C_U(x \mid T) \bigr),$$

where $l_U(T)$ is the length of the description of program $T$ (model complexity) and $C_U(x \mid T)$ is the length of $x$ given $T$ (data fit, often approximated by the negative log-likelihood).
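As a toy illustration of the two-part idea (this example is ours, not the paper's): a highly regular string is far cheaper to describe as a short program plus an empty residual than literally.

```python
# Two-part description of a regular string: (program text) + (data given program).
data = "ab" * 1000  # 2000 characters

literal_bits = 8 * len(data)                     # one-part: raw 8-bit encoding
model_bits = 8 * len('"ab" * 1000')              # two-part: the program text itself
residual_bits = 0                                # data is fully determined by the program
print(literal_bits, model_bits + residual_bits)  # 16000 vs 88 bits
```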
Since KC is uncomputable, the paper introduces Finite Codebook Complexity (FC Complexity) as a computable upper bound.
- A Finite Coding Mechanism (FCM) is a Turing machine computing a finite, prefix-free codebook.
- A Universal Finite Codebook Computer (UFCC) is a Turing machine $V$ that can simulate any FCM from a recursively enumerable set. $V$ takes an index $k$ (identifying an FCM $T_k$) and data $p$, and outputs $T_k(p)$.
- The FC Complexity of a dataset $D$ is $C^{\mathrm{FC}}_V(D) = \min_{k,p} \{\, l(\langle k, p \rangle) : V(\langle k, p \rangle) = D \,\}$.
This can be rewritten as a two-part objective:

$$C_U(D) \le C^{\mathrm{FC}}_V(D) + O(1) \approx \min_T \bigl( 2\, l_V(T) + l(p_D) \bigr),$$

where $T$ is an FCM simulated by $V$, $2\, l_V(T)$ is the self-delimiting code length of $T$'s description for $V$, and $l(p_D)$ is the length of the data $D$ encoded using $T$'s codebook (approximated by the negative log-likelihood if $T$ implements a Shannon code).
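A sketch of why the data term $l(p_D)$ is approximated by the negative log-likelihood (illustrative; the per-symbol Shannon-style code here is a simplification of the paper's Huffman-coding construction):

```python
import math

def code_length_bits(dataset, model_prob):
    """Total Shannon code length: each symbol x costs ceil(-log2 P(x)) bits,
    so the total stays within len(dataset) bits of the NLL in bits."""
    return sum(math.ceil(-math.log2(model_prob(x))) for x in dataset)

def nll_bits(dataset, model_prob):
    """Negative log-likelihood of the dataset, in bits."""
    return sum(-math.log2(model_prob(x)) for x in dataset)

# Hypothetical biased-coin model over {0, 1}.
model = lambda x: 0.8 if x == 0 else 0.2
data = [0, 0, 1, 0, 1, 0, 0, 0]
print(code_length_bits(data, model), round(nll_bits(data, model), 2))  # 12 6.58
```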
The choice of UFCC is crucial. A "good" UFCC should assign shorter description lengths $l_V(T)$ to "simpler" or more structured FCMs (e.g., those implementing causal factorizations or symmetries).
4. Case Studies: Emergence of Structure through Compression
The paper demonstrates how minimizing FC complexity under specific UFCCs leads to selecting models with causal or symmetric structures. The UFCCs considered simulate FCMs by composing a CFMP with a Huffman coding program.
- Causal Factorizations and Sparse Mechanism Shifts:
- Using a UFCC $U_{\mathrm{TabCBN}}$ (which assumes probabilistic mechanisms in CFMPs are stored as full tables and feature mechanisms are projections), it's shown that a CFMP representing a causal factorization (e.g., $\mathbb{P}(X,Y,E) = \mathbb{P}(Y|X)\,\mathbb{P}(X|E)\,\mathbb{P}(E)$) has a shorter model description length $l_{U_{\mathrm{TabCBN}}}(\alpha)$ than a CFMP $\beta$ that stores the entire joint distribution $\mathbb{P}(X,Y,E)$ as one large table (Proposition 17); see the parameter-counting sketch after this list. This implies that if the data can be well explained by such a factorization, the overall FC complexity will favor the factorized model.
- Using a UFCC $U_{\mathrm{CompCBN}}$ (which allows probabilistic mechanisms to be compressible, e.g., described by a few parameters rather than a full table), it's shown that CFMPs exhibiting sparse mechanism shifts (SMS) are preferred (Proposition 18). If only a few underlying mechanisms change across environments, a CFMP that reuses the shared mechanisms and encodes only the few changed ones has a smaller model description length. The objective balances reusing mechanisms (lower $l_V(T)$) against fitting the data perfectly (lower $l(p_D)$).
- Symmetries:
- Using a UFCC $U_{\mathrm{TabInv}}$ (which favors CFMPs that can use quotient maps $\phi_G : \mathcal{X} \to \mathcal{X}/G$ as feature mechanisms, corresponding to group invariances), it's shown that if a distribution $\mathbb{P}(X_1, X_2)$ exhibits $G$-invariance (e.g., $\mathbb{P}(X_1|X_2) = \mathbb{P}(X_1|\phi_G(X_2))$), then a CFMP $\alpha$ encoding this invariant factorization has a shorter model description length $l_{U_{\mathrm{TabInv}}}(\alpha)$ than a CFMP $\beta$ encoding the standard Markov factorization $\mathbb{P}(X_1|X_2)\,\mathbb{P}(X_2)$ (Proposition 19).
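The parameter-counting intuition behind Propositions 17 and 19 can be sketched as follows (the cardinalities and the table-based storage model are illustrative assumptions):

```python
def table_params(*cardinalities):
    """Number of entries in a full conditional probability table."""
    n = 1
    for c in cardinalities:
        n *= c
    return n

# Illustrative cardinalities |X| = |Y| = 10, |E| = 5.
X, Y, E = 10, 10, 5

# CFMP beta: one table for the full joint P(X, Y, E).
joint = table_params(X, Y, E)                                         # 500 entries

# CFMP alpha: causal factorization P(Y|X) P(X|E) P(E).
factored = table_params(Y, X) + table_params(X, E) + table_params(E)  # 155 entries

# G-invariance shrinks tables further: if P(X1|X2) = P(X1|phi_G(X2)) and the
# quotient X/G has, say, 3 elements, then P(X1|phi_G(X2)) needs |X1| * |X/G|
# entries instead of the |X1| * |X2| + |X2| of the Markov factorization.
invariant = table_params(X, 3) + table_params(X)                      # 40 vs 110

print(joint, factored, invariant)
```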
5. Experiments
Synthetic experiments illustrate these concepts:
- Covariate Shifts: Data is generated with $\mathbb{P}(X,Y,E) = \mathbb{P}(Y|X)\,\mathbb{P}(X|E)\,\mathbb{P}(E)$, where $\mathbb{P}(Y|X)$ is fixed and $\mathbb{P}(X|E)$ changes sparsely across environments $E$. Minimizing FC complexity (NLL plus a model-length penalty based on the number of unique $\mathbb{P}(X|E)$ mechanisms) correctly identifies or prefers models with fewer mechanisms (closer to the true sparsity) compared to minimizing NLL alone, especially with limited data. The selection loop is sketched below.
```python
# Pseudocode for model selection in the covariate-shift experiment: minimize
# two-part FC complexity (NLL + model description length). Assumes `data`,
# `max_mechanisms`, and the helpers calculate_model_length (penalty per
# Proposition 18) and find_best_assignment_and_nll (NLL of the best assignment
# of k mechanisms to environments) are defined elsewhere.
import math

best_fc_complexity = math.inf
best_model_config = None
for k_mechanisms in range(1, max_mechanisms + 1):
    # Model description length grows with the number of distinct mechanisms.
    model_length_penalty = calculate_model_length(k_mechanisms)
    # Best assignment of k_mechanisms to environments, measured by NLL.
    min_nll_for_k = find_best_assignment_and_nll(data, k_mechanisms)
    current_fc_complexity = min_nll_for_k + model_length_penalty
    if current_fc_complexity < best_fc_complexity:
        best_fc_complexity = current_fc_complexity
        best_model_config = k_mechanisms
# The selected model uses best_model_config distinct mechanisms.
```
- Causal Discovery without Identifiability: For a linear Gaussian system $X \to Y$ where parameters change across environments, the ground truth is not identifiable from likelihood alone (as $Y \to X$ can achieve the same likelihood). FC complexity, by penalizing the number of free parameters (mechanisms), can prefer a sparser model, demonstrating a selection criterion beyond traditional identifiability.
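A sketch of the bookkeeping behind this preference (the per-mechanism bit cost and the mechanism counts are illustrative assumptions, not the paper's exact penalty): in the causal direction only $\mathbb{P}(X|E)$ changes across environments, while after re-factorizing in the anticausal direction both factors generically change.

```python
def fc_penalty(num_distinct_mechanisms, bits_per_mechanism=32.0):
    """Illustrative model-length term: a fixed bit cost per distinct
    mechanism (e.g., a few Gaussian parameters each)."""
    return num_distinct_mechanisms * bits_per_mechanism

num_envs = 5

# Causal direction X -> Y: P(Y|X) shared, one P(X|E=e) per environment.
causal_mechanisms = 1 + num_envs
# Anticausal direction Y -> X: both P(Y|E=e) and P(X|Y, E=e) generically
# differ in every environment.
anticausal_mechanisms = num_envs + num_envs

# With equal likelihoods, the two-part objective prefers the causal model.
print(fc_penalty(causal_mechanisms), fc_penalty(anticausal_mechanisms))  # 192.0 320.0
```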
6. Discussion and Implications
The authors suggest that if compression automatically yields causal structure, this could have significant implications for LLMs. LLMs are trained by minimizing cross-entropy (a form of compression) on vast datasets. Even if classical identifiability assumptions don't hold, the pressure to compress might force LLMs to learn "algorithmic causal" models, internalizing reusable, composable mechanisms.
Algorithmic causality is positioned not as a replacement for Pearl's causality but as a complementary framework for scenarios with:
- Correlational multi-environment data.
- Uncertainty about intervention targets.
- Finite data where model complexity is a significant factor.
The paper concludes that algorithmic causality offers a novel perspective on how causal understanding might emerge in machine learning models through the fundamental process of compression. The appendix further discusses the relationship with Minimum Description Length (MDL) and Bayesian model selection, noting that FC complexity (a two-part code) is related to, but distinct from, the Bayes codes used in refined MDL. The two-part code objective is $\min_T \bigl(-\log \mathbb{Q}(T) - \log \mathbb{P}(x \mid T)\bigr)$, which amounts to maximizing the posterior $\mathbb{P}(T \mid x)$ when $\mathbb{Q}(T)$ is a prior over models and $\mathbb{P}(x \mid T)$ is the likelihood.
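Spelling out that equivalence (a standard Bayes-rule step, not specific to the paper):

$$\operatorname*{arg\,min}_T \bigl(-\log \mathbb{Q}(T) - \log \mathbb{P}(x \mid T)\bigr) = \operatorname*{arg\,max}_T \, \mathbb{Q}(T)\, \mathbb{P}(x \mid T) = \operatorname*{arg\,max}_T \, \mathbb{P}(T \mid x),$$

since $\mathbb{P}(T \mid x) \propto \mathbb{Q}(T)\, \mathbb{P}(x \mid T)$ by Bayes' rule; two-part code minimization is thus MAP model selection under the prior $\mathbb{Q}$.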
In essence, the paper provides a formal framework and initial evidence for how causal and symmetric structures might be learned implicitly as a consequence of efficient data compression, even in the absence of traditional identifiability conditions.