- The paper introduces algorithmic causality, showing that causal structures can emerge by selecting models based on data compression instead of traditional identifiability.
- It proposes a framework using Conditional Feature-Mechanism Programs (CFMPs) that combine probabilistic and feature mechanisms to represent causal factorizations and capture sparse mechanism shifts.
- Experiments reveal that minimizing Finite Codebook Complexity favors sparser, invariant models across environments, offering insights applicable to causal discovery and large-scale language models.
This paper explores the relationship between causality, symmetry, and compression, proposing a framework where causal structures emerge from compressing data across multiple environments, even when traditional causal identifiability assumptions are not met. The authors introduce "algorithmic causality" as an alternative definition of causality applicable in such scenarios.
The core idea is that learning and compression are deeply linked. While traditional causal discovery relies on identifiability (the ability to uniquely determine the causal graph from data), this often requires strong, sometimes unrealistic, assumptions. This paper investigates what can be said about causality when these assumptions fail, particularly in multi-environment settings where mechanisms might be shared or shift sparsely.
1. Identifiability, Compression, and Their Limitations
The paper first reviews the connection between identifiability in causal discovery and compression. Models under which the data attains maximal likelihood are exactly those whose distributions have minimal cross-entropy with the data. By Shannon's source coding theorem, minimal cross-entropy implies the shortest average code length for i.i.d. data. Identifiability research can thus be read as justifying compression (minimum cross-entropy) as the correct model selection method.
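To make the link explicit (a standard decomposition, stated here for completeness): the expected code length of data from the true distribution $P$ under a model $Q$ satisfies

$$\mathbb{E}_{x \sim P}[-\log Q(x)] = H(P) + D_{\mathrm{KL}}(P \,\|\, Q),$$

so minimizing cross-entropy is equivalent to minimizing the KL divergence to the data distribution, with the optimum $Q = P$ attaining the entropy bound $H(P)$.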
However, this framework has limitations:
- Hard Priors: Identifiability often requires strong assumptions (hard priors) about distribution classes or knowledge of intervention targets, restricting the model space.
- Intervention Knowledge: Many identifiability results depend on knowing intervention types or targets. Without this, multi-environment data is essentially correlational, limiting identifiability.
- Model Complexity: Cross-entropy only accounts for the data-to-model coding length, neglecting the "codebook length" or the complexity of the model itself. Two models computing the same distribution might have vastly different internal complexities.
2. Algorithmic Causality
To address these limitations, the paper introduces algorithmic causality. The intuition is that even if two models (e.g., Turing machines) produce the same probability distribution, their internal structure, and thus their description length (complexity), can differ. Compression that accounts for the total description length can then prefer one over the other.
A key concept is the Conditional Feature-Mechanism Program (CFMP), a class of Turing machines designed to compute probability distributions. A CFMP $\alpha$ operates in three steps:
- Generation: $\alpha$ generates a set of probabilistic mechanisms $\mathcal{P}_\alpha$ (Turing machines computing conditional probability maps, e.g., $(\text{value}, \text{condition}) \mapsto f(\text{value} \mid \text{condition})$) and feature mechanisms $\Phi_\alpha$ (Turing machines computing feature maps, e.g., $x \mapsto \phi(x)$).
- Featurization: $\alpha$ combines probabilistic mechanisms with feature mechanisms to create "featurized mechanisms." For example, a probabilistic mechanism $f$ and feature mechanisms $\phi, \psi$ can form $x \mapsto f(\phi(x) \mid \psi(x))$. This step allows for creating reusable components.
- Computation: For a given input $x$, $\alpha$ selects a sequence of featurized mechanisms and multiplies their outputs to compute $\mathbb{P}(x)$. If hidden variables are involved, it marginalizes over them.
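A minimal Python sketch of these three steps (the function names and the toy binary distribution are illustrative assumptions, not the paper's construction):

```python
# Illustrative sketch of a CFMP-style computation. A probabilistic mechanism
# maps (value, condition) -> probability; feature mechanisms preprocess inputs.

def feature_phi(x):           # feature mechanism: extract the target coordinate
    return x[1]

def feature_psi(x):           # feature mechanism: extract the conditioning coordinate
    return x[0]

def mechanism_f(value, cond): # probabilistic mechanism: P(value | cond) as a table
    table = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}
    return table[(value, cond)]

def featurized_mechanism(x):  # featurization: x -> f(phi(x) | psi(x))
    return mechanism_f(feature_phi(x), feature_psi(x))

def prob_x0(x):               # marginal mechanism for the first coordinate
    return {0: 0.5, 1: 0.5}[x[0]]

def cfmp_probability(x):      # computation: multiply featurized mechanism outputs
    return featurized_mechanism(x) * prob_x0(x)

print(cfmp_probability((0, 1)))  # P(X0=0) * P(X1=1 | X0=0) = 0.5 * 0.1 = 0.05
```

The point of the decomposition is that $f$, $\phi$, and $\psi$ are separate programs, so the same mechanism can be reused across factors or environments.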
Examples of models that can be represented as CFMPs include:
- Causal Bayesian Networks (CBNs)
- Context-Specific Bayesian Networks
- Causal Representation Learning models (involving hidden variables)
- G-invariant and G-equivariant learning models
- Statistical density estimators (where the entire joint distribution is a single mechanism)
Algorithmic Causality (Definition 12, informal): $X_i$ algorithmically causes $X_j$ if a model selection method (like compression) selects a Turing machine (e.g., a CFMP) that, for a given input, uses a subprogram of the form "If $X_i = \ldots$ then $X_j = \ldots$" (i.e., a featurized mechanism $f(\ldots X_j \ldots \mid \ldots X_i \ldots)$) and not the reverse. This is a property of the selected computational model, not just of the data distribution.
3. Learning Algorithmic Causality by Compression
The paper proposes selecting models based on the principle of minimizing the total bits needed to reconstruct the dataset (Principle 13). This leads to using Kolmogorov Complexity (KC), $C(x)$, the length of the shortest program for a universal Turing machine (UTM) to output $x$.
KC can be expressed as a two-part code (Lemma 16):

$$C_U(x) \approx \min_T \bigl( l_U(T) + C_U(x \mid T) \bigr),$$

where $l_U(T)$ is the length of the description of program $T$ (model complexity) and $C_U(x \mid T)$ is the length of $x$ given $T$ (data fit, often approximated by the negative log-likelihood).
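As a toy illustration of the two-part idea (this example is ours, not the paper's): a highly regular string is far cheaper to describe as a short program plus an empty residual than literally.

```python
# Two-part description of a regular string: (program text) + (data given program).
data = "ab" * 1000  # 2000 characters

literal_bits = 8 * len(data)                     # one-part: raw 8-bit encoding
model_bits = 8 * len('"ab" * 1000')              # two-part: the program text itself
residual_bits = 0                                # data is fully determined by the program
print(literal_bits, model_bits + residual_bits)  # 16000 vs 88 bits
```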
Since KC is uncomputable, the paper introduces Finite Codebook Complexity (FC Complexity) as a computable upper bound.
- A Finite Coding Mechanism (FCM) is a Turing machine computing a finite, prefix-free codebook.
- A Universal Finite Codebook Computer (UFCC) is a Turing machine $V$ that can simulate any FCM from a recursively enumerable set. $V$ takes an index $k$ (identifying an FCM $T_k$) and data $p$, and outputs $T_k(p)$.
- The FC Complexity of a dataset $D$ is $C^{\mathrm{FC}}_V(D) = \min_{k,p} \{\, l(\langle k, p \rangle) : V(\langle k, p \rangle) = D \,\}$.
This can be rewritten as a two-part objective:

$$C_U(D) \le C^{\mathrm{FC}}_V(D) + O(1) \approx \min_T \bigl( 2\, l_V(T) + l(p_D) \bigr),$$

where $T$ is an FCM simulated by $V$, $2\, l_V(T)$ is the self-delimiting code length of $T$'s description for $V$, and $l(p_D)$ is the length of the data $D$ encoded using $T$'s codebook (approximated by the negative log-likelihood if $T$ implements a Shannon code).
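A sketch of why the data term $l(p_D)$ is approximated by the negative log-likelihood (illustrative; the per-symbol Shannon-style code here is a simplification of the paper's Huffman-coding construction):

```python
import math

def code_length_bits(dataset, model_prob):
    """Total Shannon code length: each symbol x costs ceil(-log2 P(x)) bits,
    so the total stays within len(dataset) bits of the NLL in bits."""
    return sum(math.ceil(-math.log2(model_prob(x))) for x in dataset)

def nll_bits(dataset, model_prob):
    """Negative log-likelihood of the dataset, in bits."""
    return sum(-math.log2(model_prob(x)) for x in dataset)

# Hypothetical biased-coin model over {0, 1}.
model = lambda x: 0.8 if x == 0 else 0.2
data = [0, 0, 1, 0, 1, 0, 0, 0]
print(code_length_bits(data, model), round(nll_bits(data, model), 2))  # 12 6.58
```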
The choice of UFCC is crucial. A "good" UFCC should assign shorter description lengths $l_V(T)$ to "simpler" or more structured FCMs (e.g., those implementing causal factorizations or symmetries).
4. Case Studies: Emergence of Structure through Compression
The paper demonstrates how minimizing FC complexity under specific UFCCs leads to selecting models with causal or symmetric structures. The UFCCs considered simulate FCMs by composing a CFMP with a Huffman coding program.
- Causal Factorizations and Sparse Mechanism Shifts:
- Using a UFCC $U_{\mathrm{TabCBN}}$ (which assumes probabilistic mechanisms in CFMPs are stored as full tables and feature mechanisms are projections), it's shown that a CFMP representing a causal factorization (e.g., $\mathbb{P}(X,Y,E) = \mathbb{P}(Y|X)\,\mathbb{P}(X|E)\,\mathbb{P}(E)$) has a shorter model description length $l_{U_{\mathrm{TabCBN}}}(\alpha)$ than a CFMP $\beta$ that stores the entire joint distribution $\mathbb{P}(X,Y,E)$ as one large table (Proposition 17); see the parameter-counting sketch after this list. This implies that if the data can be well explained by such a factorization, the overall FC complexity will favor the factorized model.
- Using a UFCC $U_{\mathrm{CompCBN}}$ (which allows probabilistic mechanisms to be compressible, e.g., described by a few parameters rather than a full table), it's shown that CFMPs exhibiting sparse mechanism shifts (SMS) are preferred (Proposition 18). If only a few underlying mechanisms change across environments, a CFMP that reuses the shared mechanisms and encodes only the few changed ones has a smaller model description length. The objective balances reusing mechanisms (lower $l_V(T)$) against fitting the data perfectly (lower $l(p_D)$).
- Symmetries:
- Using a UFCC $U_{\mathrm{TabInv}}$ (which favors CFMPs that can use quotient maps $\phi_G : \mathcal{X} \to \mathcal{X}/G$ as feature mechanisms, corresponding to group invariances), it's shown that if a distribution $\mathbb{P}(X_1, X_2)$ exhibits $G$-invariance (e.g., $\mathbb{P}(X_1|X_2) = \mathbb{P}(X_1|\phi_G(X_2))$), then a CFMP $\alpha$ encoding this invariant factorization has a shorter model description length $l_{U_{\mathrm{TabInv}}}(\alpha)$ than a CFMP $\beta$ encoding the standard Markov factorization $\mathbb{P}(X_1|X_2)\,\mathbb{P}(X_2)$ (Proposition 19).
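The parameter-counting intuition behind Propositions 17 and 19 can be sketched as follows (the cardinalities and the table-based storage model are illustrative assumptions):

```python
def table_params(*cardinalities):
    """Number of entries in a full conditional probability table."""
    n = 1
    for c in cardinalities:
        n *= c
    return n

# Illustrative cardinalities |X| = |Y| = 10, |E| = 5.
X, Y, E = 10, 10, 5

# CFMP beta: one table for the full joint P(X, Y, E).
joint = table_params(X, Y, E)                                         # 500 entries

# CFMP alpha: causal factorization P(Y|X) P(X|E) P(E).
factored = table_params(Y, X) + table_params(X, E) + table_params(E)  # 155 entries

# G-invariance shrinks tables further: if P(X1|X2) = P(X1|phi_G(X2)) and the
# quotient X/G has, say, 3 elements, then P(X1|phi_G(X2)) needs |X1| * |X/G|
# entries instead of the |X1| * |X2| + |X2| of the Markov factorization.
invariant = table_params(X, 3) + table_params(X)                      # 40 vs 110

print(joint, factored, invariant)
```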
5. Experiments
Synthetic experiments illustrate these concepts:
- Covariate Shifts: Data is generated with $\mathbb{P}(X,Y,E) = \mathbb{P}(Y|X)\,\mathbb{P}(X|E)\,\mathbb{P}(E)$, where $\mathbb{P}(Y|X)$ is fixed and $\mathbb{P}(X|E)$ changes sparsely across environments $E$. Minimizing FC complexity (NLL plus a model-length penalty based on the number of unique $\mathbb{P}(X|E)$ mechanisms) correctly identifies or prefers models with fewer mechanisms (closer to the true sparsity) compared to minimizing NLL alone, especially with limited data. The selection loop is sketched below.
```python
# Pseudocode for model selection in the covariate-shift experiment: minimize
# two-part FC complexity (NLL + model description length). Assumes `data`,
# `max_mechanisms`, and the helpers calculate_model_length (penalty per
# Proposition 18) and find_best_assignment_and_nll (NLL of the best assignment
# of k mechanisms to environments) are defined elsewhere.
import math

best_fc_complexity = math.inf
best_model_config = None
for k_mechanisms in range(1, max_mechanisms + 1):
    # Model description length grows with the number of distinct mechanisms.
    model_length_penalty = calculate_model_length(k_mechanisms)
    # Best assignment of k_mechanisms to environments, measured by NLL.
    min_nll_for_k = find_best_assignment_and_nll(data, k_mechanisms)
    current_fc_complexity = min_nll_for_k + model_length_penalty
    if current_fc_complexity < best_fc_complexity:
        best_fc_complexity = current_fc_complexity
        best_model_config = k_mechanisms
# The selected model uses best_model_config distinct mechanisms.
```
- Causal Discovery without Identifiability: For a linear Gaussian system $X \to Y$ where parameters change across environments, the ground truth is not identifiable from likelihood alone (as $Y \to X$ can achieve the same likelihood). FC complexity, by penalizing the number of free parameters (mechanisms), can prefer a sparser model, demonstrating a selection criterion beyond traditional identifiability.
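A sketch of the bookkeeping behind this preference (the per-mechanism bit cost and the mechanism counts are illustrative assumptions, not the paper's exact penalty): in the causal direction only $\mathbb{P}(X|E)$ changes across environments, while after re-factorizing in the anticausal direction both factors generically change.

```python
def fc_penalty(num_distinct_mechanisms, bits_per_mechanism=32.0):
    """Illustrative model-length term: a fixed bit cost per distinct
    mechanism (e.g., a few Gaussian parameters each)."""
    return num_distinct_mechanisms * bits_per_mechanism

num_envs = 5

# Causal direction X -> Y: P(Y|X) shared, one P(X|E=e) per environment.
causal_mechanisms = 1 + num_envs
# Anticausal direction Y -> X: both P(Y|E=e) and P(X|Y, E=e) generically
# differ in every environment.
anticausal_mechanisms = num_envs + num_envs

# With equal likelihoods, the two-part objective prefers the causal model.
print(fc_penalty(causal_mechanisms), fc_penalty(anticausal_mechanisms))  # 192.0 320.0
```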
6. Discussion and Implications
The authors suggest that if compression automatically yields causal structure, this could have significant implications for LLMs. LLMs are trained by minimizing cross-entropy (a form of compression) on vast datasets. Even if classical identifiability assumptions don't hold, the pressure to compress might force LLMs to learn "algorithmic causal" models, internalizing reusable, composable mechanisms.
Algorithmic causality is positioned not as a replacement for Pearl's causality but as a complementary framework for scenarios with:
- Correlational multi-environment data.
- Uncertainty about intervention targets.
- Finite data where model complexity is a significant factor.
The paper concludes that algorithmic causality offers a novel perspective on how causal understanding might emerge in machine learning models through the fundamental process of compression. The appendix further discusses the relationship with Minimum Description Length (MDL) and Bayesian model selection, noting that FC complexity (a two-part code) is related to, but distinct from, the Bayes codes used in refined MDL. The two-part code objective is $\min_T \bigl(-\log \mathbb{Q}(T) - \log \mathbb{P}(x \mid T)\bigr)$, which amounts to maximizing the posterior $\mathbb{P}(T \mid x)$ when $\mathbb{Q}(T)$ is a prior over models and $\mathbb{P}(x \mid T)$ is the likelihood.
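Spelling out that equivalence (a standard Bayes-rule step, not specific to the paper):

$$\operatorname*{arg\,min}_T \bigl(-\log \mathbb{Q}(T) - \log \mathbb{P}(x \mid T)\bigr) = \operatorname*{arg\,max}_T \, \mathbb{Q}(T)\, \mathbb{P}(x \mid T) = \operatorname*{arg\,max}_T \, \mathbb{P}(T \mid x),$$

since $\mathbb{P}(T \mid x) \propto \mathbb{Q}(T)\, \mathbb{P}(x \mid T)$ by Bayes' rule; two-part code minimization is thus MAP model selection under the prior $\mathbb{Q}$.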
In essence, the paper provides a formal framework and initial evidence for how causal and symmetric structures might be learned implicitly as a consequence of efficient data compression, even in the absence of traditional identifiability conditions.