A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment (2412.07446v3)

Published 10 Dec 2024 in cs.AI, cs.CL, cs.LG, and stat.ML

Abstract: Do generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learn a world model from which a sequence is generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT, and suggesting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences and present a confidence score. Empirical evaluation is conducted in a controlled environment using the setup and rules of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, is tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases for which the GPT model generates illegal moves it also fails to capture any causal structure.

Summary

  • The paper demonstrates that GPT’s masked attention implicitly learns a structural causal model for each input sequence.
  • It details how normalizing the attention matrix in a linear-Gaussian framework creates a covariance proxy for reconstructing token causal relationships.
  • Empirical results on Othello game sequences reveal that higher structural confidence correlates with improved legal move prediction.

This paper investigates whether Generative Pre-trained Transformer (GPT) models, primarily trained for next-token prediction, implicitly learn a causal world model. The authors propose a causal interpretation of the masked self-attention mechanism in GPT and suggest that it leads to the learning of a distinct structural causal model (SCM) for each input sequence.

The core idea is to relate the masked attention matrix $\mathbf{A}$ from a GPT model to the inverse of the SCM's causal effect matrix. Specifically, for a linear-Gaussian SCM where variables $\boldsymbol{X}$ are generated from exogenous noise $\boldsymbol{U}$ via $\boldsymbol{X} = (\mathbf{I}-\mathbf{G})^{-1}\boldsymbol{U}$, the covariance matrix is $\mathbf{C}_{\boldsymbol{X}} = (\mathbf{I}-\mathbf{G})^{-1} \mathbf{C}_{\boldsymbol{U}} ((\mathbf{I}-\mathbf{G})^{-1})^\top$. Assuming $\mathbf{C}_{\boldsymbol{U}}$ is the identity (i.e., independent, unit-variance exogenous noise) and leveraging the lower-triangular structure of the masked attention matrix $\mathbf{A}$, the paper proposes that the normalized attention matrix $\mathbf{D}^{-1}\mathbf{A}$ (where $\mathbf{D} = \text{diag}(\mathbf{A})$) corresponds to $(\mathbf{I}-\mathbf{G})^{-1}$. Thus, properties derived from $(\mathbf{D}^{-1}\mathbf{A})(\mathbf{D}^{-1}\mathbf{A})^\top$ can be seen as reflecting the covariance structure induced by an underlying SCM. The causal structure is implicitly encoded in how attention weights combine information from previous tokens.
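
As a minimal illustration (not code from the paper), the construction can be reproduced on a toy causally masked attention matrix; the values below are arbitrary:

    import numpy as np

    # Toy lower-triangular (causally masked) attention matrix for a 4-token sequence;
    # each row sums to 1, as produced by a masked softmax.
    A = np.array([
        [1.00, 0.00, 0.00, 0.00],
        [0.30, 0.70, 0.00, 0.00],
        [0.20, 0.30, 0.50, 0.00],
        [0.10, 0.20, 0.30, 0.40],
    ])

    # D = diag(A): dividing each row by its diagonal entry makes D^{-1} A
    # unit lower-triangular, the same form as (I - G)^{-1} in the linear-Gaussian SCM.
    D_inv_A = A / np.diag(A)[:, None]

    # Covariance proxy C = (D^{-1} A)(D^{-1} A)^T, playing the role of C_X with C_U = I.
    C = D_inv_A @ D_inv_A.T

The unit diagonal of $\mathbf{D}^{-1}\mathbf{A}$ is what allows it to be read as $(\mathbf{I}-\mathbf{G})^{-1}$ for some strictly lower-triangular causal effect matrix $\mathbf{G}$.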

Based on this interpretation, the paper proposes using GPT's attention matrices for zero-shot causal structure learning. Since masked attention imposes a strict temporal order (token $i$ can only attend to tokens $j \le i$), this temporal order can be treated as a valid causal topological order. This allows for efficient constraint-based causal discovery by testing conditional independence relations between tokens. The paper adapts the ABCD method [rohekar2024causal] and proposes a recursive causal discovery algorithm (RCD) that uses the known temporal order and the ICD algorithm [rohekar2021iterative] to reconstruct a Partial Ancestral Graph (PAG) from the covariance matrix estimated from the attention matrix of a single sequence. This means a causal graph representing the relationships between tokens in a sequence can be extracted directly from the model's internal state during inference.
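
The CI tests at the heart of this procedure can be computed directly from the covariance proxy. Below is a sketch of a standard partial-correlation (Fisher-z) test under the linear-Gaussian assumption; the helper name and the effective sample size `n_eff` are choices made here for illustration, and the RCD/ICD search strategy itself is not reproduced:

    import numpy as np
    from scipy import stats

    def partial_corr_pvalue(C, i, j, cond, n_eff):
        """Fisher-z test of X_i independent of X_j given X_cond, from covariance C.

        n_eff is an effective sample size; when C is a proxy derived from a single
        attention matrix, choosing it is a modeling decision, not given by the data.
        """
        idx = [i, j] + list(cond)
        sub = C[np.ix_(idx, idx)]
        prec = np.linalg.pinv(sub)                           # precision of the block
        r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
        r = float(np.clip(r, -0.999999, 0.999999))
        z = 0.5 * np.log((1 + r) / (1 - r))                  # Fisher z-transform
        stat = np.sqrt(max(n_eff - len(cond) - 3, 1)) * abs(z)
        return 2.0 * (1.0 - stats.norm.cdf(stat))            # two-sided p-value

    # Because the temporal order is a valid causal order, conditioning sets for a
    # pair (j, i) with j < i only need to be drawn from tokens preceding i.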

To evaluate how well attention represents a coherent causal structure, the paper introduces a structural confidence metric $R(\mathbf{A})$. This metric is based on the entropy of p-values from the conditional independence (CI) tests performed during causal discovery: $R(\mathbf{A}) = H_{\text{ind}} - H_{\text{dep}}$, where $H_{\text{ind}}$ is the entropy of p-values $\ge \alpha$ (independence, null hypothesis not rejected) and $H_{\text{dep}}$ is the entropy of p-values $< \alpha$ (dependence, null hypothesis rejected). A higher $R$ suggests a clearer distinction between dependent and independent relations, implying a more discernible causal structure captured by the attention matrix.

The empirical evaluation uses a GPT model trained on Othello game sequences [liemergent]. The key aspect of the setup is testing the model on sequences of legal moves that do not necessarily follow human winning strategies, meaning they lie outside the support of the training distribution but adhere to the game's underlying rules. The goal is to see whether the model captures the rules (the causal mechanisms) rather than just strategic patterns. The experiments focus on sequence lengths in the range [10, 30], where the model's accuracy in predicting legal moves is lower than average, suggesting less reliance on simple memorization.

The results show a correlation between the structural confidence score $R$ extracted from the attention matrix and the model's accuracy in generating a legal next move. For sequence lengths greater than 15, the accuracy of generating legal moves tends to increase monotonically with the structural confidence score (see the paper's figure of legal-move accuracy versus structural distinction). This suggests that when the attention mechanism encodes a more distinct causal structure (higher $R$), the model is better at adhering to the underlying game rules.

An ablation study further demonstrates the importance of using conditional independence tests (which reveal causal structure) over simple pairwise correlations (which only reveal associations); see the paper's structural-distinction figure. The difference in structural confidence between sequences leading to legal vs. illegal predictions is statistically significant for the CI tests used in causal graph reconstruction (case d), while it is weaker or absent for simple pairwise correlations (case a).

Practical Implementation Aspects:

  1. Accessing Attention Weights: Implementing this requires access to the attention weights $\mathbf{A}$ from the last attention layer of the GPT model during inference. Many popular deep learning frameworks (PyTorch, TensorFlow) allow extracting these intermediate values.
    # Example (conceptual) using PyTorch / Hugging Face-style APIs;
    # MyGPTModel and tokenizer are placeholders for a trained, loaded model.
    import torch

    model = MyGPTModel(...)
    inputs = tokenizer("sequence of tokens", return_tensors="pt")

    # Configure the model to return attention weights
    outputs = model(**inputs, output_attentions=True)

    # Attention weights from the last layer
    # Shape: (batch_size, num_heads, sequence_length, sequence_length)
    last_layer_attention = outputs.attentions[-1]

    # Select a specific sequence and head (e.g., first sequence, first head)
    attention_matrix = last_layer_attention[0, 0, :, :]

    # Normalize as D^{-1} A with D = diag(A): divide each row by its diagonal entry.
    # After causal masking and softmax, A is lower triangular with rows summing to 1,
    # so this scaling makes D^{-1} A unit lower-triangular, matching (I - G)^{-1}.
    diag = torch.diagonal(attention_matrix)
    D_inv_A = attention_matrix / diag[:, None]

    # Covariance proxy C = (D^{-1} A)(D^{-1} A)^T
    covariance_proxy = D_inv_A @ D_inv_A.transpose(-1, -2)
  2. Estimating the Covariance Proxy: Calculate $\mathbf{C} = (\mathbf{D}^{-1}\mathbf{A})(\mathbf{D}^{-1}\mathbf{A})^\top$. This matrix serves as the input to the causal discovery algorithm; a minimal sketch (not from the paper) of sanity-checking it and rescaling it for CI testing follows below.
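    # Minimal sketch (an assumption of this summary, not the paper's code):
    # sanity-check the covariance proxy and optionally rescale it to a correlation
    # matrix; partial-correlation CI tests are invariant to this rescaling.
    import torch

    C = covariance_proxy
    assert torch.allclose(C, C.transpose(-1, -2), atol=1e-6)  # symmetry

    # By construction C = (D^{-1} A)(D^{-1} A)^T is positive semidefinite,
    # so its eigenvalues should be (numerically) non-negative.
    eigvals = torch.linalg.eigvalsh(C)
    assert eigvals.min() > -1e-6

    # Optional: convert to a correlation matrix before running CI tests.
    d = torch.sqrt(torch.diagonal(C))
    correlation_proxy = C / (d[:, None] * d[None, :])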
  3. Causal Discovery: Apply a constraint-based causal discovery algorithm, such as ICD [rohekar2021iterative], using the estimated covariance matrix $\mathbf{C}$ and potentially the known temporal order as a constraint. Libraries like causality-lab (mentioned by the authors), pgmpy, dowhy, or castle might offer relevant implementations or components.
    # Conceptual steps for causal discovery (not a drop-in implementation)
    import numpy as np

    # 1. Convert the covariance proxy to a NumPy array
    covariance_np = covariance_proxy.detach().cpu().numpy()
    sequence_length = covariance_np.shape[0]

    # 2. Define the variables (one per token in the sequence)
    variables = [f"token_{i}" for i in range(sequence_length)]

    # 3. Apply a constraint-based algorithm that performs CI tests on the covariance
    #    matrix and exploits the known temporal (topological) order, e.g. the authors'
    #    RCD/ICD implementation in the Causality Lab library, or an adaptation of
    #    PC/FCI with a Gaussian (partial-correlation) CI test.
    #    Note: RCD/ICD is designed for potentially latent confounders and returns a
    #    PAG, and it leverages the known order; reproducing it precisely likely
    #    requires using or replicating parts of the Causality Lab code.
  4. Calculating Structural Confidence: Collect the p-values from the CI tests performed during causal discovery. Group them into $p \ge \alpha$ and $p < \alpha$, calculate the entropy term for each group, and compute $R = H_{\text{ind}} - H_{\text{dep}}$.
    # Conceptual step for calculating structural confidence
    import numpy as np

    # Assume 'p_values' is a list/array of p-values from the CI tests
    alpha = 0.05  # significance level used in the CI tests

    p_ind = [p for p in p_values if p >= alpha]  # independence (null not rejected)
    p_dep = [p for p in p_values if p < alpha]   # dependence (null rejected)

    def entropy_term(ps):
        # Follows the paper's definition H = -sum_{p in set} p log p over the raw
        # p-values (not a normalized distribution); guards against log(0).
        return -sum(p * np.log(p) for p in ps if p > 0)

    H_ind = entropy_term(p_ind)
    H_dep = entropy_term(p_dep)

    structural_confidence_R = H_ind - H_dep
  5. Application: The derived structural confidence score $R$ could be used as a diagnostic tool. If $R$ is low for a given sequence, it might indicate that the model's internal representation of the sequence's underlying structure is weak, potentially correlating with a higher likelihood of generating incorrect or "illegal" tokens (in the sense of violating world rules, not just statistical patterns). This could help in detecting or understanding sources of "hallucinations" or factual errors in generative models.

Computational Considerations: Causal discovery algorithms, especially constraint-based ones involving numerous conditional independence tests, can be computationally expensive. The number of tests typically grows polynomially with the number of variables (sequence length). For very long sequences, extracting and processing attention matrices and running causal discovery might add significant overhead. The recursive approach used in the paper likely helps manage complexity by leveraging the known topological order.

The paper's findings suggest that GPT models may go beyond surface statistics and capture aspects of the generative process (causal structure) through their attention mechanism, especially in the last layers. This offers a potential avenue for interpreting model behavior and understanding the basis of emergent capabilities.
