- The paper demonstrates that any DNN can be aligned with any algorithm using complex, non-linear maps, questioning the reliability of causal abstraction.
- Empirical findings reveal high interchange intervention accuracy in MLPs and language models once alignment maps are sufficiently complex, even with randomly initialized networks.
- The study underscores the need for revised interpretability frameworks that balance alignment map complexity with genuine mechanistic insight.
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Introduction
The paper "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" addresses the key challenge in machine learning interpretability: understanding the hidden decision-making processes of neural networks. It questions the sufficiency of using causal abstraction in mechanistic interpretability, especially when the alignment maps between algorithms and deep neural networks (DNNs) allow non-linearity. By showing that any neural network can be mapped to any algorithm under certain assumptions, the work suggests that removing linearity constraints leads to trivial causal abstractions, raising significant challenges for interpretable ML.
Causal Abstraction in Machine Learning
Causal abstraction aims to map a neural network's behavior onto a higher-level algorithm, simplifying the understanding of complex models. The technique assumes that if intervening on a network's hidden states through an alignment map induces the behavior predicted by the corresponding intervention on the algorithm, then the network indeed implements that algorithm. Most existing analyses restrict these maps to linear transformations, following the linear representation hypothesis. The paper challenges this restriction and explores increasingly complex maps to test how robust and meaningful causal abstraction remains.
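To make the procedure concrete, the sketch below shows an interchange intervention through a linear alignment map on a toy one-hidden-layer network. It is a minimal, hypothetical illustration under assumed details (random weights, an orthogonal change of basis, the first k coordinates standing in for one high-level variable), not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network standing in for the DNN (weights are illustrative).
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

def hidden(x):
    return np.tanh(W1 @ x + b1)

def output(h):
    return W2 @ h + b2

# Assumed linear alignment map: an orthogonal change of basis whose first k
# coordinates are taken to encode one high-level variable of the algorithm.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
k = 2

def interchange_intervention(x_base, x_source):
    """Run the network on x_base, but overwrite the aligned subspace with the
    value it takes on x_source, then finish the forward pass."""
    z_base = Q @ hidden(x_base)
    z_source = Q @ hidden(x_source)
    z_base[:k] = z_source[:k]          # swap in the source value of the variable
    return output(Q.T @ z_base)        # rotate back and continue the computation

# Causal abstraction then asks whether this counterfactual output matches the
# output the high-level algorithm predicts under the analogous variable swap.
x_base, x_source = rng.normal(size=4), rng.normal(size=4)
print(interchange_intervention(x_base, x_source))
```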
Figure 1: A visualisation of what happens when analysing causal abstractions with increasingly complex alignment maps. The more complex the map, the higher the intervention accuracy---and thus, stronger algorithm-DNN alignment.
Theoretical Contributions
The paper's primary theoretical finding is that, under injectivity and surjectivity conditions on the layers of a DNN, any algorithm can be shown to be a distributed abstraction of any DNN via arbitrarily complex alignment maps. This result is significant: without constraints on the alignment map, causal abstraction loses its utility as a tool for understanding neural representations, a problem the authors call the non-linear representation dilemma.
Key assumptions include:
- Injectivity and Surjectivity: injectivity ensures that information is preserved across layers, while surjectivity ensures that every output label can be produced from some latent state.
- Countable Inputs: the input domain is countable (e.g., natural-language strings), which enables the theoretical construction.
- Matching Partial Orderings: the compositional structure of the network's layers must be compatible with the causal graph of the algorithm.
Under these assumptions, the paper proves that causal abstraction, stripped of any linearity restriction, becomes vacuous; the construction is sketched below.
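To see why some restriction on the alignment map is doing the real work, consider the following schematic version of the argument (a simplified paraphrase with illustrative notation, not the paper's exact statement): if the map from inputs to a hidden layer is injective, the hidden state determines the input, so an alignment map can simply decode the input and re-encode whatever value the algorithm assigns to its variable.

```latex
% Simplified sketch; notation is illustrative, not the paper's.
% Let f_\ell : \mathcal{X} \to \mathbb{R}^d be the (injective) map from inputs
% to layer \ell, and let a_V : \mathcal{X} \to \mathcal{V} give the value the
% high-level algorithm assigns to variable V on input x. Define the alignment map
\[
  \tau_V \;=\; a_V \circ f_\ell^{-1} ,
\]
% so \tau_V(h) recovers the unique input x with f_\ell(x) = h and reads off the
% algorithm's value for V. With such unconstrained maps, interventions on V can
% always be mirrored by interventions on the hidden state, making the algorithm
% a perfect abstraction of the network regardless of what the network computes.
% Only constraints on \tau_V (e.g. linearity) rule this construction out.
```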
Empirical Investigations
Empirically, the paper supports its theoretical claims with experiments on multi-layer perceptrons (MLPs) and large language models (LLMs). It studies the hierarchical equality and indirect object identification tasks, providing evidence that increasing the complexity of alignment maps can yield near-perfect interchange intervention accuracy (IIA), even in randomly initialized networks.
Figure 2: IIA in the hierarchical equality task for causal abstractions trained with different alignment maps. Mean IIA over 5 seeds using various configurations.
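For concreteness, the hierarchical equality task and the IIA metric can be sketched as follows; this is a hypothetical re-implementation for illustration, with function names and details that are assumptions rather than the paper's code. The high-level algorithm computes two intermediate equalities and compares them, and IIA is the fraction of base/source input pairs for which the intervened network output matches the algorithm's counterfactual prediction.

```python
import numpy as np

# Hierarchical equality: the input is four vectors (a, b, c, d); intermediate
# variables are v1 = (a == b) and v2 = (c == d); the label is (v1 == v2).
def algorithm(inputs, swap_v1_from=None):
    a, b, c, d = inputs
    v1 = np.array_equal(a, b)
    if swap_v1_from is not None:          # interchange intervention on v1:
        sa, sb, _, _ = swap_v1_from       # take v1's value from the source input
        v1 = np.array_equal(sa, sb)
    v2 = np.array_equal(c, d)
    return v1 == v2

def iia(model_intervene, pairs):
    """Interchange intervention accuracy: how often the intervened model agrees
    with the algorithm's counterfactual prediction for v1."""
    hits = sum(
        int(model_intervene(base, source) == algorithm(base, swap_v1_from=source))
        for base, source in pairs
    )
    return hits / len(pairs)
```

Here model_intervene would wrap an interchange intervention like the earlier sketch, with the alignment map either fixed (e.g., linear) or trained to maximize exactly this accuracy.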
The experiments reveal:
- Hidden State Complexity: randomly initialized models achieve high IIA once the alignment map is complex enough, indicating that the accuracy reflects the capacity of the map rather than an algorithm genuinely implemented by the network.
- Training Dynamics: IIA under simple (e.g., linear) alignment maps improves as training progresses, suggesting that some degree of linear representation is adopted during learning.
Implications and Future Directions
The findings highlight a critical tension between accuracy and complexity in alignment maps. They suggest a need to revisit probing methodologies and the assumptions underlying causal abstraction. Future research could examine how well alignment maps generalize and how prone they are to overfitting, since an expressive map can simply memorize the data rather than provide mechanistic understanding.
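As a hedged illustration of the memorization concern (an assumption-laden toy experiment, not one from the paper), the snippet below fits a high-capacity non-linear probe, standing in for an expressive alignment map, to random labels on frozen random hidden states: training accuracy can be driven high while held-out accuracy stays at chance, showing that expressiveness alone can manufacture apparent alignment.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-ins for frozen hidden states of a network, paired with labels that the
# high-level algorithm would assign. Labels are random here, so there is no
# mechanism to recover; anything the probe finds is memorization.
H = rng.normal(size=(700, 64))
y = rng.integers(0, 2, size=700)
H_train, y_train = H[:500], y[:500]
H_test, y_test = H[500:], y[500:]

# A high-capacity, non-linear "alignment map" can still fit the training pairs.
probe = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=2000, random_state=0)
probe.fit(H_train, y_train)
print("train accuracy:   ", probe.score(H_train, y_train))  # typically near 1.0
print("held-out accuracy:", probe.score(H_test, y_test))    # typically near 0.5 (chance)
```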
The non-linear representation dilemma shows that, without specific assumptions about how representations are encoded, causal abstraction becomes an unreliable interpretability tool. The field may benefit from explicitly navigating the accuracy/complexity trade-off and from exploring additional constraints that let causal analysis deliver genuine interpretability.
Conclusion
In conclusion, this work argues that causal abstraction, as currently practiced, is not sufficient for meaningful mechanistic interpretability. By demonstrating that algorithm-DNN alignment is trivially attainable under broad conditions, the paper challenges the community to refine and rethink the foundational frameworks of causal abstraction in interpretable machine learning.