- The paper introduces a novel neuroscientific approach, viewing the transformer's residual stream as a dynamical system to analyze its layer-by-layer computations.
- Analysis of Llama 3.1 shows that individual residual stream units maintain surprising continuity across layers, while the overall residual stream vector accelerates and becomes denser.
- Findings suggest that transformer representations follow structured, self-correcting curved trajectories, indicating stable computational patterns that actively maintain desired dynamics.
The paper "Transformer Dynamics: A neuroscientific approach to interpretability of LLMs" introduces a novel framework for studying computations in deep learning systems, drawing inspiration from dynamical systems approaches in neuroscience. The authors conceptualize the residual stream (RS) in transformer models as a dynamical system evolving across layers. Their analysis focuses on characterizing the behavior of individual RS units and the overall dynamics of the RS vector. The model used in the experiments is Llama 3.1 8B, with L=32 layers and model dimension D=4096.
Key findings include:
- Continuity of Residual Stream Units: The paper demonstrates that individual units in the RS maintain strong correlations across layers, despite the RS not being a privileged basis. The Pearson correlation coefficient between activations at different layers is computed as:
$$r^{\,l,l+1}_i = \frac{\operatorname{cov}\!\left(h^l_i,\, h^{l+1}_i\right)}{\sigma_{h^l_i}\,\sigma_{h^{l+1}_i}}$$
where:
- $r^{\,l,l+1}_i$ is the Pearson correlation coefficient between layers $l$ and $l+1$ for unit $i$
- $\operatorname{cov}(h^l_i, h^{l+1}_i)$ is the covariance of activations of unit $i$ at layers $l$ and $l+1$
- $\sigma_{h^l_i}$ is the standard deviation of activations of unit $i$ at layer $l$.
This continuity suggests an unexpected stability in the representations learned by the transformer.
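This layer-to-layer unit correlation can be sketched in a few lines of numpy (a minimal illustration; the array layout, function name, and toy data are assumptions, not the paper's code):

```python
import numpy as np

def unit_layer_correlation(acts: np.ndarray, unit: int, layer: int) -> float:
    """Pearson correlation of one RS unit's activations between consecutive
    layers, computed over a batch of inputs.

    acts: array of shape (n_samples, n_layers, d_model) holding
          residual-stream activations (hypothetical layout).
    """
    x = acts[:, layer, unit]      # unit's activations at layer l
    y = acts[:, layer + 1, unit]  # same unit at layer l + 1
    cov = np.cov(x, y)[0, 1]      # cov(h^l_i, h^{l+1}_i)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

# Toy demo: a "continuous" unit whose value largely persists across layers.
rng = np.random.default_rng(0)
base = rng.normal(size=(256, 1, 1))
acts = np.tile(base, (1, 3, 4)) + 0.1 * rng.normal(size=(256, 3, 4))
r = unit_layer_correlation(acts, unit=0, layer=0)  # close to 1
```

In the toy data the shared `base` signal dominates the per-layer noise, so the correlation comes out near 1, mimicking the cross-layer continuity the paper reports.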
- Evolution of the Residual Stream: The RS systematically accelerates and grows denser as information progresses through the network's layers. The velocity of the RS vectors is calculated as $\|V_l\| = \|h^{l+1} - h^l\|_2$, where $\|\cdot\|_2$ denotes the Euclidean norm. The magnitude of activations at the last token position for input sequences increases over the layers, and units tend to preserve their sign.
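The velocity computation is a simple finite difference over layers; a minimal sketch, assuming activations are stacked as a (n_layers, d_model) array for one token position (the layout and toy data are assumptions):

```python
import numpy as np

def rs_velocity_norms(h: np.ndarray) -> np.ndarray:
    """Euclidean norms of layer-to-layer RS updates, ||h^{l+1} - h^l||_2.

    h: array of shape (n_layers, d_model), the residual-stream vector at
       one token position across layers (hypothetical layout).
    """
    return np.linalg.norm(np.diff(h, axis=0), axis=1)

# Toy demo: a stream whose per-layer steps grow with depth ("acceleration").
h = np.cumsum(np.arange(1, 5)[:, None] * np.ones((4, 8)), axis=0)
v = rs_velocity_norms(h)  # one norm per layer transition, increasing
```

A monotonically increasing `v` is the signature of the acceleration the paper describes.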
- Mutual Information Dynamics: The authors identify a sharp decrease in mutual information (MI) during the early layers, indicating a transformation in how the network processes information. The mutual information is computed using kernel density estimation:
$$MI(l, l+1) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$
where:
- $MI(l, l+1)$ is the mutual information between layers $l$ and $l+1$
- $p(x,y)$ is the joint probability density of activations at layers $l$ and $l+1$
- $p(x)$ and $p(y)$ are the marginal densities of activations at layers $l$ and $l+1$, respectively.
This decrease occurs simultaneously with increasing linear correlations between layers.
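A KDE-based MI estimate can be sketched with scipy's Gaussian kernel density estimator (a rough illustration; the grid size, bandwidth defaults, and function name are assumptions, and the paper's exact estimator settings may differ):

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x: np.ndarray, y: np.ndarray, n_grid: int = 40) -> float:
    """Mutual information between two sets of activations, estimated by
    evaluating Gaussian KDEs of the joint and marginals on a regular grid."""
    joint = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    xs = np.linspace(x.min(), x.max(), n_grid)
    ys = np.linspace(y.min(), y.max(), n_grid)
    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    pxy = joint(np.vstack([gx.ravel(), gy.ravel()])).reshape(n_grid, n_grid)
    marg = np.outer(px(xs), py(ys))                 # p(x) * p(y)
    mask = (pxy > 1e-12) & (marg > 1e-12)           # avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / marg[mask])) * dx * dy)

# Toy demo: strongly coupled "layers" carry much more MI than independent ones.
rng = np.random.default_rng(0)
x = rng.normal(size=400)
mi_dep = mi_kde(x, x + 0.1 * rng.normal(size=400))
mi_ind = mi_kde(x, rng.normal(size=400))
```

The sum in the formula becomes a Riemann sum over the evaluation grid here; finer grids trade accuracy for compute.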
- Unstable Periodic Orbits: Individual RS units trace unstable periodic orbits in phase space, suggesting structured computational patterns at the unit level. The number of rotations in the 2D phase space (defined by the unit's activation value and its gradient across layers) is quantified by tracking the cumulative change in angle of the tangent vector along the trajectory. The total number of rotations R is calculated as:
$$R = \frac{1}{2\pi}\sum_l \Delta\theta_l$$
where:
- $\Delta\theta_l = \theta_{l+1} - \theta_l$ is the angle change between consecutive points
- $\theta_l = \operatorname{arctan2}(\Delta y_l, \Delta x_l)$ is the angle of the tangent vector
- $\Delta x_l = x_{l+1} - x_l$ and $\Delta y_l = y_{l+1} - y_l$ are the tangent vectors between consecutive points.
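The rotation count above can be sketched directly from a 2D trajectory (a minimal numpy sketch; the angle-wrapping convention and toy circular trajectory are assumptions, not the paper's code):

```python
import numpy as np

def count_rotations(xy: np.ndarray) -> float:
    """Number of rotations of the tangent vector along a 2D phase-space
    trajectory: R = (1/2pi) * sum_l dtheta_l.

    xy: array of shape (n_points, 2), e.g. (activation, gradient) pairs
        across layers (hypothetical layout).
    """
    d = np.diff(xy, axis=0)                          # tangent vectors
    theta = np.arctan2(d[:, 1], d[:, 0])             # tangent angles
    dtheta = np.diff(theta)
    dtheta = (dtheta + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    return float(np.abs(dtheta.sum()) / (2 * np.pi))

# Toy demo: two full turns around the unit circle should give R close to 2.
t = np.linspace(0, 4 * np.pi, 400)
circle = np.column_stack([np.cos(t), np.sin(t)])
R = count_rotations(circle)
```

The wrap to $(-\pi, \pi]$ keeps the sum from jumping by $2\pi$ when the tangent angle crosses the branch cut of `arctan2`.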
- Curved Trajectories and Attractor-like Dynamics: Representations in the RS follow self-correcting curved trajectories in reduced dimensional space, with attractor-like dynamics in the lower layers. Dimensionality reduction is performed using both a compressing autoencoder (CAE) and Principal Component Analysis (PCA). Perturbation analysis in PCA space reveals that trajectories "teleported" to various points tend to return to the original trajectory, indicating self-correcting dynamics. The inverse PCA transformation is given by:
$$x_i = z_i\, V_{[:2]}^{\top} + \mu$$
where:
- $x_i$ is the reconstructed activation vector
- $z_i$ is the point in the 2D grid
- $V_{[:2]}$ contains the first two principal components
- $\mu$ is the mean of the original activation distribution.
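The inverse transformation can be sketched with an SVD-based PCA (a minimal illustration under assumed conventions: principal components stored as rows of `Vt`, so the paper's transpose is absorbed; the toy data and function name are mine, not the paper's):

```python
import numpy as np

def inverse_pca_2d(z: np.ndarray, Vt: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Map 2D PCA coordinates back to activation space: x = z V[:2] + mu,
    where the rows of Vt are the principal components."""
    return z @ Vt[:2] + mu

# Toy demo: fit PCA via SVD of centered data, project to 2D, reconstruct.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)  # rows = components
z = (X - mu) @ Vt[:2].T            # 2D projection (one row per sample)
X_rec = inverse_pca_2d(z, Vt, mu)  # rank-2 reconstruction in full space
```

In the perturbation analysis, grid points `z` need not come from projected data: any 2D point can be mapped back into activation space this way and injected into the network, which is how the "teleported" trajectories are constructed.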
The authors suggest that their findings support the notion that transformers develop stable computational channels that actively maintain desired trajectories, possibly self-correcting errors through their dynamics. They propose that this "neuroscience of AI" approach, which combines theoretical rigor with large-scale data analysis, can advance our understanding of modern neural networks.
The paper contrasts the dynamical systems approach with circuit-based mechanistic interpretability and methods using sparse autoencoders. It suggests that investigating the dynamics of transformers could unify theoretical insights with large-scale data analysis and experimental manipulation. Future work may involve examining the dynamics of the RS in other AI architectures, analyzing the influence of whole sequences of tokens on dynamics, and investigating dynamics over the course of model training.