
Transformer Dynamics: A neuroscientific approach to interpretability of large language models (2502.12131v1)

Published 17 Feb 2025 in cs.AI

Abstract: As artificial intelligence models have exploded in scale and capability, understanding of their internal mechanisms remains a critical challenge. Inspired by the success of dynamical systems approaches in neuroscience, here we propose a novel framework for studying computations in deep learning systems. We focus on the residual stream (RS) in transformer models, conceptualizing it as a dynamical system evolving across layers. We find that activations of individual RS units exhibit strong continuity across layers, despite the RS being a non-privileged basis. Activations in the RS accelerate and grow denser over layers, while individual units trace unstable periodic orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with attractor-like dynamics in the lower layers. These insights bridge dynamical systems theory and mechanistic interpretability, establishing a foundation for a "neuroscience of AI" that combines theoretical rigor with large-scale data analysis to advance our understanding of modern neural networks.

Summary

  • The paper introduces a novel neuroscientific approach, viewing the transformer's residual stream as a dynamical system to analyze its layer-by-layer computations.
  • Analysis of Llama 3.1 shows that individual residual stream units maintain surprising continuity across layers, while the overall residual stream vector accelerates and becomes denser.
  • Findings suggest that transformer representations follow structured, self-correcting curved trajectories, indicating stable computational patterns that actively maintain desired dynamics.

The paper "Transformer Dynamics: A neuroscientific approach to interpretability of LLMs" introduces a novel framework for studying computations in deep learning systems, drawing inspiration from dynamical systems approaches in neuroscience. The authors conceptualize the residual stream (RS) in transformer models as a dynamical system evolving across layers. Their analysis focuses on characterizing the behavior of individual RS units and the overall dynamics of the RS vector. The model used in the experiments is Llama 3.1 8B, where L=32L = 32 represents the number of layers and D=4096D = 4096 is the model dimension.

Key findings include the following; illustrative code sketches for the corresponding measures appear after the list:

  1. Continuity of Residual Stream Units: The paper demonstrates that individual units in the RS maintain strong correlations across layers, despite the RS not being a privileged basis. The Pearson correlation coefficient between activations at different layers is computed as:

    $$r_{l,l+1}^{u} = \frac{\operatorname{cov}\left(h_l^{u},\, h_{l+1}^{u}\right)}{\sigma_{h_l^{u}}\,\sigma_{h_{l+1}^{u}}}$$

    where:

    • $r_{l,l+1}^{u}$ is the Pearson correlation coefficient between layers $l$ and $l+1$ for unit $u$
    • $\operatorname{cov}(h_l^{u}, h_{l+1}^{u})$ is the covariance of the activations of unit $u$ at layers $l$ and $l+1$
    • $\sigma_{h_l^{u}}$ and $\sigma_{h_{l+1}^{u}}$ are the standard deviations of unit $u$'s activations at layers $l$ and $l+1$

    This continuity suggests an unexpected stability in the representations learned by the transformer.

  2. Evolution of the Residual Stream: The RS systematically accelerates and grows denser as information progresses through the network's layers. The layer-to-layer velocity of the RS vector is calculated as $\|V_l\| = \|\mathbf{h}_{l+1} - \mathbf{h}_l\|_2$, where $\|\cdot\|_2$ denotes the Euclidean norm. The magnitude of activations at the last token position of input sequences increases over the layers, and units tend to preserve their sign across layers.

  3. Mutual Information Dynamics: The authors identify a sharp decrease in mutual information (MI) during early layers, indicating a transformation in how the network processes information. The mutual information is computed using kernel density estimation:

    $$\mathrm{MI}(l, l+1) = \sum_{x,y} p(x,y)\, \log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right)$$

    where:

    • $\mathrm{MI}(l, l+1)$ is the mutual information between layers $l$ and $l+1$
    • $p(x,y)$ is the joint probability density of activations at layers $l$ and $l+1$
    • $p(x)$ and $p(y)$ are the marginal densities of activations at layers $l$ and $l+1$, respectively.

    This decrease occurs simultaneously with increasing linear correlations between layers.

  4. Unstable Periodic Orbits: Individual RS units trace unstable periodic orbits in phase space, suggesting structured computational patterns at the unit level. The number of rotations in the 2D phase space (defined by the unit's activation value and its gradient across layers) is quantified by tracking the cumulative change in angle of the tangent vector along the trajectory. The total number of rotations RR is calculated as:

    $$R = \frac{1}{2\pi}\sum_{l} \Delta\theta_l$$

    where:

    • $\Delta\theta_l = \theta_{l+1} - \theta_l$ is the angle change between consecutive points
    • $\theta_l = \operatorname{arctan2}(\Delta y_l, \Delta x_l)$ is the angle of the tangent vector at point $l$
    • $\Delta x_l = x_{l+1} - x_l$ and $\Delta y_l = y_{l+1} - y_l$ are the components of the tangent vector between consecutive points.
  5. Curved Trajectories and Attractor-like Dynamics: Representations in the RS follow self-correcting curved trajectories in reduced dimensional space, with attractor-like dynamics in the lower layers. Dimensionality reduction is performed using both a compressing autoencoder (CAE) and Principal Component Analysis (PCA). Perturbation analysis in PCA space reveals that trajectories "teleported" to various points tend to return to the original trajectory, indicating self-correcting dynamics. The inverse PCA transformation is given by:

    $$\mathbf{x}_i = \mathbf{z}_i\, \mathbf{V}[:2]^{T} + \boldsymbol{\mu}$$

    where:

    • $\mathbf{x}_i$ is the reconstructed activation vector
    • $\mathbf{z}_i$ is the point in the 2D grid
    • $\mathbf{V}[:2]$ contains the first two principal components
    • $\boldsymbol{\mu}$ is the mean of the original activation distribution.
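
As an illustration of items 1 and 2, the following is a minimal NumPy sketch, not the authors' code: it computes the per-unit Pearson correlation between consecutive layers and the RS velocity $\|V_l\|$ from an assumed activation tensor of shape $(L+1, N, D)$, where $N$ is a number of sampled token positions; stand-in random data is used.

```python
import numpy as np

# Assumed layout: rs has shape (L + 1, N, D), i.e. RS states at every layer for N
# sampled token positions; stand-in random data is used here for illustration.
L_plus_1, N, D = 33, 256, 4096
rng = np.random.default_rng(0)
rs = rng.standard_normal((L_plus_1, N, D), dtype=np.float32)

# Item 1: per-unit Pearson correlation between consecutive layers.
def unit_correlations(h_l, h_next):
    """Pearson r of each unit's activations between two layers; inputs are (N, D)."""
    h_l = h_l - h_l.mean(axis=0)
    h_next = h_next - h_next.mean(axis=0)
    cov = (h_l * h_next).mean(axis=0)
    return cov / (h_l.std(axis=0) * h_next.std(axis=0) + 1e-8)

r = np.stack([unit_correlations(rs[l], rs[l + 1]) for l in range(L_plus_1 - 1)])  # (L, D)
print("mean unit continuity per layer:", r.mean(axis=1))

# Item 2: RS velocity ||V_l|| = ||h_{l+1} - h_l||_2, averaged over token positions.
velocity = np.linalg.norm(rs[1:] - rs[:-1], axis=-1).mean(axis=1)  # (L,)
print("mean RS velocity per layer:", velocity)
```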
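
For the mutual-information measure in item 3, the sketch below approximates a KDE-based estimate with SciPy's `gaussian_kde` evaluated on a regular grid; the grid size, default bandwidth, and stand-in samples are assumptions, not the authors' estimator settings.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mutual_information_kde(x, y, grid_size=64):
    """Approximate MI(x, y) using Gaussian KDEs evaluated on a regular 2D grid."""
    kde_joint = gaussian_kde(np.vstack([x, y]))
    kde_x, kde_y = gaussian_kde(x), gaussian_kde(y)

    xs = np.linspace(x.min(), x.max(), grid_size)
    ys = np.linspace(y.min(), y.max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.vstack([gx.ravel(), gy.ravel()])

    p_xy = kde_joint(grid)       # joint density p(x, y) on the grid
    p_x = kde_x(grid[0])         # marginal density p(x)
    p_y = kde_y(grid[1])         # marginal density p(y)

    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    integrand = p_xy * np.log((p_xy + 1e-12) / (p_x * p_y + 1e-12))
    return float(np.sum(integrand) * dx * dy)

# Example: MI between one unit's activations at consecutive layers (stand-in data).
x = np.random.randn(2000)
y = 0.8 * x + 0.2 * np.random.randn(2000)
print("MI(l, l+1) ≈", mutual_information_kde(x, y))
```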
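
The rotation count of item 4 follows directly from the angle formulas above. The sketch below assumes a unit's 2D phase-space trajectory is supplied as two 1D arrays and uses angle unwrapping to handle jumps across the $\pm\pi$ boundary.

```python
import numpy as np

def count_rotations(x, y):
    """Total rotations R = (1 / 2*pi) * sum of tangent-angle changes along a 2D trajectory."""
    dx = np.diff(x)                      # tangent-vector components between consecutive points
    dy = np.diff(y)
    theta = np.arctan2(dy, dx)           # angle of each tangent vector
    dtheta = np.diff(np.unwrap(theta))   # angle change, unwrapped across the +/- pi boundary
    return dtheta.sum() / (2 * np.pi)

# Example: two loops around a circle give roughly two rotations of the tangent vector.
t = np.linspace(0, 4 * np.pi, 200)
print(count_rotations(np.cos(t), np.sin(t)))  # ≈ 2.0
```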
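
Finally, the "teleportation" analysis of item 5 can be sketched with scikit-learn's PCA: project the RS trajectory to 2D, displace one state, and reconstruct it with the inverse transformation above. Re-injecting the perturbed state into the model's forward pass is model-specific and is not shown; the stand-in data and perturbation offset are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed: rs_last_token holds the RS trajectory at one token position, shape (L + 1, D).
rng = np.random.default_rng(0)
rs_last_token = rng.standard_normal((33, 4096), dtype=np.float32)  # stand-in data

pca = PCA(n_components=2)
z = pca.fit_transform(rs_last_token)          # (L + 1, 2): trajectory in reduced space

# "Teleport" a mid-trajectory state to a different point in PCA space ...
z_perturbed = z[16] + np.array([5.0, -5.0])

# ... and map it back to the full RS space with x = z V[:2]^T + mu.
x_perturbed = z_perturbed @ pca.components_ + pca.mean_   # (D,)

# Equivalent to pca.inverse_transform(z_perturbed[None])[0]; the perturbed vector would
# then replace h_16 and the remaining layers would be run from it to test whether the
# trajectory returns to its original course.
assert np.allclose(x_perturbed, pca.inverse_transform(z_perturbed[None])[0], atol=1e-4)
```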

The authors suggest that their findings support the notion that transformers develop stable computational channels that actively maintain desired trajectories, possibly self-correcting errors through their dynamics. They propose that this "neuroscience of AI" approach, which combines theoretical rigor with large-scale data analysis, can advance our understanding of modern neural networks.

The paper contrasts the dynamical systems approach with circuit-based mechanistic interpretability and methods using sparse autoencoders. It suggests that investigating the dynamics of transformers could unify theoretical insights with large-scale data analysis and experimental manipulation. Future work may involve examining the dynamics of the RS in other AI architectures, analyzing the influence of whole sequences of tokens on dynamics, and investigating dynamics over the course of model training.
