Overview of Residual Stream Analysis with Multi-Layer SAEs
The paper "Residual Stream Analysis with Multi-Layer SAEs" introduces a significant methodological advancement in the paper of transformer LLMs. The authors propose a multi-layer sparse autoencoder (MLSAE) to analyze the internal representations within transformers, addressing key limitations of standard SAEs that target single-layer activation vectors.
Multi-Layer Sparse Autoencoders
Motivation: Standard SAEs are typically trained on activation vectors from individual layers, limiting their utility for studying how information flows between layers. The residual stream perspective suggests that information is preserved and selectively processed across transformer layers, warranting a unified approach that analyzes all layers together.
Contribution: The key contribution is the MLSAE, a single SAE trained on residual stream activations across all transformer layers simultaneously. This approach not only maintains the reconstruction performance of standard SAEs but also reveals how semantic information propagates across layers in transformers.
Methodology
The MLSAE architecture involves training a single SAE on residual-stream activation vectors from every transformer layer, treating the vectors from each layer as separate training examples, as sketched below. This is conceptually similar to training a separate SAE for each layer, except that the parameters are shared across all layers, which makes it possible to identify features that are active at multiple layers.
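As a rough illustration of this setup, the sketch below collects residual-stream activations from every layer and flattens them into a single batch of SAE training examples. It assumes a Hugging Face-style model that returns per-layer hidden states; the function name `collect_mlsae_batch` and the overall pipeline are illustrative, not taken from the paper's code.

```python
import torch

def collect_mlsae_batch(model, tokens):
    """Gather residual-stream activations from every layer so that each
    layer's vectors become independent training examples for one shared SAE.

    Assumes a Hugging Face-style model that returns per-layer hidden states;
    this is an illustrative sketch, not the paper's exact pipeline.
    """
    with torch.no_grad():
        out = model(tokens, output_hidden_states=True)
    # hidden_states: one (batch, seq, d_model) tensor per layer
    # (plus the embedding output, which is dropped here).
    acts = torch.stack(out.hidden_states[1:], dim=0)  # (layers, batch, seq, d_model)
    # Flatten layers, batch, and positions into one axis: every row becomes a
    # separate training example for the single multi-layer SAE.
    return acts.reshape(-1, acts.shape[-1])           # (layers * batch * seq, d_model)
```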
Key aspects of the methodology include:
- Residual Stream Perspective: The authors adopt the residual stream framework, in which each self-attention and MLP block reads from and writes updates to a shared residual stream that carries information between layers.
- Activation of Features Across Layers: The paper identifies and analyzes features active across multiple transformer layers, both for aggregated training data and individual prompts.
- Model and Training Configuration: The MLSAEs are trained on GPT-style models from the Pythia suite, with an architecture that uses ReLU and TopK activation functions to maintain sparse latent representations.
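A minimal sketch of the autoencoder itself is given below, assuming the common ReLU-then-TopK formulation in which only the k largest latent activations are kept; the class name and hyperparameter values are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Sparse autoencoder with a ReLU + TopK latent activation (illustrative)."""

    def __init__(self, d_model: int, expansion_factor: int = 16, k: int = 32):
        super().__init__()
        d_latent = expansion_factor * d_model
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x: torch.Tensor):
        # Encode, rectify, then keep only the k largest latent activations.
        z = torch.relu(self.encoder(x))
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        x_hat = self.decoder(z_sparse)
        return x_hat, z_sparse
```

Under this setup, the MLSAE is simply this module trained on the flattened multi-layer batch produced earlier, reconstructing each residual-stream vector regardless of which layer it came from.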
Results
The results provide insights into the flow of information in transformers:
- Cosine Similarity: Activation vectors at adjacent layers exhibit higher cosine similarities in larger models, indicating that more information is preserved across layers as model size increases.
- Feature Activation: Aggregated over the training data, many features are active at multiple layers. For individual prompts, however, a higher proportion of features are active at a single layer, suggesting that feature specificity is context-dependent.
- Normalization and Reconstruction Metrics: MLSAEs achieve comparable reconstruction error (both FVU and MSE) to single-layer SAEs. They also demonstrate similar downstream impacts in terms of increased cross-entropy loss and KL divergence when replacing original activations with reconstructed ones.
- MMCS: The Mean Max Cosine Similarity (MMCS) remains relatively stable across different expansion factors, indicating that the recovered sparse features are robust to changes in the SAE configuration; a sketch of how these metrics can be computed follows this list.
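To make these quantities concrete, here is a hedged sketch of how they are commonly computed; the exact definitions and normalizations used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def fvu(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Fraction of variance unexplained: reconstruction error over the variance of x.
    `x` and `x_hat` have shape (examples, d_model)."""
    return ((x - x_hat) ** 2).sum() / ((x - x.mean(dim=0)) ** 2).sum()

def adjacent_layer_cosine(acts: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between residual-stream vectors at adjacent layers.
    `acts` has shape (layers, examples, d_model)."""
    return F.cosine_similarity(acts[:-1], acts[1:], dim=-1).mean()

def mmcs(decoder_a: torch.Tensor, decoder_b: torch.Tensor) -> torch.Tensor:
    """Mean Max Cosine Similarity between two SAE decoder dictionaries,
    each of shape (num_features, d_model),
    e.g. mmcs(sae_a.decoder.weight.T, sae_b.decoder.weight.T)."""
    a = F.normalize(decoder_a, dim=-1)
    b = F.normalize(decoder_b, dim=-1)
    sims = a @ b.T                          # (features_a, features_b)
    return sims.max(dim=-1).values.mean()   # best match for each feature in A
```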
Implications and Future Directions
The introduction of MLSAEs holds significant theoretical and practical implications for understanding transformer operations:
- Theoretical Understanding: By capturing how features are represented and evolve across layers, this work enhances mechanistic interpretability, facilitating the identification of meaningful circuits within transformer architectures.
- Practical Applications: Improved feature interpretation could impact the development of steering vectors and other tools aimed at making neural network behavior more transparent and controllable.
- Future Research: Further work could explore the relaxation of the fixed sparse basis assumption, investigate the scalability of MLSAEs in larger models, and refine the approach to include slight variations in feature representation across layers.
In conclusion, the MLSAE framework is a powerful tool for dissecting the layered operations of transformers, offering nuanced insights into how linguistic and semantic information is represented and transformed within these models. Such advances push the boundaries of neural network interpretability and lay the groundwork for more transparent and controllable models.