Analyze Feature Flow to Enhance Interpretation and Steering in Language Models (2502.03032v2)

Published 5 Feb 2025 in cs.LG and cs.CL

Abstract: We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of LLMs, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of LLMs.

Summary

  • The paper introduces a data-free cosine similarity approach to map SAE feature evolution across layers, revealing hidden computational pathways.
  • It constructs flow graphs that decode how features are processed by MLP and attention modules, highlighting feature birth and refinement.
  • The research demonstrates that multi-layer interventions improve model steering and control over text generation compared to single-layer methods.

The paper introduces a methodology for systematically analyzing and mapping features across layers in LLMs. It extends previous work on inter-layer feature connections, providing a more granular understanding of feature evolution and enabling targeted steering of model behavior. The approach utilizes a data-free cosine similarity technique to trace how SAE (Sparse Autoencoder) features persist, transform, or emerge across different layers and modules. This results in flow graphs that offer mechanistic insights into model computations and allow for direct manipulation of text generation through feature amplification or suppression.

The authors highlight three main contributions:

  1. Cross-Layer Feature Evolution: The paper leverages pre-trained SAEs to isolate interpretable, monosemantic directions and then uses cosine similarity between decoder weights to track how these directions evolve across layers. This reveals patterns of feature birth and refinement that are not apparent in single-layer analyses.
  2. Mechanistic Properties of Flow Graph: By constructing a flow graph, the method uncovers computational pathways that resemble circuits, where MLP and attention modules introduce or modify features.
  3. Multi-Layer Model Steering: The research demonstrates that flow graphs can enhance model steering by targeting multiple SAE features simultaneously, offering improved understanding and control over the steering outcome.

The methodology builds upon the linear representation hypothesis, which posits that hidden states $h \in \mathbb{R}^d$ can be represented as sparse linear combinations of features $f \in \mathbb{R}^d$ lying in linear subspaces $F \subset \mathbb{R}^d$, where $\mathbb{R}^d$ denotes the $d$-dimensional real vector space.
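As a toy illustration of this hypothesis (not taken from the paper), the sketch below assembles a hidden state from a handful of active directions in a random unit-norm dictionary; the sizes and coefficients are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512          # hidden size and dictionary size (illustrative)

# A dictionary of unit-norm feature directions f_i in R^d.
F = rng.normal(size=(n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# A sparse coefficient vector: only a few features are active.
z = np.zeros(n_features)
active = rng.choice(n_features, size=5, replace=False)
z[active] = rng.uniform(0.5, 2.0, size=5)

# The hypothesized hidden state is a sparse linear combination of features.
h = z @ F                        # shape (d,)
print("active features:", active, "||h|| =", np.linalg.norm(h))
```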

SAEs are employed to decompose the model's hidden state into a sparse weighted sum of interpretable features. Given a collection of one-dimensional features $F_i^{(P)}$ learned by an SAE at position $P$ in the model, the SAE can be represented as:

$$z = \sigma(W_{enc} h + b_{enc})$$

where $z$ denotes the feature activations, $\sigma$ a nonlinear activation function, $W_{enc}$ the encoder weights, $h$ the model's hidden state, and $b_{enc}$ the encoder bias, and

$$h' = W_{dec} z + b_{dec}$$

where $h'$ is the SAE's reconstruction of the hidden state, $W_{dec}$ the decoder weights, and $b_{dec}$ the decoder bias.

The SAEs are trained to reconstruct model hidden states while enforcing sparse feature activations, using a loss function:

$$L = L_{rec}(h, h') + L_{reg}(z)$$

where $L$ is the total loss, $L_{rec}$ the reconstruction loss, and $L_{reg}$ the regularization loss. $L_{rec}$ is typically $\|h - h'\|_2$, and $L_{reg}(z)$ is an $\ell_0$ proxy.
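A minimal NumPy sketch of this encoder/decoder pair and its training loss, assuming a ReLU activation for $\sigma$ and an $\ell_1$ penalty as the $\ell_0$ proxy; the weight shapes and sparsity coefficient are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features, lam = 64, 512, 1e-3   # illustrative sizes and sparsity coefficient

# Randomly initialized SAE parameters (a trained SAE would learn these).
W_enc = rng.normal(scale=0.1, size=(n_features, d))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d, n_features))
b_dec = np.zeros(d)

def sae_forward(h):
    """Encode a hidden state into sparse feature activations and reconstruct it."""
    z = np.maximum(W_enc @ h + b_enc, 0.0)   # z = sigma(W_enc h + b_enc), sigma = ReLU here
    h_rec = W_dec @ z + b_dec                # h' = W_dec z + b_dec
    return z, h_rec

def sae_loss(h):
    z, h_rec = sae_forward(h)
    rec = np.linalg.norm(h - h_rec)          # L_rec = ||h - h'||_2
    reg = lam * np.abs(z).sum()              # l1 penalty standing in for the l0 proxy
    return rec + reg

h = rng.normal(size=d)
print("loss on a random hidden state:", sae_loss(h))
```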

The paper also discusses the use of JumpReLU and Top-K activations to control sparsity and introduces transcoders as interpretable approximations of MLPs.
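For concreteness, hedged NumPy versions of these two activations are sketched below; the threshold and $k$ are placeholder values rather than ones used in the paper.

```python
import numpy as np

def jump_relu(z, theta=0.5):
    """JumpReLU: pass activations through only where they exceed a threshold theta."""
    return z * (z > theta)

def top_k(z, k=32):
    """Top-K: keep the k largest activations and zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k)[-k:]
    out[idx] = z[idx]
    return out

z = np.random.default_rng(0).normal(size=512)
print("JumpReLU nonzeros:", np.count_nonzero(jump_relu(z)),
      "| Top-K nonzeros:", np.count_nonzero(top_k(z)))
```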

To identify similar features between different layers, a permutation matrix $P(A \rightarrow B)$ is defined to map feature indices from layer $A$ to layer $B$:

$$P(A \rightarrow B) = \underset{P \in \Pi_{|F|}}{\arg\min} \; \sum_{i=1}^{d} \left\| W_{dec_{i,:}}^{(B)} - W_{dec_{i,:}}^{(A)} \, P(A \rightarrow B) \right\|_2$$

where $P(A \rightarrow B)$ is the permutation matrix from layer $A$ to layer $B$, $\Pi_{|F|}$ the set of permutation matrices of size $|F| \times |F|$, and $W_{dec}^{(A)}$, $W_{dec}^{(B)}$ the decoder weights of the SAEs trained on the residual stream after layers $A$ and $B$, respectively.
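One way such a permutation could be computed, assuming the objective reduces to a standard linear assignment over pairs of decoder directions (an assumption of this sketch, not necessarily the original procedure), is the Hungarian algorithm via SciPy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features_by_permutation(W_dec_A, W_dec_B):
    """Match features of two SAEs by aligning their decoder directions.

    W_dec_A, W_dec_B: decoder matrices of shape (d, n_features), one column per
    feature direction. Returns cols, where cols[i] is the layer-B feature index
    matched to feature i of layer A.
    """
    # Cost of assigning feature i in A to feature j in B: Euclidean distance
    # between their decoder columns. Fine for a sketch; real SAE dictionaries
    # would need a more memory-efficient formulation.
    diff = W_dec_A[:, :, None] - W_dec_B[:, None, :]   # (d, n_A, n_B)
    cost = np.linalg.norm(diff, axis=0)                # (n_A, n_B)
    rows, cols = linear_sum_assignment(cost)           # Hungarian algorithm
    return cols
```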

The method in this paper focuses on finding shared features by discovering a mapping $T_{A \rightarrow B}: F^{(A)} \rightarrow F^{(B)}$ between features at different levels of the model. It employs cosine similarity between decoder weights as the similarity metric. Given a feature embedding $f \in \mathbb{R}^d$ trained at position $A$ and decoder weights $W^{(B)} \in \mathbb{R}^{d \times |F|}$ trained at position $B$, the matched feature index is found by:

$$j = \underset{j}{\arg\max} \left( f \cdot W_{dec_j}^{(B)} \right)$$

The transformation $T(A \rightarrow B)$ is then defined as:

$$T(A \rightarrow B) = \mathbb{I}_{x>0}\left(\text{topk}\left(f^{T} W_{dec}^{(B)}\right)\right)$$

where $T(A \rightarrow B)$ is the transformation from position $A$ to position $B$, $\mathbb{I}_{x>0}$ the indicator function, and $\text{topk}$ zeroes out values below the $k$-th order statistic.
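A sketch of this matching and top-k transformation, assuming decoder columns are normalized so that dot products act as cosine similarities; the function names and $k$ are illustrative.

```python
import numpy as np

def normalize_columns(W):
    """Scale each decoder column (feature direction) to unit norm."""
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def match_feature(f, W_dec_B):
    """Return the layer-B feature index whose decoder direction is most similar to f."""
    sims = normalize_columns(W_dec_B).T @ (f / np.linalg.norm(f))   # cosine similarities
    return int(np.argmax(sims))

def transform_topk(f, W_dec_B, k=5):
    """T(A -> B): indicator over the k most similar layer-B features (positive only)."""
    sims = normalize_columns(W_dec_B).T @ (f / np.linalg.norm(f))
    keep = np.zeros_like(sims)
    idx = np.argpartition(sims, -k)[-k:]         # top-k by similarity
    keep[idx] = (sims[idx] > 0).astype(float)    # indicator of positive values
    return keep
```

Applied to every feature of the layer-$A$ dictionary, `transform_topk` yields a sparse correspondence between the two feature sets.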

The approach tracks feature evolution by considering four main computational points in a transformer layer: the layer output $R_L$, the MLP output $M$, the attention output $A$, and the previous layer output $R_{L-1}$. By computing the similarity $s(P)$ between a target feature and each of these points, the method infers how the feature relates to the previous layer or to the current layer's modules.

$$s(P) = \max_j \left( f \cdot W_{dec_j}^{(P)} \right)$$

where $s(P)$ is the similarity between the target feature and point $P$, $f$ the feature embedding, and $W_{dec}^{(P)}$ the decoder weights at point $P$.

Based on these similarity scores, the origin and transformation of features are categorized into four scenarios (see the sketch after this list):

  • High $s(R)$ and low $s(M)$, $s(A)$: The feature likely existed in $R_{L-1}$ and was translated to $R_L$.
  • High $s(R)$ and high $s(M)$ or $s(A)$: The feature was likely processed by the MLP or attention module.
  • Low $s(R)$ but high $s(M)$ or $s(A)$: The feature may be newborn, created by the MLP or attention module.
  • Low $s(R)$ and low $s(M)$, $s(A)$: The feature cannot be easily explained by maximum cosine similarity alone.
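The sketch below computes $s(P)$ for the three candidate sources and applies the four-way categorization above; the similarity threshold of 0.7 is an illustrative assumption, not a value reported in the paper.

```python
import numpy as np

def max_cosine_similarity(f, W_dec_P):
    """s(P): maximum cosine similarity between feature f and any decoder direction at point P."""
    W = W_dec_P / np.linalg.norm(W_dec_P, axis=0, keepdims=True)
    return float(np.max(W.T @ (f / np.linalg.norm(f))))

def categorize_feature(f, W_dec_prev_res, W_dec_mlp, W_dec_attn, threshold=0.7):
    """Classify a layer-L residual feature by where it most plausibly comes from."""
    s_R = max_cosine_similarity(f, W_dec_prev_res)   # previous residual stream R_{L-1}
    s_M = max_cosine_similarity(f, W_dec_mlp)        # MLP output
    s_A = max_cosine_similarity(f, W_dec_attn)       # attention output
    if s_R >= threshold and max(s_M, s_A) < threshold:
        return "translated from previous layer"
    if s_R >= threshold:
        return "processed by MLP or attention"
    if max(s_M, s_A) >= threshold:
        return "newborn (created by MLP or attention)"
    return "unexplained by cosine similarity alone"
```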

The paper addresses the challenge of long-range feature matching by performing short-range matching in consecutive layers and composing the resulting transformations to construct flow graphs. These graphs trace the evolution of a feature's semantic properties throughout the model.
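As a sketch of this composition step, reusing the hypothetical transform_topk helper from the earlier snippet and plain dictionaries rather than any particular graph library:

```python
import numpy as np

def build_flow_graph(feature_idx, decoder_stack, k=5):
    """Trace a feature across consecutive layers by composing short-range matches.

    decoder_stack: list of (d, n_features) residual-stream decoder matrices,
    one per layer, ordered from the layer where feature_idx lives onward.
    Returns, for each subsequent layer, the indices of matched features.
    """
    flow = {0: [feature_idx]}
    for layer in range(1, len(decoder_stack)):
        matched = set()
        for i in flow[layer - 1]:
            f = decoder_stack[layer - 1][:, i]                 # current feature direction
            keep = transform_topk(f, decoder_stack[layer], k)  # top-k cosine matches one layer up
            matched.update(np.nonzero(keep)[0].tolist())
        flow[layer] = sorted(matched)
    return flow
```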

The authors conduct experiments using the Gemma 2 2B model and the Gemma Scope SAE pack, along with Llama Scope. They analyze how residual features emerge, propagate, and can be manipulated across model layers. The experiments aim to determine how features originate in different model components, assess whether deactivating a predecessor feature deactivates its descendant, and use these insights to steer the model's generation toward or away from specific topics.

The results of the experiments indicate that:

  • Similarity of linear directions is a good proxy for activation correlation.
  • The structure of feature groups differs across layers, reflecting information processing within the model.
  • The top-1 similarity provides valuable information about causal dependencies.
  • Multi-layer interventions influence the model more strongly than single-layer approaches and are more effective at achieving the desired steering outcome.
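A hedged sketch of what such a multi-layer intervention could look like: residual hidden states at several layers are shifted along the matched feature's decoder direction, with a positive coefficient amplifying the theme and a negative one suppressing it. The function names and coefficient are illustrative, not the paper's implementation or any specific library's API.

```python
import numpy as np

def steer_hidden_state(h, feature_direction, alpha=4.0):
    """Shift a residual hidden state along a unit-norm feature direction.

    alpha > 0 amplifies the feature's theme, alpha < 0 suppresses it.
    """
    direction = feature_direction / np.linalg.norm(feature_direction)
    return h + alpha * direction

def multi_layer_steering(hidden_states, flow, decoder_stack, alpha=4.0):
    """Apply the same intervention at every layer where the flow graph tracks the feature.

    hidden_states: dict layer -> residual hidden state of shape (d,)
    flow: dict layer -> matched feature indices (e.g. from build_flow_graph)
    decoder_stack: list of (d, n_features) decoder matrices, one per layer
    """
    steered = dict(hidden_states)
    for layer, indices in flow.items():
        for i in indices:
            steered[layer] = steer_hidden_state(steered[layer],
                                                decoder_stack[layer][:, i], alpha)
    return steered
```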

The paper concludes by highlighting the potential of the proposed method for identifying and interpreting a model's computational graph, enabling precise control over its internal feature representations and opening new avenues for zero-shot steering.