Analyze Feature Flow to Enhance Interpretation and Steering in Language Models (2502.03032v2)

Published 5 Feb 2025 in cs.LG and cs.CL

Abstract: We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of LLMs, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of LLMs.

Summary

  • The paper introduces a data-free cosine similarity approach to map SAE feature evolution across layers, revealing hidden computational pathways.
  • It constructs flow graphs that decode how features are processed by MLP and attention modules, highlighting feature birth and refinement.
  • The research demonstrates that multi-layer interventions improve model steering and control over text generation compared to single-layer methods.

The paper introduces a methodology for systematically analyzing and mapping features across layers in LLMs. It extends previous work on inter-layer feature connections, providing a more granular understanding of feature evolution and enabling targeted steering of model behavior. The approach utilizes a data-free cosine similarity technique to trace how SAE (Sparse Autoencoder) features persist, transform, or emerge across different layers and modules. This results in flow graphs that offer mechanistic insights into model computations and allow for direct manipulation of text generation through feature amplification or suppression.

The authors highlight three main contributions:

  1. Cross-Layer Feature Evolution: The paper leverages pre-trained SAEs to isolate interpretable, monosemantic directions and then uses cosine similarity between decoder weights to track how these directions evolve across layers. This reveals patterns of feature birth and refinement that are not apparent in single-layer analyses.
  2. Mechanistic Properties of Flow Graph: By constructing a flow graph, the method uncovers computational pathways that resemble circuits, where MLP and attention modules introduce or modify features.
  3. Multi-Layer Model Steering: The research demonstrates that flow graphs can enhance model steering by targeting multiple SAE features simultaneously, offering improved understanding and control over the steering outcome.

The methodology builds upon the linear representation hypothesis, which posits that hidden states $h \in \mathbb{R}^d$ can be represented as sparse linear combinations of features $f \in \mathbb{R}^d$ lying in linear subspaces $F \subset \mathbb{R}^d$, where $\mathbb{R}^d$ denotes the $d$-dimensional real vector space.
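As a toy illustration of this hypothesis (not taken from the paper), the sketch below assembles a hidden state from a handful of active directions in a random unit-norm dictionary; the sizes and coefficients are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512          # hidden size and dictionary size (illustrative)

# A dictionary of unit-norm feature directions f_i in R^d.
F = rng.normal(size=(n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# A sparse coefficient vector: only a few features are active.
z = np.zeros(n_features)
active = rng.choice(n_features, size=5, replace=False)
z[active] = rng.uniform(0.5, 2.0, size=5)

# The hypothesized hidden state is a sparse linear combination of features.
h = z @ F                        # shape (d,)
print("active features:", active, "||h|| =", np.linalg.norm(h))
```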

SAEs are employed to decompose the model's hidden state into a sparse weighted sum of interpretable features. Given a collection of one-dimensional features $F_i^{(P)}$ learned by an SAE at position $P$ in the model, the SAE can be represented as:

$$z = \sigma(W_{enc} h + b_{enc})$$

where $z$ denotes the feature activations, $\sigma$ a nonlinear activation function, $W_{enc}$ the encoder weights, $h$ the model's hidden state, and $b_{enc}$ the encoder bias, and

$$h' = W_{dec} z + b_{dec}$$

where $h'$ is the SAE's reconstruction of the hidden state, $W_{dec}$ the decoder weights, and $b_{dec}$ the decoder bias.

The SAEs are trained to reconstruct model hidden states while enforcing sparse feature activations, using a loss function:

$$L = L_{rec}(h, h') + L_{reg}(z)$$

where $L$ is the total loss, $L_{rec}$ the reconstruction loss, and $L_{reg}$ the regularization loss. $L_{rec}$ is typically $\|h - h'\|_2$, and $L_{reg}(z)$ is an $\ell_0$ proxy.
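A minimal NumPy sketch of this encoder/decoder pair and its training loss, assuming a ReLU activation for $\sigma$ and an $\ell_1$ penalty as the $\ell_0$ proxy; the weight shapes and sparsity coefficient are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features, lam = 64, 512, 1e-3   # illustrative sizes and sparsity coefficient

# Randomly initialized SAE parameters (a trained SAE would learn these).
W_enc = rng.normal(scale=0.1, size=(n_features, d))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d, n_features))
b_dec = np.zeros(d)

def sae_forward(h):
    """Encode a hidden state into sparse feature activations and reconstruct it."""
    z = np.maximum(W_enc @ h + b_enc, 0.0)   # z = sigma(W_enc h + b_enc), sigma = ReLU here
    h_rec = W_dec @ z + b_dec                # h' = W_dec z + b_dec
    return z, h_rec

def sae_loss(h):
    z, h_rec = sae_forward(h)
    rec = np.linalg.norm(h - h_rec)          # L_rec = ||h - h'||_2
    reg = lam * np.abs(z).sum()              # l1 penalty standing in for the l0 proxy
    return rec + reg

h = rng.normal(size=d)
print("loss on a random hidden state:", sae_loss(h))
```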

The paper also discusses the use of JumpReLU and Top-K activations to control sparsity and introduces transcoders as interpretable approximations of MLPs.
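For concreteness, hedged NumPy versions of these two activations are sketched below; the threshold and $k$ are placeholder values rather than ones used in the paper.

```python
import numpy as np

def jump_relu(z, theta=0.5):
    """JumpReLU: pass activations through only where they exceed a threshold theta."""
    return z * (z > theta)

def top_k(z, k=32):
    """Top-K: keep the k largest activations and zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k)[-k:]
    out[idx] = z[idx]
    return out

z = np.random.default_rng(0).normal(size=512)
print("JumpReLU nonzeros:", np.count_nonzero(jump_relu(z)),
      "| Top-K nonzeros:", np.count_nonzero(top_k(z)))
```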

To identify similar features between different layers, a permutation matrix $P(A \rightarrow B)$ is defined to map feature indices from layer $A$ to layer $B$:

$$P(A \rightarrow B) = \underset{P \in \Pi_{|F|}}{\arg\min} \; \sum_{i=1}^{d} \left\| W_{dec_{i,:}}^{(B)} - W_{dec_{i,:}}^{(A)} \, P(A \rightarrow B) \right\|_2$$

where $P(A \rightarrow B)$ is the permutation matrix from layer $A$ to layer $B$, $\Pi_{|F|}$ the set of permutation matrices of size $|F| \times |F|$, and $W_{dec}^{(A)}$, $W_{dec}^{(B)}$ the decoder weights of the SAEs trained on the residual stream after layers $A$ and $B$, respectively.
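One way such a permutation could be computed, assuming the objective reduces to a standard linear assignment over pairs of decoder directions (an assumption of this sketch, not necessarily the original procedure), is the Hungarian algorithm via SciPy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features_by_permutation(W_dec_A, W_dec_B):
    """Match features of two SAEs by aligning their decoder directions.

    W_dec_A, W_dec_B: decoder matrices of shape (d, n_features), one column per
    feature direction. Returns cols, where cols[i] is the layer-B feature index
    matched to feature i of layer A.
    """
    # Cost of assigning feature i in A to feature j in B: Euclidean distance
    # between their decoder columns. Fine for a sketch; real SAE dictionaries
    # would need a more memory-efficient formulation.
    diff = W_dec_A[:, :, None] - W_dec_B[:, None, :]   # (d, n_A, n_B)
    cost = np.linalg.norm(diff, axis=0)                # (n_A, n_B)
    rows, cols = linear_sum_assignment(cost)           # Hungarian algorithm
    return cols
```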

The method in this paper focuses on finding shared features by discovering a mapping $T_{A \rightarrow B}: F^{(A)} \rightarrow F^{(B)}$ between features at different levels of the model. It employs cosine similarity between decoder weights as the similarity metric. Given a feature embedding $f \in \mathbb{R}^d$ trained at position $A$ and decoder weights $W^{(B)} \in \mathbb{R}^{d \times |F|}$ trained at position $B$, the matched feature index is found by:

$$j = \underset{j}{\arg\max} \left( f \cdot W_{dec_j}^{(B)} \right)$$

The transformation $T(A \rightarrow B)$ is then defined as:

$$T(A \rightarrow B) = \mathbb{I}_{x>0}\left(\text{topk}\left(f^{T} W_{dec}^{(B)}\right)\right)$$

where $T(A \rightarrow B)$ is the transformation from position $A$ to position $B$, $\mathbb{I}_{x>0}$ the indicator function, and $\text{topk}$ zeroes out values below the $k$-th order statistic.
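A sketch of this matching and top-k transformation, assuming decoder columns are normalized so that dot products act as cosine similarities; the function names and $k$ are illustrative.

```python
import numpy as np

def normalize_columns(W):
    """Scale each decoder column (feature direction) to unit norm."""
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def match_feature(f, W_dec_B):
    """Return the layer-B feature index whose decoder direction is most similar to f."""
    sims = normalize_columns(W_dec_B).T @ (f / np.linalg.norm(f))   # cosine similarities
    return int(np.argmax(sims))

def transform_topk(f, W_dec_B, k=5):
    """T(A -> B): indicator over the k most similar layer-B features (positive only)."""
    sims = normalize_columns(W_dec_B).T @ (f / np.linalg.norm(f))
    keep = np.zeros_like(sims)
    idx = np.argpartition(sims, -k)[-k:]         # top-k by similarity
    keep[idx] = (sims[idx] > 0).astype(float)    # indicator of positive values
    return keep
```

Applied to every feature of the layer-$A$ dictionary, `transform_topk` yields a sparse correspondence between the two feature sets.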

The approach tracks feature evolution by considering four main computational points in a transformer layer: the layer output $R_L$, the MLP output $M$, the attention output $A$, and the previous layer output $R_{L-1}$. By computing the similarity $s(P)$ between a target feature and each of these points, the method infers how the feature relates to the previous layer or to the current layer's modules.

$$s(P) = \max_j \left( f \cdot W_{dec_j}^{(P)} \right)$$

where $s(P)$ is the similarity between the target feature and point $P$, $f$ the feature embedding, and $W_{dec}^{(P)}$ the decoder weights at point $P$.

Based on these similarity scores, the origin and transformation of features are categorized into four scenarios (see the sketch after this list):

  • High $s(R)$ and low $s(M)$, $s(A)$: The feature likely existed in $R_{L-1}$ and was translated to $R_L$.
  • High $s(R)$ and high $s(M)$ or $s(A)$: The feature was likely processed by the MLP or attention module.
  • Low $s(R)$ but high $s(M)$ or $s(A)$: The feature may be newborn, created by the MLP or attention module.
  • Low $s(R)$ and low $s(M)$, $s(A)$: The feature cannot be easily explained by maximum cosine similarity alone.
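The sketch below computes $s(P)$ for the three candidate sources and applies the four-way categorization above; the similarity threshold of 0.7 is an illustrative assumption, not a value reported in the paper.

```python
import numpy as np

def max_cosine_similarity(f, W_dec_P):
    """s(P): maximum cosine similarity between feature f and any decoder direction at point P."""
    W = W_dec_P / np.linalg.norm(W_dec_P, axis=0, keepdims=True)
    return float(np.max(W.T @ (f / np.linalg.norm(f))))

def categorize_feature(f, W_dec_prev_res, W_dec_mlp, W_dec_attn, threshold=0.7):
    """Classify a layer-L residual feature by where it most plausibly comes from."""
    s_R = max_cosine_similarity(f, W_dec_prev_res)   # previous residual stream R_{L-1}
    s_M = max_cosine_similarity(f, W_dec_mlp)        # MLP output
    s_A = max_cosine_similarity(f, W_dec_attn)       # attention output
    if s_R >= threshold and max(s_M, s_A) < threshold:
        return "translated from previous layer"
    if s_R >= threshold:
        return "processed by MLP or attention"
    if max(s_M, s_A) >= threshold:
        return "newborn (created by MLP or attention)"
    return "unexplained by cosine similarity alone"
```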

The paper addresses the challenge of long-range feature matching by performing short-range matching in consecutive layers and composing the resulting transformations to construct flow graphs. These graphs trace the evolution of a feature's semantic properties throughout the model.
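As a sketch of this composition step, reusing the hypothetical transform_topk helper from the earlier snippet and plain dictionaries rather than any particular graph library:

```python
import numpy as np

def build_flow_graph(feature_idx, decoder_stack, k=5):
    """Trace a feature across consecutive layers by composing short-range matches.

    decoder_stack: list of (d, n_features) residual-stream decoder matrices,
    one per layer, ordered from the layer where feature_idx lives onward.
    Returns, for each subsequent layer, the indices of matched features.
    """
    flow = {0: [feature_idx]}
    for layer in range(1, len(decoder_stack)):
        matched = set()
        for i in flow[layer - 1]:
            f = decoder_stack[layer - 1][:, i]                 # current feature direction
            keep = transform_topk(f, decoder_stack[layer], k)  # top-k cosine matches one layer up
            matched.update(np.nonzero(keep)[0].tolist())
        flow[layer] = sorted(matched)
    return flow
```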

The authors conduct experiments using the Gemma 2 2B model and the Gemma Scope SAE pack, along with Llama Scope. They analyze how residual features emerge, propagate, and can be manipulated across model layers. The experiments aim to determine how features originate in different model components, assess whether deactivating a predecessor feature deactivates its descendant, and use these insights to steer the model's generation toward or away from specific topics.

The results of the experiments indicate that:

  • Similarity of linear directions is a good proxy for activation correlation.
  • The structure of feature groups differs across layers, reflecting information processing within the model.
  • The top-1 similarity provides valuable information about causal dependencies.
  • Multi-layer interventions influence the model more strongly than single-layer approaches and are more effective at achieving the desired steering outcome.
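A hedged sketch of what such a multi-layer intervention could look like: residual hidden states at several layers are shifted along the matched feature's decoder direction, with a positive coefficient amplifying the theme and a negative one suppressing it. The function names and coefficient are illustrative, not the paper's implementation or any specific library's API.

```python
import numpy as np

def steer_hidden_state(h, feature_direction, alpha=4.0):
    """Shift a residual hidden state along a unit-norm feature direction.

    alpha > 0 amplifies the feature's theme, alpha < 0 suppresses it.
    """
    direction = feature_direction / np.linalg.norm(feature_direction)
    return h + alpha * direction

def multi_layer_steering(hidden_states, flow, decoder_stack, alpha=4.0):
    """Apply the same intervention at every layer where the flow graph tracks the feature.

    hidden_states: dict layer -> residual hidden state of shape (d,)
    flow: dict layer -> matched feature indices (e.g. from build_flow_graph)
    decoder_stack: list of (d, n_features) decoder matrices, one per layer
    """
    steered = dict(hidden_states)
    for layer, indices in flow.items():
        for i in indices:
            steered[layer] = steer_hidden_state(steered[layer],
                                                decoder_stack[layer][:, i], alpha)
    return steered
```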

The paper concludes by highlighting the potential of the proposed method for identifying and interpreting a model's computational graph, enabling precise control over its internal feature representations and opening new avenues for zero-shot steering.