- The paper introduces a data-free cosine similarity approach to map SAE feature evolution across layers, revealing hidden computational pathways.
- It constructs flow graphs that decode how features are processed by MLP and attention modules, highlighting feature birth and refinement.
- The research demonstrates that multi-layer interventions improve model steering and control over text generation compared to single-layer methods.
The paper introduces a methodology for systematically analyzing and mapping features across layers in LLMs. It extends previous work on inter-layer feature connections, providing a more granular understanding of feature evolution and enabling targeted steering of model behavior. The approach utilizes a data-free cosine similarity technique to trace how SAE (Sparse Autoencoder) features persist, transform, or emerge across different layers and modules. This results in flow graphs that offer mechanistic insights into model computations and allow for direct manipulation of text generation through feature amplification or suppression.
The authors highlight three main contributions:
- Cross-Layer Feature Evolution: The paper leverages pre-trained SAEs to isolate interpretable, monosemantic directions and then uses cosine similarity between decoder weights to track how these directions evolve across layers. This reveals patterns of feature birth and refinement that are not apparent in single-layer analyses.
- Mechanistic Properties of the Flow Graph: By constructing flow graphs, the method uncovers computational pathways that resemble circuits, in which MLP and attention modules introduce or modify features.
- Multi-Layer Model Steering: The research demonstrates that flow graphs can enhance model steering by targeting multiple SAE features simultaneously, offering improved understanding and control over the steering outcome.
The methodology builds upon the linear representation hypothesis, which posits that hidden states h ∈ R^d can be represented as sparse linear combinations of features f ∈ R^d lying in linear subspaces F ⊂ R^d.
h: hidden state
R^d: d-dimensional real vector space
f: feature vector
F: linear subspace of features
SAEs are employed to decompose the model's hidden state into a sparse weighted sum of interpretable features. Given a collection of one-dimensional features F_i^(P) learned by an SAE at position P in the model, the SAE can be represented as:
z = σ(W_enc h + b_enc)
z: feature activations
σ: nonlinear activation function
W_enc: encoder weights
h: model's hidden state
b_enc: encoder bias
h′ = W_dec z + b_dec
h′: SAE's reconstruction of the hidden state
W_dec: decoder weights
b_dec: decoder bias
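The encoder/decoder pair above can be sketched in a few lines; the randomly initialized weights here stand in for a trained SAE, and the shapes and ReLU nonlinearity are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 8, 32          # hidden size d, SAE dictionary size |F|

# Randomly initialized SAE parameters (stand-ins for trained weights).
W_enc = rng.standard_normal((n_features, d)) * 0.1
b_enc = np.zeros(n_features)
W_dec = rng.standard_normal((d, n_features)) * 0.1
b_dec = np.zeros(d)

def relu(x):
    return np.maximum(x, 0.0)

def sae_encode(h):
    """z = sigma(W_enc h + b_enc): sparse feature activations."""
    return relu(W_enc @ h + b_enc)

def sae_decode(z):
    """h' = W_dec z + b_dec: reconstruction of the hidden state."""
    return W_dec @ z + b_dec

h = rng.standard_normal(d)
z = sae_encode(h)
h_rec = sae_decode(z)
```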
The SAEs are trained to reconstruct model hidden states while enforcing sparse feature activations, using a loss function:
L = L_rec(h, h′) + L_reg(z)
L: total loss
L_rec: reconstruction loss
L_reg: regularization (sparsity) loss
where L_rec is typically ||h − h′||², and L_reg(z) is an ℓ0 proxy.
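As a concrete instance, the loss can be computed directly; here an ℓ1 penalty stands in for the ℓ0 proxy, and the weighting λ is an arbitrary illustrative value:

```python
import numpy as np

def sae_loss(h, h_rec, z, lam=1e-3):
    """L = L_rec + L_reg: squared reconstruction error plus a sparsity
    penalty (an l1 term standing in for the l0 proxy)."""
    l_rec = np.sum((h - h_rec) ** 2)
    l_reg = lam * np.sum(np.abs(z))
    return l_rec + l_reg

h = np.array([1.0, -2.0, 0.5])
h_rec = np.array([0.9, -1.8, 0.4])
z = np.array([0.0, 2.0, 0.0, 1.0])
loss = sae_loss(h, h_rec, z)   # 0.06 reconstruction + 0.003 sparsity
```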
The paper also discusses the use of JumpReLU and Top-K activations to control sparsity and introduces transcoders as interpretable approximations of MLPs.
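The two activation variants can be sketched as follows; the threshold and k values are illustrative, not the paper's settings:

```python
import numpy as np

def jump_relu(x, theta=0.5):
    """JumpReLU: pass values through only where they exceed threshold theta."""
    return np.where(x > theta, x, 0.0)

def top_k(x, k=2):
    """Top-K: keep the k largest pre-activations, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(x)[-k:]          # indices of the k largest entries
    out[idx] = np.maximum(x[idx], 0.0)
    return out

pre = np.array([0.2, 1.3, -0.4, 0.7, 0.1])
jr = jump_relu(pre)   # only 1.3 and 0.7 survive the threshold
tk = top_k(pre, k=2)  # same two entries survive as the top-2
```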
To identify similar features between different layers, a permutation matrix P(A→B) is defined to map feature indices from layer A to layer B:
P(A→B) = argmin_{P ∈ Π_|F|} Σ_{i=1}^{d} ||W_dec,i,:^(B) − W_dec,i,:^(A) P||²
P(A→B): permutation matrix from layer A to layer B
Π_|F|: set of permutation matrices of size |F|×|F|
W_dec^(A): decoder weights of the SAE trained on the residual stream after layer A
W_dec^(B): decoder weights of the SAE trained on the residual stream after layer B
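At toy scale the permutation objective can be solved by brute force; the decoder matrices, the true permutation, and the noise level below are all synthetic (the paper-scale problem would use an assignment solver instead):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d, n_feat = 5, 3

# Toy decoder weights: layer B reuses layer A's features in shuffled order.
W_A = rng.standard_normal((d, n_feat))
true_perm = [2, 0, 1]   # column j of W_B matches column true_perm[j] of W_A
W_B = W_A[:, true_perm] + 0.01 * rng.standard_normal((d, n_feat))

def best_permutation(W_A, W_B):
    """Brute-force the permutation P minimizing ||W_B - W_A P||_F^2."""
    n = W_A.shape[1]
    best, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        P = np.eye(n)[:, list(perm)]        # permutation matrix
        cost = np.sum((W_B - W_A @ P) ** 2)
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

perm = best_permutation(W_A, W_B)   # recovers the shuffled order
```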
The method in this paper focuses on finding shared features by discovering a mapping T_{A→B}: F(A) → F(B) between features at different positions in the model. It employs cosine similarity between decoder weights as the similarity metric. Given a feature embedding f ∈ R^d trained at position A and decoder weights W_dec^(B) ∈ R^{d×|F|} trained at position B, the matched feature index is found by:
j* = argmax_j (f ⋅ W_dec,j^(B))
The transformation T(A→B) is then defined as:
T(A→B) = I_{x>0}(top_k(fᵀ W_dec^(B)))
T(A→B): transformation from position A to position B
I_{x>0}: indicator function
top_k: zeroes out values below the k-th order statistic
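A minimal sketch of the cosine-similarity matching and the top-k indicator transformation, using synthetic decoder weights (the function name and k are illustrative):

```python
import numpy as np

def match_feature(f, W_dec_B, k=3):
    """Match feature f against decoder columns at position B.
    Returns the best-matching index j* and the sparse transformation
    T(A->B) = I_{x>0}(top_k(f^T W_dec_B))."""
    # Normalize so dot products are cosine similarities.
    f_n = f / np.linalg.norm(f)
    W_n = W_dec_B / np.linalg.norm(W_dec_B, axis=0, keepdims=True)
    sims = f_n @ W_n                      # one similarity per feature at B
    j_star = int(np.argmax(sims))
    # top_k: zero out everything below the k-th order statistic.
    thresh = np.sort(sims)[-k]
    t = np.where(sims >= thresh, sims, 0.0)
    return j_star, (t > 0).astype(float)  # indicator I_{x>0}

rng = np.random.default_rng(2)
d, n_feat = 6, 10
W_dec_B = rng.standard_normal((d, n_feat))
f = W_dec_B[:, 4] + 0.05 * rng.standard_normal(d)  # nearly column 4
j_star, t = match_feature(f, W_dec_B, k=3)
```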
The approach tracks feature evolution by considering four main computational points in a transformer layer: the layer output R^L, the MLP output M, the attention output A, and the previous layer output R^{L−1}. By computing the similarity s(P) between a target feature and each of these points, the method infers how the feature relates to the previous layer or to the layer's modules.
s(P) = max_j (f ⋅ W_dec,j^(P))
s(P): similarity between the target feature and point P
f: feature embedding
W_dec^(P): decoder weights at point P
Based on these similarity scores, the origin and transformation of features are categorized into four scenarios:
- High s(R) and low s(M), s(A): The feature likely already existed in R^{L−1} and was translated to R^L.
- High s(R) and high s(M) or s(A): The feature was likely processed by the MLP or attention module.
- Low s(R) but high s(M) or s(A): The feature may be newborn, created by the MLP or attention module.
- Low s(R) and low s(M), s(A): The feature cannot be easily explained by maximum cosine similarity alone.
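These four scenarios can be read off the scores with a simple threshold rule; the 0.5 cutoff and the labels below are illustrative choices, not the paper's:

```python
def classify_feature_origin(s_R, s_M, s_A, thresh=0.5):
    """Heuristic reading of the four scenarios from similarity scores
    (the 0.5 threshold is an illustrative choice)."""
    high_module = s_M >= thresh or s_A >= thresh
    if s_R >= thresh and not high_module:
        return "translated from previous layer"
    if s_R >= thresh and high_module:
        return "processed by MLP/attention"
    if s_R < thresh and high_module:
        return "newborn (created by MLP/attention)"
    return "unexplained by cosine similarity"
```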
The paper addresses the challenge of long-range feature matching by performing short-range matching in consecutive layers and composing the resulting transformations to construct flow graphs. These graphs trace the evolution of a feature's semantic properties throughout the model.
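Composing short-range matches into a long-range trace can be sketched as follows, with hypothetical per-layer index maps standing in for the learned transformations:

```python
def compose_maps(maps):
    """Compose short-range feature matches (layer L -> L+1 index maps)
    into a long-range trace through consecutive layers."""
    def follow(idx):
        path = [idx]
        for m in maps:
            idx = m[idx]
            path.append(idx)
        return path
    return follow

# Toy index maps between three consecutive layers (hypothetical matches).
map_0_to_1 = {0: 2, 1: 1, 2: 0}
map_1_to_2 = {0: 1, 1: 0, 2: 2}
trace = compose_maps([map_0_to_1, map_1_to_2])
path = trace(0)   # feature 0 at layer 0, followed across two layers
```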
The authors conduct experiments using the Gemma 2 2B model and the Gemma Scope SAE pack, along with Llama Scope. They analyze how residual features emerge, propagate, and can be manipulated across model layers. The experiments aim to determine how features originate in different model components, assess whether deactivating a predecessor feature deactivates its descendant, and use these insights to steer the model's generation toward or away from specific topics.
The results of the experiments indicate that:
- Similarity of linear directions is a good proxy for activation correlation.
- The structure of feature groups differs across layers, reflecting information processing within the model.
- Top-1 similarity provides valuable information about causal dependencies.
- Multi-layer interventions achieve the desired steering outcome more effectively than single-layer approaches.
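A schematic of the multi-layer intervention idea: add a scaled copy of the matched feature direction to the residual stream at every layer where the flow graph places the feature. The directions, scale α, and dimensions here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8

# Hypothetical decoder directions for one flow-graph feature at three layers.
directions = [rng.standard_normal(d) for _ in range(3)]
directions = [v / np.linalg.norm(v) for v in directions]

def steer(hidden_states, directions, alpha=4.0):
    """Multi-layer steering: amplify the feature by adding alpha times the
    matched (unit-norm) direction to each layer's residual stream."""
    return [h + alpha * v for h, v in zip(hidden_states, directions)]

hs = [rng.standard_normal(d) for _ in range(3)]
steered = steer(hs, directions)   # each layer's projection onto its direction grows
```

Suppressing a feature would instead use a negative α, pushing the residual stream away from the feature direction at each layer.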
The paper concludes by highlighting the potential of the proposed method for identifying and interpreting the computational graph of a model, enabling precise control over the model's internal structure and opening new avenues for zero-shot steering.