- The paper introduces a method to identify semantic modules in LLMs by analyzing sparse autoencoder coactivations.
- The approach uses graph construction and density-based pruning to isolate interpretable features for country and relation tasks.
- The paper demonstrates causal validation by ablating and amplifying components, achieving up to 96% success in output steering.
This paper introduces a method to identify and manipulate "semantic components" within LLMs by analyzing the coactivation patterns of sparse autoencoder (SAE) features. The core idea is that by observing which interpretable features (derived from SAEs) activate together across different layers in response to a prompt, one can uncover modular subnetworks corresponding to specific concepts like countries or relations (e.g., "capital of").
The methodology involves several steps:
- Activation Collection: For a given input prompt, activations from the residual stream of each layer of an LLM (Gemma 2 2B in this paper) are passed through pre-trained SAEs. This yields a sparse representation of the activations at each layer, where the SAE dictionary width is d_sae = 16384.
- Feature Selection: To manage computational load, only the top k=5 most activated SAE features at each token position within each layer are selected.
- Graph Construction: A directed graph is built where nodes represent the selected SAE features (ℓ, i) (feature i in layer ℓ). Edges are drawn between features (ℓ, i) and (ℓ+1, j) in adjacent layers if the Pearson correlation of their activation patterns across the prompt's tokens exceeds a threshold τ_corr = 0.9.
- Density-Based Pruning: Features that activate very frequently across diverse contexts (high activation density, d_{ℓ,i} > 0.01, based on Neuronpedia data) are considered generic and are pruned from the graph. This step aims to retain only sparse, more interpretable features.
- Component Identification: Weakly connected components are identified in the pruned graph using a Breadth-First Search (BFS) algorithm. These components are hypothesized to represent semantic modules (a sketch of the selection, graph, pruning, and component steps appears after this list).
- Causal Validation: The functional role of these components is tested by intervening in the model's forward pass. This involves ablating (setting activations to zero) or amplifying (increasing activations) the SAE features within a component and observing the change in the model's next-token predictions, quantified by KL divergence (an intervention sketch also follows the list).
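Below is a minimal sketch of the selection, graph-construction, pruning, and component-identification steps. It assumes the per-layer SAE activation matrices for a single prompt and the per-feature activation densities are already available; the names `sae_acts` and `density` are placeholders, while the thresholds (k = 5, τ_corr = 0.9, density cutoff 0.01) are the values stated above.

```python
import numpy as np
import networkx as nx

K = 5               # top-k features kept per token position and layer
TAU_CORR = 0.9      # Pearson-correlation threshold for drawing an edge
MAX_DENSITY = 0.01  # features denser than this are treated as generic

def select_features(sae_acts, k=K):
    """sae_acts[layer] is an (n_tokens, d_sae) array of SAE activations.
    Returns, per layer, the union of the top-k features at each token."""
    selected = {}
    for layer, acts in enumerate(sae_acts):
        top = np.argsort(-acts, axis=1)[:, :k]          # (n_tokens, k)
        selected[layer] = sorted(set(top.ravel().tolist()))
    return selected

def build_graph(sae_acts, selected, density,
                tau=TAU_CORR, max_density=MAX_DENSITY):
    """Directed graph over (layer, feature) nodes: an edge links features in
    adjacent layers whose activation patterns over the prompt's tokens are
    strongly correlated. High-density (generic) features are pruned here."""
    g = nx.DiGraph()
    for layer in range(len(sae_acts) - 1):
        for i in selected[layer]:
            if density.get((layer, i), 0.0) > max_density:
                continue
            a = sae_acts[layer][:, i]
            for j in selected[layer + 1]:
                if density.get((layer + 1, j), 0.0) > max_density:
                    continue
                b = sae_acts[layer + 1][:, j]
                if a.std() == 0 or b.std() == 0:
                    continue                             # constant pattern, skip
                r = np.corrcoef(a, b)[0, 1]
                if r > tau:
                    g.add_edge((layer, i), (layer + 1, j), weight=float(r))
    return g

def semantic_components(graph):
    """Weakly connected components of the pruned graph, each a candidate
    semantic module (a set of (layer, feature) pairs)."""
    return [sorted(c) for c in nx.weakly_connected_components(graph)]
```

Using an off-the-shelf weakly-connected-components routine here is equivalent to the BFS traversal described above.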
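The causal tests and the steering experiments both come down to editing a component's SAE feature activations inside the forward pass and measuring how far the next-token distribution moves. The following is a rough PyTorch-style sketch, not the paper's actual implementation: the hook placement, the `sae.encode`/`sae.decode` interface, and the choice to ablate by zeroing versus amplify by pinning features to a positive value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def edit_features(resid, sae, component_feats, scale=0.0, set_value=None):
    """Ablate (scale=0) or amplify the SAE features of one component.
    For steering toward a component that is inactive on the prompt,
    set_value pins its features to a fixed positive activation instead.
    Only the decoded delta is added back, so the SAE's reconstruction
    error does not disturb the rest of the residual stream."""
    feats = sae.encode(resid)                       # (batch, seq, d_sae)
    edited = feats.clone()
    if set_value is not None:
        edited[..., component_feats] = set_value
    else:
        edited[..., component_feats] *= scale
    return resid + sae.decode(edited) - sae.decode(feats)

def steering_hook(sae, component_feats, **kwargs):
    """Forward hook that rewrites a layer's residual-stream output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = edit_features(hidden, sae, component_feats, **kwargs)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

@torch.no_grad()
def kl_shift(model, input_ids, hooked_layers):
    """KL divergence between the next-token distributions of the clean and
    the intervened forward pass. hooked_layers is a list of
    (layer_module, hook_fn) pairs."""
    clean = model(input_ids).logits[:, -1]
    handles = [layer.register_forward_hook(h) for layer, h in hooked_layers]
    try:
        steered = model(input_ids).logits[:, -1]
    finally:
        for handle in handles:
            handle.remove()
    return F.kl_div(F.log_softmax(steered, dim=-1),
                    F.log_softmax(clean, dim=-1),
                    log_target=True, reduction="batchmean")
```

For steering, one would register a zeroing hook for the in-prompt component and an amplifying hook for the target component, then check whether the generated answer switches accordingly.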
Experiments focused on country-relation tasks (capital, currency, language) for ten countries. The key findings are:
- Semantic Coherence and Context Consistency: The method successfully identified components that were semantically coherent (features within a component related to a specific country or relation) and consistent across different prompts. For instance, the "China" component was similar whether the prompt was about its capital or currency, and "language" components were similar across different countries (a simple overlap measure for this kind of comparison is sketched after this list).
- Component Steering:
  - Country Steering: Ablating the component for the country mentioned in the prompt (e.g., "China") and amplifying a target country component (e.g., "Nigeria") successfully made the LLM generate answers corresponding to the target country (e.g., outputting "Abuja" for "What is the capital of China?"). This achieved a 96% success rate.
  - Relation Steering: Similarly, ablating an in-prompt relation (e.g., "capital") and amplifying a target relation (e.g., "currency") changed the model's output accordingly (e.g., answering with "Yuan" for a question about China's capital when the currency component was amplified). This had a 92% success rate.
  - Composite Steering: Manipulating both country and relation components simultaneously also worked. For example, asking for the "capital of China" but ablating the "China" and "capital" components while amplifying the "Nigeria" and "currency" components led the model to output "Naira", with a 90% overall success rate for such composite steering.
- Component Organization:
  - Country components (representing concrete entities) tend to emerge from the very first layer of the network.
  - Relation components (representing more abstract concepts) are concentrated in later layers.
  - Within relation components, features from deeper layers tend to have a stronger causal impact on the model's output.
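One simple way to quantify the cross-prompt consistency mentioned above (not necessarily the metric used in the paper) is the Jaccard overlap between the feature sets of two components:

```python
def component_overlap(component_a, component_b):
    """Jaccard overlap between two components, each a collection of
    (layer, feature) pairs: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(component_a), set(component_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```

Under such a measure, the "China" component recovered from a capital prompt and from a currency prompt should score high, while components for different countries should not.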
The authors conclude that these findings suggest a modular organization of knowledge in LLMs. The method offers a lightweight approach for mechanistic interpretability and targeted model manipulation without requiring exhaustive circuit tracing. They highlight the potential for improved transparency and control of LLMs.
Limitations include the focus on country-relation tasks, the use of standard JumpReLU SAEs (rather than more advanced versions), and the exclusion of high-density features from the main analysis, whose role remains to be explored. The work was primarily done on Gemma 2 2B, with similar results replicated on Gemma 2 9B.