Class-level Semantic Modulation (CSM)
- Class-level Semantic Modulation is a technique that isolates and manipulates semantic modules in LLMs using sparse autoencoders, coactivation clustering, and linear interventions.
- The method employs ablation, amplification, and composition operations to steer semantic classes like 'country' and 'relation' with high precision.
- Empirical evaluations, such as achieving up to 96% steering success in country-related tasks, highlight CSM’s potential for controllable and interpretable LLM outputs.
Class-level Semantic Modulation (CSM) refers to the method of identifying, isolating, and directly manipulating modular semantic components within LLMs, enabling targeted interventions at the class level (e.g., "country," "currency," "relation"). This approach leverages sparse autoencoders (SAEs) to recover monosemantic neural features, cluster them via coactivation patterns, and construct composable "semantic modules." Through lightweight, layer-wise manipulations—ablation, amplification, and composition—CSM enables precise, context-consistent semantic steering, with empirical demonstration of high efficacy on tasks such as country-relation transformations in LLMs (Deng et al., 22 Jun 2025).
1. Identification of Semantic Modules using Sparse Autoencoders
Semantic modules are recovered from transformer LLMs by training SAEs on the residual-stream activations at each layer $\ell$. The SAE architecture employed is a JumpReLU autoencoder with code dimension $m$ and encoder and decoder matrices $W_{\text{enc}}$ and $W_{\text{dec}}$. The encoding is passed through a nonnegativity-enforcing JumpReLU, and the SAE is trained to minimize a reconstruction-plus-sparsity objective $\mathcal{L}(x) = \lVert x - W_{\text{dec}} z \rVert_2^2 + \lambda \lVert z \rVert_0$ with sparsity penalty $\lambda$, yielding rare, semantically pure feature activations.
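The encode–decode step and training objective can be sketched in a few lines; this is a minimal illustrative implementation, not the paper's training code, and all names ($W_{\text{enc}}$, $\theta$, `lam`, etc.) are placeholders:

```python
# Minimal sketch of a JumpReLU SAE forward pass and loss.
# Illustrative only: real SAEs use learned matrices, biases, and
# gradient-based training with a straight-through L0 estimator.

def jumprelu(pre, theta):
    # JumpReLU: pass each pre-activation through only if it exceeds
    # its (learned) threshold theta; otherwise emit exactly 0.
    return [z if z > t else 0.0 for z, t in zip(pre, theta)]

def matvec(W, x):
    # Plain matrix-vector product over nested lists.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sae_loss(x, W_enc, b_enc, W_dec, theta, lam):
    # Encode: z = JumpReLU(W_enc @ x + b_enc)
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = jumprelu(pre, theta)
    # Decode: x_hat = W_dec @ z
    x_hat = matvec(W_dec, z)
    # Reconstruction error plus lambda-weighted L0 sparsity penalty.
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    l0 = sum(1 for zi in z if zi != 0.0)
    return recon + lam * l0, z
```

The L0 count is what makes surviving activations rare: any feature below its threshold contributes nothing to either term.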
A small set of target prompts is used to obtain per-prompt SAE activations. Coactivation clustering is then performed over the resulting activation tensor:
- Top-$k$ features are selected per token across prompts.
- Nodes—all selected features at each layer—are linked into a directed graph in which edges connect features at adjacent layers whose Pearson correlation exceeds a threshold $\tau$.
- High-density (generic) features, as judged by activation density from Neuronpedia, are pruned.
- Weakly connected components in the resulting graph are extracted via BFS.
Empirically, for each prompt set, 70 such components are recovered, with 2–3 exhibiting dominant causal impact on the output distribution (Deng et al., 22 Jun 2025).
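The graph-building and component-extraction steps above can be sketched with pure-Python helpers; the data layout (`traces` keyed by `(layer, feature)` pairs) and the helper names are assumptions for illustration:

```python
import math
from collections import deque

def pearson(a, b):
    # Pearson correlation of two activation profiles across prompts;
    # returns 0.0 for constant (zero-variance) profiles.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb) if va and vb else 0.0

def topk_features(acts, k):
    # acts: {feature_id: activation} for one token; keep the k strongest.
    return set(sorted(acts, key=acts.get, reverse=True)[:k])

def coactivation_components(traces, tau):
    # traces: {(layer, feature): [activation per prompt]}.
    # Link features at adjacent layers whose profiles correlate above
    # tau, then extract weakly connected components via BFS.
    nodes = list(traces)
    adj = {v: [] for v in nodes}
    for (l1, f1) in nodes:
        for (l2, f2) in nodes:
            if l2 == l1 + 1 and pearson(traces[(l1, f1)], traces[(l2, f2)]) > tau:
                adj[(l1, f1)].append((l2, f2))
                adj[(l2, f2)].append((l1, f1))  # weak connectivity: drop direction
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        comp, queue = set(), deque([v])
        seen.add(v)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components
```

Each returned component is a candidate semantic module spanning adjacent layers.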
2. Representation and Projection of Semantic Modules
For a semantic class $c$ (e.g., "China" or "capital"), a module $M_c$ is constructed by intersecting component feature indices across prompts: $M_c = \bigcap_p S_p^{(c)}$, where $S_p^{(c)}$ is the feature-index set recovered from prompt $p$. A binary mask $m_c \in \{0,1\}^m$ encodes membership in $M_c$. The module's contribution at any residual $x$ is $x_{M_c} = \sum_{i \in M_c} z_i \, d_i$, where $z_i$ is the SAE activation of feature $i$ and $d_i$ the corresponding decoder direction. Equivalently, the module projection operator in code space is $P_{M_c} = \sum_{i \in M_c} e_i e_i^{\top}$, where $e_i$ are standard basis vectors. This representation enables direct linear manipulation of the corresponding class in the model's residual stream.
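A minimal sketch of module construction and projection, assuming feature indices and a decoder matrix are already available (all names here are hypothetical):

```python
# Build a class module by intersecting per-prompt feature-index sets,
# then apply the module projection in SAE code space.

def build_module(component_indices_per_prompt):
    # Keep only the feature indices recovered from every prompt.
    sets = [set(s) for s in component_indices_per_prompt]
    return set.intersection(*sets)

def module_mask(module, m):
    # Binary mask over the m-dimensional SAE code.
    return [1.0 if i in module else 0.0 for i in range(m)]

def project_code(z, mask):
    # P_M z: zero out every feature outside the module.
    return [zi * mi for zi, mi in zip(z, mask)]

def module_contribution(z, mask, W_dec):
    # Decode only the module's features back into the residual stream:
    # x_M = W_dec @ (P_M z).
    z_m = project_code(z, mask)
    return [sum(row[i] * z_m[i] for i in range(len(z_m))) for row in W_dec]
```

Because the projection is a diagonal 0/1 operator in code space, module edits stay strictly linear in the residual stream.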
3. Interventional Operations: Ablation, Amplification, and Composition
CSM admits three key operations at the module level:
- Ablation: To suppress class $c$ at layer $\ell$, the module's decoded contribution is subtracted from the residual, $x' = x - \alpha \sum_{i \in M_c} z_i d_i$, with ablation coefficient $\alpha$.
- Amplification: To enhance class $c$, the contribution is rescaled upward, $x' = x + \beta \sum_{i \in M_c} z_i d_i$, with gain $\beta$ (e.g., $0.45$ for relation modules).
These manipulations are layer-wise and applied simultaneously to all layers in which the module is detected.
- Composition: Multiple classes (e.g., in-prompt and target country–relation pairs) are steered by superposition, $x' = x - \alpha \sum_{i \in M_{c_{\text{src}}}} z_i d_i + \beta \sum_{j \in M_{c_{\text{tgt}}}} z_j d_j$.
No nonlinearity is added at the intervention; the model proceeds with its standard forward pass.
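The three operations can be sketched as pure linear edits on a residual vector, assuming the SAE code `z` and module masks are already computed; `alpha` and `beta` are hypothetical gains:

```python
# Sketch of the three CSM interventions at one layer. The residual x
# is edited by adding or subtracting the decoded module contribution;
# no nonlinearity is applied afterwards.

def decode(z, W_dec):
    # W_dec @ z over nested lists.
    return [sum(row[i] * z[i] for i in range(len(z))) for row in W_dec]

def ablate(x, z, mask, W_dec, alpha):
    # x' = x - alpha * W_dec @ (P_M z): suppress the class.
    delta = decode([zi * mi for zi, mi in zip(z, mask)], W_dec)
    return [xi - alpha * di for xi, di in zip(x, delta)]

def amplify(x, z, mask, W_dec, beta):
    # x' = x + beta * W_dec @ (P_M z): boost the class.
    delta = decode([zi * mi for zi, mi in zip(z, mask)], W_dec)
    return [xi + beta * di for xi, di in zip(x, delta)]

def compose(x, z, mask_src, mask_tgt, W_dec, alpha, beta):
    # Superpose: ablate the in-prompt class, amplify the target class.
    return amplify(ablate(x, z, mask_src, W_dec, alpha),
                   z, mask_tgt, W_dec, beta)
```

In practice these edits would be applied at every layer in which the module was detected, with the model's forward pass otherwise unchanged.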
4. Empirical Causal Analysis and Layer-wise Findings
CSM effectiveness is evaluated via steering success rate: the proportion of prompts for which the top next-token matches the intended output under intervention. Reported metrics on Gemma 2 2B are 96% for country steering, 92% for relation steering, and 90% for compound country–relation steering.
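The metric itself is a simple top-1 match rate; a sketch (function name assumed):

```python
def steering_success_rate(top_tokens, targets):
    # top_tokens: the argmax next-token per prompt under intervention;
    # targets: the intended output per prompt.
    hits = sum(1 for p, t in zip(top_tokens, targets) if p == t)
    return hits / len(targets)
```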
Causal importance of features is quantified by the KL divergence $D_{\mathrm{KL}}(p \,\|\, p^{\text{ablate}})$ between the original and ablated next-token distributions. Features with larger KL under ablation are considered causally important. Layer analysis shows:
- Country modules predominantly appear in the earliest layers (8/10 cases), often spanning early to mid layers.
- Relation modules are typically localized to deeper layers.
- In relation modules, deeper layers yield systematically greater causal impact (positive correlation between layer depth and post-ablation KL), a pattern not observed for country modules (Deng et al., 22 Jun 2025).
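The causal-importance score can be sketched directly from its definition; the ranking helper and its input layout are assumptions for illustration:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) over next-token distributions; eps guards log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def rank_by_causal_impact(p_orig, ablated_dists):
    # ablated_dists: {feature_id: next-token distribution after
    # ablating that feature}. Larger KL against the original
    # distribution indicates greater causal importance.
    impact = {f: kl_divergence(p_orig, q)
              for f, q in ablated_dists.items()}
    return sorted(impact, key=impact.get, reverse=True)
```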
5. Generality, Modularity, and Future Directions
The CSM framework demonstrates that class-level knowledge in LLMs is modular: sparse, context-stable, and linearly composable. Only a handful of SAE features per module need to be manipulated at each layer for effective class-level modulation.
The methodological pipeline involves (i) training sparse autoencoders, (ii) assembling prompt-specific coactivation graphs, (iii) density-based pruning, (iv) connected component extraction, and (v) direct linear intervention. This framework generalizes to any well-defined semantic class for which prompt exemplars can be written. A plausible implication is that CSM applies equally to sentiment (positive/negative), tense (past/future), or object-category distinctions, potentially enabling broad-spectrum, low-overhead plug-and-play semantic steering in LLMs.
6. Broader Implications and Limitations
These findings support the conclusion that knowledge about entire semantic classes is encoded in modular, composable, and context-consistent neural features. Class-level Semantic Modulation thus offers a transparent, interpretable, and computationally efficient mechanism for intervening in LLM behavior—overriding, redirecting, or combining semantic information in a controlled manner. However, its practical scope depends on robust SAE feature recovery, reliable module correspondence, and the tractability of module identification for arbitrarily complex semantic classes (Deng et al., 22 Jun 2025).