Class-level Semantic Modulation (CSM)
- Class-level Semantic Modulation is a technique that isolates and manipulates semantic modules in LLMs using sparse autoencoders, coactivation clustering, and linear interventions.
- The method employs ablation, amplification, and composition operations to steer semantic classes like 'country' and 'relation' with high precision.
- Empirical evaluations, such as achieving up to 96% steering success in country-related tasks, highlight CSM’s potential for controllable and interpretable LLM outputs.
Class-level Semantic Modulation (CSM) refers to the method of identifying, isolating, and directly manipulating modular semantic components within LLMs, enabling targeted interventions at the class level (e.g., "country," "currency," "relation"). This approach leverages sparse autoencoders (SAEs) to recover monosemantic neural features, cluster them via coactivation patterns, and construct composable "semantic modules." Through lightweight, layer-wise manipulations—ablation, amplification, and composition—CSM enables precise, context-consistent semantic steering, with empirical demonstration of high efficacy on tasks such as country-relation transformations in LLMs (Deng et al., 22 Jun 2025).
1. Identification of Semantic Modules using Sparse Autoencoders
Semantic modules are recovered from transformer LLMs by training SAEs on the residual-stream activations at each layer $\ell$. The SAE architecture employed is a JumpReLU autoencoder with code dimension $m$ and encoder and decoder matrices $W_{\text{enc}}$ and $W_{\text{dec}}$. The encoding is passed through a nonnegativity-enforcing JumpReLU, and the SAE is trained to minimize a reconstruction-plus-sparsity objective $\mathcal{L}(x) = \lVert x - W_{\text{dec}} z \rVert_2^2 + \lambda \lVert z \rVert_0$ with sparsity penalty $\lambda$, yielding rare, semantically pure feature activations.
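The encode–decode step and training objective can be sketched in a few lines; this is a minimal illustrative implementation, not the paper's training code, and all names ($W_{\text{enc}}$, $\theta$, `lam`, etc.) are placeholders:

```python
# Minimal sketch of a JumpReLU SAE forward pass and loss.
# Illustrative only: real SAEs use learned matrices, biases, and
# gradient-based training with a straight-through L0 estimator.

def jumprelu(pre, theta):
    # JumpReLU: pass each pre-activation through only if it exceeds
    # its (learned) threshold theta; otherwise emit exactly 0.
    return [z if z > t else 0.0 for z, t in zip(pre, theta)]

def matvec(W, x):
    # Plain matrix-vector product over nested lists.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sae_loss(x, W_enc, b_enc, W_dec, theta, lam):
    # Encode: z = JumpReLU(W_enc @ x + b_enc)
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = jumprelu(pre, theta)
    # Decode: x_hat = W_dec @ z
    x_hat = matvec(W_dec, z)
    # Reconstruction error plus lambda-weighted L0 sparsity penalty.
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    l0 = sum(1 for zi in z if zi != 0.0)
    return recon + lam * l0, z
```

The L0 count is what makes surviving activations rare: any feature below its threshold contributes nothing to either term.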
A small set of target prompts is used to obtain per-prompt SAE activations. Coactivation clustering is then performed over the resulting activation tensor:
- Top-$k$ features are selected per token across prompts.
- Nodes—all selected features at each layer—are linked into a directed graph in which edges connect features at adjacent layers whose Pearson correlation exceeds a threshold $\tau$.
- High-density (generic) features, as judged by activation density from Neuronpedia, are pruned.
- Weakly connected components in the resulting graph are extracted via BFS.
Empirically, for each prompt set, 70 such components are recovered, with 2–3 exhibiting dominant causal impact on the output distribution (Deng et al., 22 Jun 2025).
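The graph-building and component-extraction steps above can be sketched with pure-Python helpers; the data layout (`traces` keyed by `(layer, feature)` pairs) and the helper names are assumptions for illustration:

```python
import math
from collections import deque

def pearson(a, b):
    # Pearson correlation of two activation profiles across prompts;
    # returns 0.0 for constant (zero-variance) profiles.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb) if va and vb else 0.0

def topk_features(acts, k):
    # acts: {feature_id: activation} for one token; keep the k strongest.
    return set(sorted(acts, key=acts.get, reverse=True)[:k])

def coactivation_components(traces, tau):
    # traces: {(layer, feature): [activation per prompt]}.
    # Link features at adjacent layers whose profiles correlate above
    # tau, then extract weakly connected components via BFS.
    nodes = list(traces)
    adj = {v: [] for v in nodes}
    for (l1, f1) in nodes:
        for (l2, f2) in nodes:
            if l2 == l1 + 1 and pearson(traces[(l1, f1)], traces[(l2, f2)]) > tau:
                adj[(l1, f1)].append((l2, f2))
                adj[(l2, f2)].append((l1, f1))  # weak connectivity: drop direction
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        comp, queue = set(), deque([v])
        seen.add(v)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components
```

Each returned component is a candidate semantic module spanning adjacent layers.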
2. Representation and Projection of Semantic Modules
For a semantic class $c$ (e.g., "China" or "capital"), a module $M_c$ is constructed by intersecting component feature indices across prompts: $M_c = \bigcap_p S_p^{(c)}$, where $S_p^{(c)}$ is the feature-index set recovered from prompt $p$. A binary mask $m_c \in \{0,1\}^m$ encodes membership in $M_c$. The module's contribution at any residual $x$ is $x_{M_c} = \sum_{i \in M_c} z_i \, d_i$, where $z_i$ is the SAE activation of feature $i$ and $d_i$ the corresponding decoder direction. Equivalently, the module projection operator in code space is $P_{M_c} = \sum_{i \in M_c} e_i e_i^{\top}$, where $e_i$ are standard basis vectors. This representation enables direct linear manipulation of the corresponding class in the model's residual stream.
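A minimal sketch of module construction and projection, assuming feature indices and a decoder matrix are already available (all names here are hypothetical):

```python
# Build a class module by intersecting per-prompt feature-index sets,
# then apply the module projection in SAE code space.

def build_module(component_indices_per_prompt):
    # Keep only the feature indices recovered from every prompt.
    sets = [set(s) for s in component_indices_per_prompt]
    return set.intersection(*sets)

def module_mask(module, m):
    # Binary mask over the m-dimensional SAE code.
    return [1.0 if i in module else 0.0 for i in range(m)]

def project_code(z, mask):
    # P_M z: zero out every feature outside the module.
    return [zi * mi for zi, mi in zip(z, mask)]

def module_contribution(z, mask, W_dec):
    # Decode only the module's features back into the residual stream:
    # x_M = W_dec @ (P_M z).
    z_m = project_code(z, mask)
    return [sum(row[i] * z_m[i] for i in range(len(z_m))) for row in W_dec]
```

Because the projection is a diagonal 0/1 operator in code space, module edits stay strictly linear in the residual stream.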
3. Interventional Operations: Ablation, Amplification, and Composition
CSM admits three key operations at the module level:
- Ablation: To suppress class $c$ at layer $\ell$, the module's decoded contribution is subtracted from the residual, $x' = x - \alpha \sum_{i \in M_c} z_i d_i$, with ablation coefficient $\alpha$.
- Amplification: To enhance class $c$, the contribution is rescaled upward, $x' = x + \beta \sum_{i \in M_c} z_i d_i$, with gain $\beta$ (e.g., $0.45$ for relation modules).
These manipulations are layer-wise and applied simultaneously to all layers in which the module is detected.
- Composition: Multiple classes (e.g., in-prompt and target country–relation pairs) are steered by superposition, $x' = x - \alpha \sum_{i \in M_{c_{\text{src}}}} z_i d_i + \beta \sum_{j \in M_{c_{\text{tgt}}}} z_j d_j$.
No nonlinearity is added at the intervention; the model proceeds with its standard forward pass.
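The three operations can be sketched as pure linear edits on a residual vector, assuming the SAE code `z` and module masks are already computed; `alpha` and `beta` are hypothetical gains:

```python
# Sketch of the three CSM interventions at one layer. The residual x
# is edited by adding or subtracting the decoded module contribution;
# no nonlinearity is applied afterwards.

def decode(z, W_dec):
    # W_dec @ z over nested lists.
    return [sum(row[i] * z[i] for i in range(len(z))) for row in W_dec]

def ablate(x, z, mask, W_dec, alpha):
    # x' = x - alpha * W_dec @ (P_M z): suppress the class.
    delta = decode([zi * mi for zi, mi in zip(z, mask)], W_dec)
    return [xi - alpha * di for xi, di in zip(x, delta)]

def amplify(x, z, mask, W_dec, beta):
    # x' = x + beta * W_dec @ (P_M z): boost the class.
    delta = decode([zi * mi for zi, mi in zip(z, mask)], W_dec)
    return [xi + beta * di for xi, di in zip(x, delta)]

def compose(x, z, mask_src, mask_tgt, W_dec, alpha, beta):
    # Superpose: ablate the in-prompt class, amplify the target class.
    return amplify(ablate(x, z, mask_src, W_dec, alpha),
                   z, mask_tgt, W_dec, beta)
```

In practice these edits would be applied at every layer in which the module was detected, with the model's forward pass otherwise unchanged.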
4. Empirical Causal Analysis and Layer-wise Findings
CSM effectiveness is evaluated via steering success rate: the proportion of prompts for which the top next-token matches the intended output under intervention. Reported metrics on Gemma 2 2B are 96% for country steering, 92% for relation steering, and 90% for compound country–relation steering.
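The metric itself is a simple top-1 match rate; a sketch (function name assumed):

```python
def steering_success_rate(top_tokens, targets):
    # top_tokens: the argmax next-token per prompt under intervention;
    # targets: the intended output per prompt.
    hits = sum(1 for p, t in zip(top_tokens, targets) if p == t)
    return hits / len(targets)
```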
Causal importance of features is quantified by the KL divergence $D_{\mathrm{KL}}(p \,\|\, p^{\text{ablate}})$ between the original and ablated next-token distributions. Features with larger KL under ablation are considered causally important. Layer analysis shows:
- Country modules predominantly appear in the earliest layers (8/10 cases), often spanning early to mid layers.
- Relation modules are typically localized to deeper layers.
- In relation modules, deeper layers yield systematically greater causal impact (positive correlation between layer depth and post-ablation KL), a pattern not observed for country modules (Deng et al., 22 Jun 2025).
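The causal-importance score can be sketched directly from its definition; the ranking helper and its input layout are assumptions for illustration:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) over next-token distributions; eps guards log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def rank_by_causal_impact(p_orig, ablated_dists):
    # ablated_dists: {feature_id: next-token distribution after
    # ablating that feature}. Larger KL against the original
    # distribution indicates greater causal importance.
    impact = {f: kl_divergence(p_orig, q)
              for f, q in ablated_dists.items()}
    return sorted(impact, key=impact.get, reverse=True)
```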
5. Generality, Modularity, and Future Directions
The CSM framework demonstrates that class-level knowledge in LLMs is modular: sparse, context-stable, and linearly composable. Only a handful of SAE features per module need to be manipulated at each layer for effective class-level modulation.
The methodological pipeline involves (i) training sparse autoencoders, (ii) assembling prompt-specific coactivation graphs, (iii) density-based pruning, (iv) connected component extraction, and (v) direct linear intervention. This framework generalizes to any well-defined semantic class for which prompt exemplars can be written. A plausible implication is that CSM applies equally to sentiment (positive/negative), tense (past/future), or object-category distinctions, potentially enabling broad-spectrum, low-overhead plug-and-play semantic steering in LLMs.
6. Broader Implications and Limitations
These findings support the conclusion that knowledge about entire semantic classes is encoded in modular, composable, and context-consistent neural features. Class-level Semantic Modulation thus offers a transparent, interpretable, and computationally efficient mechanism for intervening in LLM behavior—overriding, redirecting, or combining semantic information in a controlled manner. However, its practical scope depends on robust SAE feature recovery, reliable module correspondence, and the tractability of module identification for arbitrarily complex semantic classes (Deng et al., 22 Jun 2025).