Class-level Semantic Modulation (CSM)

Updated 28 March 2026
  • Class-level Semantic Modulation is a technique that isolates and manipulates semantic modules in LLMs using sparse autoencoders, coactivation clustering, and linear interventions.
  • The method employs ablation, amplification, and composition operations to steer semantic classes like 'country' and 'relation' with high precision.
  • Empirical evaluations, including up to 96% steering success on country-related tasks, highlight CSM’s potential for controllable and interpretable LLM outputs.

Class-level Semantic Modulation (CSM) refers to the method of identifying, isolating, and directly manipulating modular semantic components within LLMs, enabling targeted interventions at the class level (e.g., "country," "currency," "relation"). This approach leverages sparse autoencoders (SAEs) to recover monosemantic neural features, cluster them via coactivation patterns, and construct composable “semantic modules.” Through lightweight, layer-wise manipulations—ablation, amplification, and superposition—CSM enables precise, context-consistent semantic steering, with empirical demonstration of high efficacy on tasks such as country-relation transformations in LLMs (Deng et al., 22 Jun 2025).

1. Identification of Semantic Modules using Sparse Autoencoders

Semantic modules are recovered from transformer LLMs by training SAEs on each residual-stream activation $x_\ell \in \mathbb{R}^{d_\text{model}}$ at layer $\ell$. The SAE architecture employed is a JumpReLU autoencoder of code dimension $d_\text{sae} = 16{,}384$, with encoder and decoder matrices $W_\text{enc} \in \mathbb{R}^{d_\text{sae} \times d_\text{model}}$ and $W_\text{dec} \in \mathbb{R}^{d_\text{model} \times d_\text{sae}}$. The encoding $\phi_\ell = W_\text{enc} x_\ell$ is passed through a nonnegativity-enforcing JumpReLU, and the SAE is trained to minimize
$$L(W_\text{enc}, W_\text{dec}) = \mathbb{E}\left[\|x_\ell - W_\text{dec}\,\phi_\ell\|_2^2\right] + \lambda \|\phi_\ell\|_1$$
with sparsity penalty $\lambda$, yielding rare, semantically pure feature activations.
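
The following is a minimal PyTorch sketch of this objective, not the authors' implementation; the fixed JumpReLU threshold, the bias-free linear maps, and the toy dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal JumpReLU sparse autoencoder sketch (not the paper's implementation)."""
    def __init__(self, d_model: int, d_sae: int = 16_384, threshold: float = 0.0):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae, bias=False)
        self.W_dec = nn.Linear(d_sae, d_model, bias=False)
        self.threshold = threshold  # JumpReLU jump point; learned per-feature in practice

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.W_enc(x)
        # JumpReLU: pass values above the threshold, zero everything else
        return pre * (pre > self.threshold)

    def forward(self, x: torch.Tensor):
        phi = self.encode(x)      # sparse code phi_l
        x_hat = self.W_dec(phi)   # reconstruction of the residual activation
        return x_hat, phi

def sae_loss(x, x_hat, phi, lam: float = 1e-3):
    """Reconstruction MSE plus L1 sparsity penalty, matching the objective above."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = phi.abs().sum(dim=-1).mean()
    return recon + lam * sparsity

# Toy usage on random stand-ins for one layer's residual-stream activations.
sae = JumpReLUSAE(d_model=2304)   # 2304 = Gemma 2 2B residual width (assumption)
x = torch.randn(8, 2304)
x_hat, phi = sae(x)
sae_loss(x, x_hat, phi).backward()
```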

A small set of target prompts is used to obtain per-prompt SAE activations. Coactivation clustering is then performed over the resulting tensor $\Phi_\ell \in \mathbb{R}^{T \times d_\text{sae}}$:

  1. Top-$k$ features (typically $k = 5$) are selected per token across prompts.
  2. Nodes—all selected features at each layer—are linked into a directed graph where edges connect features at adjacent layers with Pearson correlation above $\tau_\text{corr} = 0.9$.
  3. High-density (generic) features, as judged by activation density from Neuronpedia, are pruned; only features with $d_{\ell,i} \leq \tau_\text{density} = 0.01$ are retained.
  4. Weakly connected components in the resulting graph are extracted via BFS.

Empirically, for each prompt set, $\sim 70$ such components are recovered, with 2–3 exhibiting dominant causal impact on the output distribution (Deng et al., 22 Jun 2025).
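
A compact sketch of the clustering steps above is given below, assuming per-layer SAE code matrices and a per-feature density table (a stand-in for Neuronpedia lookups) are already in hand; for simplicity it prunes high-density features before building edges and uses networkx for the component extraction.

```python
import numpy as np
import networkx as nx

def coactivation_components(phis, densities, k=5, tau_corr=0.9, tau_density=0.01):
    """Sketch of steps 1-4 above.
    phis: list of [T, d_sae] SAE code matrices, one per layer, over the prompt set.
    densities: list of [d_sae] per-feature activation densities (Neuronpedia stand-in).
    Returns the weakly connected components of the cross-layer coactivation graph."""
    G = nx.DiGraph()
    kept = []
    for layer, (phi, dens) in enumerate(zip(phis, densities)):
        topk = np.argsort(-phi, axis=1)[:, :k]        # step 1: top-k features per token
        feats = np.unique(topk)
        feats = feats[dens[feats] <= tau_density]     # step 3: drop high-density (generic) features
        kept.append(feats)
        G.add_nodes_from((layer, int(i)) for i in feats)
    for layer in range(len(phis) - 1):                # step 2: link correlated features in adjacent layers
        for i in kept[layer]:
            for j in kept[layer + 1]:
                a, b = phis[layer][:, i], phis[layer + 1][:, j]
                if a.std() > 0 and b.std() > 0 and np.corrcoef(a, b)[0, 1] > tau_corr:
                    G.add_edge((layer, int(i)), (layer + 1, int(j)))
    # step 4: weakly connected components (networkx performs the BFS-style traversal)
    return [set(c) for c in nx.weakly_connected_components(G)]
```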

2. Representation and Projection of Semantic Modules

For a semantic class $c$ (e.g., "China" or "capital"), a module is constructed by intersecting component feature indices across prompts, yielding an index set $S_c$. A binary mask $m^c \in \{0,1\}^{d_\text{sae}}$ encodes membership in $S_c$. The module’s contribution at any residual $z_\ell$ is
$$P_c(z_\ell) = W_\text{dec}\left(m^c \odot \phi_\ell\right), \qquad \phi_\ell = W_\text{enc}\, z_\ell.$$
Equivalently, the module projection operator is
$$P_c = W_\text{dec}\left(\sum_{i \in S_c} E_i E_i^\top\right) W_\text{enc},$$
where $E_i$ are standard basis vectors. This representation enables direct linear manipulation of the corresponding class in the model's residual stream.
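
As a sketch, the projection can be written directly in terms of the SAE weight matrices; bias terms and the JumpReLU nonlinearity are omitted, following the linearized form above, and the tensor names are illustrative.

```python
import torch

def module_projection(z, W_enc, W_dec, S_c):
    """P_c(z) = W_dec (m^c ⊙ phi) with phi = W_enc z, restricted to the class-c features S_c.
    z: [..., d_model]; W_enc: [d_sae, d_model]; W_dec: [d_model, d_sae]; S_c: list of feature indices."""
    phi = z @ W_enc.T                                  # SAE encoding of the residual activation
    mask = torch.zeros(W_enc.shape[0], dtype=z.dtype, device=z.device)
    mask[S_c] = 1.0                                    # binary membership mask m^c
    return (phi * mask) @ W_dec.T                      # decode only the class-c features

# S_c is obtained by intersecting a component's feature indices across prompts, e.g.:
# S_c = sorted(set.intersection(*(set(comp) for comp in per_prompt_components)))
```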

3. Interventional Operations: Ablation, Amplification, and Composition

CSM admits three key operations at the module level:

  • Ablation: To suppress class $c$ at layer $\ell$,

$$z'_\ell = z_\ell - \alpha_c P_c(z_\ell)$$

with ablation coefficient $\alpha_c$ (e.g., $\alpha_c = 0.10$ for country modules).

  • Amplification: To enhance class $c$,

$$z'_\ell = z_\ell + \beta_c P_c(z_\ell)$$

with amplification coefficient $\beta_c$ (e.g., $\beta_c = 0.10$ for country modules, $\beta_c = 0.45$ for relation modules).

These manipulations are layer-wise and applied simultaneously to all layers in which the module is detected.

  • Composition: Multiple classes (e.g., in-prompt and target country–relation pairs) are steered by superposition:

$$z''_\ell = z_\ell - \alpha_{c_\text{in}} P_{c_\text{in}}(z_\ell) - \alpha_{r_\text{in}} P_{r_\text{in}}(z_\ell) + \beta_{c_\text{tgt}} P_{c_\text{tgt}}(z_\ell) + \beta_{r_\text{tgt}} P_{r_\text{tgt}}(z_\ell)$$

No nonlinearity is added at the intervention; the model proceeds with its standard forward pass.
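
A minimal sketch of all three operations as a single residual-stream edit is shown below; attaching it to the model's forward pass (e.g., via PyTorch forward hooks on the layers where a module is detected) is framework-specific and omitted, and the variable names and example coefficients are illustrative.

```python
import torch

def apply_csm(z, W_enc, W_dec, modules, coeffs):
    """Apply z' = z + sum_c coeff_c * P_c(z) at one layer's residual stream.
    Negative coefficients ablate a class, positive ones amplify it; listing several
    (class, coefficient) pairs realizes the composition rule above.
    modules: class name -> feature index list S_c; coeffs: class name -> signed coefficient."""
    phi = z @ W_enc.T                                  # shared SAE encoding of the residual
    z_new = z
    for name, S_c in modules.items():
        mask = torch.zeros(W_enc.shape[0], dtype=z.dtype, device=z.device)
        mask[S_c] = 1.0
        z_new = z_new + coeffs[name] * ((phi * mask) @ W_dec.T)
    return z_new

# Illustrative composition: ablate the in-prompt country/relation pair, amplify the target pair
# (feature index sets and the relation ablation coefficient are placeholders).
# z_steered = apply_csm(z, W_enc, W_dec,
#                       modules={"c_in": S_c_in, "r_in": S_r_in, "c_tgt": S_c_tgt, "r_tgt": S_r_tgt},
#                       coeffs={"c_in": -0.10, "r_in": -0.10, "c_tgt": +0.10, "r_tgt": +0.45})
```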

4. Empirical Causal Analysis and Layer-wise Findings

CSM effectiveness is evaluated via steering success rate: the proportion of prompts for which the top next-token matches the intended output under intervention. Reported metrics on Gemma 2 2B are 96% for country steering, 92% for relation steering, and 90% for compound country–relation steering.
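
As a small sketch (function and variable names assumed), the steering success rate is simply exact-match accuracy of the intervened model's top next token against the intended token:

```python
def steering_success_rate(top_tokens_under_intervention, intended_tokens):
    """Fraction of prompts whose top-1 next token under intervention equals the intended token."""
    hits = sum(p == t for p, t in zip(top_tokens_under_intervention, intended_tokens))
    return hits / len(intended_tokens)
```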

Causal importance of features is quantified by the KL divergence between original and ablated next-token distributions, $\mathrm{KL}\left(P_\text{orig} \parallel P_\text{ablate}\right)$. Features with larger KL under ablation are considered causally important. Layer analysis shows:

  • Country modules predominantly appear in the earliest layer (8/10 in $\ell = 1$), often spanning early to mid layers.
  • Relation modules are typically localized to deeper layers ($\ell \geq 14$).
  • In relation modules, deeper layers yield systematically greater causal impact (positive correlation between layer depth and post-ablation KL), a pattern not observed for country modules (Deng et al., 22 Jun 2025).
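
The causal-importance metric above can be computed directly from the two next-token logit vectors; the following PyTorch sketch (names assumed) illustrates it.

```python
import torch
import torch.nn.functional as F

def ablation_kl(logits_orig, logits_ablate):
    """KL(P_orig || P_ablate) over the next-token distribution; a larger value means the
    ablated module mattered more for the model's original prediction."""
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_ablate, dim=-1)
    return torch.sum(log_p.exp() * (log_p - log_q), dim=-1)
```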

5. Generality, Modularity, and Future Directions

The CSM framework demonstrates that class-level knowledge in LLMs is modular: sparse, context-stable, and linearly composable. Only a handful (often $\ll 10$) of SAE features per module need to be manipulated at each layer for effective class-level modulation.

The methodological pipeline involves (i) training sparse autoencoders, (ii) assembling prompt-specific coactivation graphs, (iii) density-based pruning, (iv) connected component extraction, and (v) direct linear intervention. This framework generalizes to any well-defined semantic class for which prompt exemplars can be written. A plausible implication is that CSM applies equally to sentiment (positive/negative), tense (past/future), or object-category distinctions, potentially enabling broad-spectrum, low-overhead plug-and-play semantic steering in LLMs.

6. Broader Implications and Limitations

These findings support the conclusion that knowledge about entire semantic classes is encoded in modular, composable, and context-consistent neural features. Class-level Semantic Modulation thus offers a transparent, interpretable, and computationally efficient mechanism for intervening in LLM behavior—overriding, redirecting, or combining semantic information in a controlled manner. However, its practical scope depends on robust SAE feature recovery, reliable module correspondence, and the tractability of module identification for arbitrarily complex semantic classes (Deng et al., 22 Jun 2025).
