FGAA: Feature Guided Activation Additions

Updated 8 March 2026
  • FGAA is a method for interpretable activation steering that uses a sparse autoencoder’s latent space to precisely guide large language model behaviors.
  • It improves upon prior techniques like CAA and SAE-TS by programmatically filtering and selecting top-k, semantically relevant features to reduce unintended effects.
  • The approach maintains grammatical and semantic coherence under moderate steering scales, as evidenced by improved Behavioral-Coherence Scores in model evaluations.

Feature Guided Activation Additions (FGAA) is a method for interpretable, precise, and effective activation steering of large language models (LLMs). FGAA operates in the sparse latent space of an autoencoder fitted to model activations, enabling targeted behavioral interventions by selecting a carefully filtered subset of semantically interpretable features and mapping these back to model space through a linear effect approximator. FGAA competes with and extends prior approaches such as Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS) by enhancing controllability, interpretability, and minimization of unintended side effects during steering (Soo et al., 17 Jan 2025).

1. Theoretical Foundation and Mathematical Formulation

FGAA is defined by its workflow for constructing steering vectors in the latent feature space of a sparse autoencoder (SAE) fitted to a model's hidden states. Let $h_l(x) \in \mathbb{R}^{d_\text{model}}$ denote the residual stream at layer $l$ for input $x$, and $f: \mathbb{R}^{d_\text{model}} \to \mathbb{R}^{d_\text{sae}}$ the SAE encoder.

The method starts with two disjoint sets: $X^+$ for positive (desired behavior) and $X^-$ for negative (undesired behavior) examples. Compute the contrastive difference in SAE feature space:

$$v_\text{diff} = \frac{1}{|X^+|}\sum_{x \in X^+} f(h_l(x)) - \frac{1}{|X^-|} \sum_{x \in X^-} f(h_l(x))$$
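
As a concrete illustration, the contrastive difference is straightforward to compute. The sketch below is schematic rather than the authors' implementation: `sae_encode` is a stand-in for any SAE encoder, and the activation tensors are assumed to be pre-extracted layer-$l$ residual streams.

```python
import torch

def contrastive_sae_diff(sae_encode, acts_pos, acts_neg):
    """Compute v_diff: mean SAE feature activation over positive examples
    minus the mean over negative examples.

    sae_encode: callable mapping (n, d_model) activations -> (n, d_sae) features
    acts_pos:   layer-l residual streams for X+, shape (|X+|, d_model)
    acts_neg:   layer-l residual streams for X-, shape (|X-|, d_model)
    """
    return sae_encode(acts_pos).mean(dim=0) - sae_encode(acts_neg).mean(dim=0)
```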

To produce a sparse, high-precision steering target, three filtering stages are applied:

  1. Density Filtering: Zero SAE features with activation density $\rho(i)$ above a threshold $\theta = 0.01$.
  2. BOS Feature Suppression: Zero any feature associated with the beginning-of-sequence (BOS) token.
  3. Top-k Selection: Keep the $n_1$ largest positive activations (and, if desired, the $n_2$ largest negative ones).

Let $v_\text{target}$ be the resulting L₁-normalized sparse SAE feature vector.
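
Continuing the sketch above, and under the same caveats, the three filtering stages plus normalization might look like the following; the density estimates and BOS mask would in practice be precomputed over a reference corpus. The helper name is hypothetical.

```python
def build_sae_target(v_diff, density, bos_mask, n1, n2=0, theta=0.01):
    """Apply density filtering, BOS suppression, top-k selection, and
    L1 normalization to obtain v_target.

    density:  per-feature activation density rho(i) on a reference corpus
    bos_mask: boolean mask of features that fire on the BOS token
    """
    v = v_diff.clone()
    v[density > theta] = 0.0                                   # 1. density filtering
    v[bos_mask] = 0.0                                          # 2. BOS suppression
    target = torch.zeros_like(v)
    pos_idx = torch.topk(v.clamp(min=0.0), n1).indices         # 3. top-n1 positive features
    target[pos_idx] = v[pos_idx]                               #    (assumes >= n1 survive filtering)
    if n2 > 0:
        neg_idx = torch.topk((-v).clamp(min=0.0), n2).indices  # optional top-n2 negatives
        target[neg_idx] = v[neg_idx]
    return target / target.abs().sum()                         # L1 normalization
```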

The pre-trained linear effect approximator with weights $M \in \mathbb{R}^{d_\text{model} \times d_\text{sae}}$ and bias $b \in \mathbb{R}^{d_\text{sae}}$ maps $x \mapsto y = xM + b$, where $y$ estimates the change in SAE feature space induced by perturbing $h_l$ by $x$. FGAA's final steering vector $v_\text{opt}$ in model space is:

$$v_\text{opt} = \frac{v_\text{target} W}{\|v_\text{target} W\|_2} - \frac{b W}{\|b W\|_2}, \quad \text{where } W = M^\top$$

At each generation step, the hidden state is updated as

$$h_l \leftarrow h_l + \alpha\, v_\text{opt}$$

with $\alpha$ the user-tunable steering scale.
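
Put together, the mapping through the effect approximator and the residual-stream intervention can be sketched as below. The hook-based application is one plausible implementation, assuming a HuggingFace-style decoder that exposes `model.model.layers`; none of these helper names come from the paper.

```python
def fgaa_steering_vector(M, b, v_target):
    """v_opt = normalize(v_target W) - normalize(b W), with W = M^T."""
    v = v_target @ M.T          # (d_sae,) @ (d_sae, d_model) -> (d_model,)
    bias = b @ M.T
    return v / v.norm() - bias / bias.norm()

def apply_fgaa(model, layer, v_opt, alpha):
    """Register a forward hook adding alpha * v_opt to the layer's residual stream."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v_opt.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)
```

Removing the returned hook handle (`handle.remove()`) restores unsteered behavior, which is convenient when sweeping over $\alpha$.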

2. Comparison with Prior Activation Steering Techniques

FGAA addresses limitations in existing steering strategies as summarized in the table below:

| Method      | Steering Domain         | Interpretability          | Feature Selection             |
|-------------|-------------------------|---------------------------|-------------------------------|
| CAA         | Hidden-state space      | None                      | None                          |
| SAE decoder | SAE (single feature)    | Feature labeling possible | Manual                        |
| SAE-TS      | SAE (single feature)    | Feature labeling possible | Single feature                |
| FGAA        | Sparse SAE latent space | High                      | Top-k, programmatic filtering |

CAA creates a noninterpretable mixture of hidden directions without selection, leading to entangled behavioral effects. SAE-TS is limited by its dependence on single feature selection, which cannot capture composite concepts. FGAA extends SAE-TS by supporting multi-feature targets and rigorously filtering for interpretable, behaviorally specific features. This filtering process allows FGAA to achieve higher precision and maintain output quality (Soo et al., 17 Jan 2025).

3. Impact on Model Output and Coherence

FGAA is distinguished by its capacity to maintain grammatical and semantic coherence during strong behavioral steering. Filtering and pruning (removal of non-concept-specific, high-density, and BOS-associated features) reduce spurious activations that otherwise degrade fluency and correctness. L₁ normalization of $v_\text{target}$ further ensures balanced perturbations and avoids the instability of "one-hot" interventions in feature space.

Empirical BCS (Behavioral-Coherence Score) results show that FGAA-steered outputs remain robust even at moderate $\alpha$, in contrast to CAA, where unfiltered interventions introduce artifacts at lower steering scales (Soo et al., 17 Jan 2025). A theoretical interpretation is that steering primarily along interpretable, low-density concept axes reduces side effects and aligns with human-understandable behaviors.

4. Experimental Evaluation and Quantitative Results

Evaluations using Gemma-2-2B and Gemma-2-9B models applied FGAA at layer 12 for nine distinct behavioral tasks (e.g., Anger, Christian Evangelist, French, Praise). Metrics included:

  • Behavioral Score (B): measures the extent of the steering effect via external model ratings, scaled to $[0, 1]$
  • Coherence Score (C): quantifies grammatical and semantic correctness
  • BCS = B × C (combined metric)
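
Because BCS is multiplicative, a method scores well only if it both steers strongly and stays fluent; weakness on either axis collapses the product. With illustrative numbers (not values from the paper):

```python
B = 0.72     # behavioral score in [0, 1]
C = 0.65     # coherence score in [0, 1]
bcs = B * C  # 0.468: strong steering cannot compensate for broken text
```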

Results (mean BCS across tasks, selected):

| Model      | CAA    | SAE    | SAE-TS | FGAA   |
|------------|--------|--------|--------|--------|
| Gemma-2-2B | 0.2201 | 0.1404 | 0.3650 | 0.4702 |
| Gemma-2-9B | 0.2729 | 0.2267 | 0.3467 | 0.3979 |

FGAA yields relative improvements over CAA by ∼115% (2B) and ∼45% (9B), and over SAE-TS by ∼29% (2B) and ∼15% (9B). Pareto analyses demonstrate that for a fixed coherence loss, FGAA attains much higher behavioral scores, pushing the steering-performance frontier upward.
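
The quoted relative improvements follow directly from the table; a quick check against the Gemma-2-2B means:

```python
means = {"CAA": 0.2201, "SAE": 0.1404, "SAE-TS": 0.3650, "FGAA": 0.4702}
for base in ("CAA", "SAE-TS"):
    gain = (means["FGAA"] - means[base]) / means[base]
    print(f"FGAA vs {base}: +{gain:.0%}")  # +114% and +29%, matching the quoted figures up to rounding
```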

5. Trade-offs, Limitations, and Prospective Research

FGAA inherits a set of critical trade-offs intrinsic to activation steering. Model perplexity and task accuracy remain near baseline for $\alpha \lesssim 40$; above this point, model capabilities degrade across all evaluated steering approaches. FGAA displays no artificial "low-$\alpha$ performance bump" found in baselines, confirming that its vectors correspond to focused, concept-specific transformations.

The method's effectiveness depends on the properties of the underlying SAE: wider, more monosemantic autoencoders could extend FGAA's precision. The feature-count parameters ($n_1$, $n_2$) are task-dependent and currently require manual or semi-empirical adjustment. No explicit regularization or saliency measure for filtered features is currently offered, suggesting room for automated hyperparameter selection and interpretability diagnostics.

Future directions include integrating constrained optimization to further reduce the loss of model generality at large $\alpha$, multi-objective steering to better limit side effects, and deployment in safety-critical settings such as bias mitigation, jailbreak prevention, or sycophancy modulation (Soo et al., 17 Jan 2025).

6. Relation to Recent Developments in Activation Editing

A related approach in vision-language models, the AFTER framework, exploits factual-guided editing using both general factual steering vectors and query-adaptive offsets (Wang et al., 5 Jan 2026). This framework targets specific failure modes such as object hallucination by directly shifting activation patterns toward empirically grounded, textually validated semantic facts. A plausible implication is that combining the feature-guided selection principles of FGAA with factual, query-adaptive editing may further improve precision and reliability in both language-only and multimodal settings.

FGAA thus represents a significant step toward interpretable, tunable, and reliable behavioral control in LLMs, offering both practical and theoretical advantages over prior generic or single-feature approaches.
