
CAST: Conditional Activation Steering in LMs

Updated 26 February 2026
  • Conditional Activation Steering (CAST) is a method that applies context-sensitive, rule-based interventions in the hidden activation space of language models using programmatic gates and hypernetworks.
  • It leverages techniques such as difference-in-means, PCA, and kNN-based gating to extract and apply steering vectors only when input conditions or hidden states match predefined criteria.
  • CAST improves control over output behaviors like content refusal and instruction adherence, yielding higher compliance rates and minimal side effects compared to global activation modifications.

Conditional Activation Steering (CAST) is a family of inference-time interventions for LLMs and masked diffusion LLMs (MDLMs) that enables context-sensitive, rule-based, or instruction-specific control over generation by dynamically manipulating internal activations. Unlike unconditional (global) activation steering, which indiscriminately alters model behavior across all inputs, CAST selectively applies steering vectors—additive or multiplicative interventions in hidden activation space—only when the input’s contextual features or hidden states match user-defined or learned criteria. This mechanism, realized via programmatic gates, conditional controllers, or architectural hypernetworks, offers precise control over refusal behaviors, adherence to output constraints, robust cross-task/low-resource transfer, and steering of logical reasoning, with minimal computational overhead and without model parameter updates.

1. Motivation and Conceptual Foundations

The principal motivation for CAST arises from the limitations of standard activation steering, where a single steering vector applied globally at some layer and strength $\alpha$ can induce desired behaviors (e.g., refusal, output format) but lacks specificity, leading to degraded performance (e.g., over-refusal, loss of compliance on benign inputs) (Lee et al., 2024, Hegazy et al., 22 May 2025). In content moderation, legal compliance, or instruction-following, interventions must be activation- or context-dependent, mapping user- or application-specified "if-then" rules onto model-internal control mechanisms (Lee et al., 2024, Stolfo et al., 2024). CAST achieves this by introducing learnable or rule-defined gates that evaluate activations for input category membership or task relevance, conditioning the subsequent steering operation on that detection.

2. Extraction of Steering and Condition Vectors

CAST operationalizes interventions through vectors in hidden space, extracted using contrastive datasets:

  • For any target behavior or category, prompts are split into $D^+$ (exhibiting the target) and $D^-$ (not exhibiting the target) (Lee et al., 2024, Shnaidman et al., 30 Dec 2025, Stolfo et al., 2024, Tang et al., 17 Jul 2025).
  • For each sample, layer-wise residual stream activations $h^{(\ell)}_i$ are recorded.
  • Vectors may be derived by:
    • Difference in means: $\hat{v}^{(\ell)} = \frac{\hat{\mu}_+^{(\ell)} - \hat{\mu}_-^{(\ell)}}{\lVert \hat{\mu}_+^{(\ell)} - \hat{\mu}_-^{(\ell)} \rVert_2}$, where $\hat{\mu}_\pm^{(\ell)}$ are the empirical means across $D^\pm$ (Shnaidman et al., 30 Dec 2025, Stolfo et al., 2024).
    • Principal component analysis (PCA): On mean-centered pooled activations, the first principal component defines the condition or behavior direction $c_\ell$ or $v_\ell$ (Lee et al., 2024).
    • Delta of in-context vs. zero-shot activations: computed as $C_L = \frac{1}{n} \sum_{i=1}^n (a^{fs}_{i,L} - a^{zs}_{i,L})$ for cross-task transfer (Tang et al., 17 Jul 2025).
    • Instruction vectors: Paired samples with and without target instructions yield difference vectors for inferring instruction-following (e.g., format, brevity, word inclusion) (Stolfo et al., 2024).
    • Hypernetworks: A parametric network $H_\theta(s, x, a_\ell(x))$ that directly maps an arbitrary steering prompt and context to a custom steering vector $\Delta_s^x$ (Sun et al., 3 Jun 2025).

Table: Extraction Strategies (across representative works)

| Approach | Data Requirement | Extraction Method |
|---|---|---|
| Contrastive means | Labeled $D^+$, $D^-$ | Difference-in-means |
| PCA / condition vector | Labeled $D^+$/$D^-$ | First principal component |
| In-context delta | Task demonstrations | Few-shot vs. zero-shot activations |
| Instructional steering | Contrasted pairs | Prompt with/without instruction |
| HyperSteer (editor's term) | Diverse prompt set | Learned hypernetwork |
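As a minimal sketch of the difference-in-means strategy above (function name and toy data are illustrative, not drawn from any of the cited codebases):

```python
import numpy as np

def diff_in_means_vector(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Unit-norm difference-in-means steering vector for one layer.

    acts_pos: (n, d) activations from prompts exhibiting the target (D+).
    acts_neg: (m, d) activations from prompts not exhibiting it (D-).
    """
    v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)   # mu_+ minus mu_-
    return v / np.linalg.norm(v)                        # normalize to unit L2 norm

# Toy usage: 8 samples per class in a 16-dimensional hidden space.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, size=(8, 16))    # stand-in for D+ activations
neg = rng.normal(0.0, 0.1, size=(8, 16))    # stand-in for D- activations
v_hat = diff_in_means_vector(pos, neg)
```

In practice the activation matrices would be harvested from the residual stream at a chosen layer via forward passes over the contrastive prompt sets.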

3. Conditional Gating and Programmatic Rule Specification

At inference, application of the steering vector is gated by a learned or rule-based function:

  • Condition similarity: A gate $g_c^{(\ell)}(x)$ at layer $\ell$ compares the cosine similarity $\operatorname{sim}(h^{(\ell)}(x), c_\ell)$ against a threshold $\theta_\ell$ using a comparator ($>$ or $<$). The behavior vector is applied only if $g_c^{(\ell)}(x) = 1$ (Lee et al., 2024).
  • Multi-layer OR: Gating across multiple layers is combined via logical OR ($\max$) for sensitivity (Lee et al., 2024).
  • kNN-based gating: For ambiguous cases or "unresponsive" models, CAST* retrieves the $k$ nearest labeled neighbors in activation space and applies the steering direction according to the majority neighborhood label (Valentino et al., 18 May 2025).
  • Controller architectures: A lightweight MLP or deeper controller observes concatenated activations across select layers and predicts a continuous steering scale $\alpha$ (and possibly per-layer weights $w_\ell$), enabling nuanced, input-dependent steering (Hegazy et al., 22 May 2025). The MLP is trained with a regression loss to match target behaviors on labeled data.
  • Instruction-driven and compositional triggers: Multiple instruction vectors, each for a discrete output constraint, can be combined and applied at separate layers for modular control (Stolfo et al., 2024).
  • Hypernetworks: Instead of relying on static vectors, a hypernetwork conditions on both steering prompt and base prompt, producing a per-instance steering vector (Sun et al., 3 Jun 2025). This is especially relevant for large prompt sets and task generalization scenarios.
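The condition-similarity gate above can be sketched in a few lines; this is a simplified single-layer version with illustrative names, assuming precomputed condition and behavior vectors:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def conditionally_steer(h, cond_vec, behav_vec, theta, alpha, comparator=">"):
    """Add alpha * behav_vec to hidden state h only when the gate fires."""
    sim = cosine_sim(h, cond_vec)
    fires = (sim > theta) if comparator == ">" else (sim < theta)
    return h + alpha * behav_vec if fires else h

# Gate fires: hidden state aligned with the condition direction is steered.
cond = np.array([1.0, 0.0])
behav = np.array([0.0, 1.0])
steered = conditionally_steer(np.array([1.0, 0.0]), cond, behav, theta=0.5, alpha=2.0)

# Gate stays closed: an orthogonal hidden state passes through unchanged.
unchanged = conditionally_steer(np.array([0.0, 1.0]), cond, behav, theta=0.5, alpha=2.0)
```

The multi-layer OR variant would evaluate this gate at several layers and steer if any of them fires.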

4. Modes of Application: Steering Domains and Architectures

CAST is applicable to a range of architectures and control scenarios:

  • Autoregressive LLMs: Typically, additive interventions are applied at selected residual-stream layers and specific token positions (e.g., last input token, autoregressive generation steps). Selective steering enables fine-grained instruction-following, content refusal, or bias mitigation (Lee et al., 2024, Stolfo et al., 2024, Valentino et al., 18 May 2025).
  • Masked diffusion LLMs (MDLMs): In these, steering operates during each reverse-diffusion step by intercepting post-MLP residual activations and subtracting projections along the learned steering direction, with scope over prompt, response, or both token sets (Shnaidman et al., 30 Dec 2025).
  • Activation scaling: Instead of vector addition, steering can be multiplicative—each selected activation is rescaled by a learned scalar $\theta$, optionally conditioned dynamically via a probe to generalize across prompt lengths (Stoehr et al., 2024).
  • Cross-task and cross-lingual transfer: Steering directions, extracted from high-resource source tasks using contrastive activation deltas, are injected during target-task inference to yield zero-shot-like transfer without context expansion or parameter update (Tang et al., 17 Jul 2025).
  • Generalization to thousands of behaviors: HyperSteer (hypernetwork CAST variant) trains a transformer to produce steering vectors conditioned on steering prompts, showing nearly linear improvement for out-of-distribution prompt generalization as prompt coverage increases (Sun et al., 3 Jun 2025).
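The MDLM-style intervention above amounts to removing (or attenuating) the component of each activation along the learned direction. A minimal sketch, assuming a unit-norm direction and an illustrative strength parameter `lam`:

```python
import numpy as np

def remove_projection(h: np.ndarray, v_hat: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Subtract lam times each row's component along the unit direction v_hat.

    h: (n, d) residual activations; v_hat: (d,) unit-norm steering direction.
    With lam = 1.0 the output rows are exactly orthogonal to v_hat.
    """
    coeff = h @ v_hat                        # (n,) scalar projection per row
    return h - lam * coeff[:, None] * v_hat

# Toy usage: the [3, 4] activation loses its component along [1, 0].
h = np.array([[3.0, 4.0]])
v_hat = np.array([1.0, 0.0])
cleaned = remove_projection(h, v_hat)        # component along v_hat removed
```

In the MDLM setting this operation would run inside each reverse-diffusion step, scoped to the chosen token set (prompt, response, or both).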

5. Empirical Findings and Evaluative Benchmarks

Empirical results across these works establish:

  • Refusal and compliance control: CAST raises harmful prompt refusal rates to 83–90% (from model averages of 40–80%) while keeping false refusals on benign prompts under 6%, outperforming unconditional steering in all tested models (Lee et al., 2024).
  • Instruction compliance: Instruction-following metrics (e.g., format adherence, brevity, word inclusion) improve by 25pp or more, with CAST vectors enabling adherence without explicit instructions (Stolfo et al., 2024).
  • Cross-task transfer gains: Average accuracy in low-resource tasks rises by 2–5 points over strongest in-context prompting methods; cross-lingual accuracy gains range from 44.9–95.2% depending on language (Tang et al., 17 Jul 2025).
  • Reasoning debiasing: kNN-based CAST delivers up to 15% absolute improvement in logical accuracy for models unresponsive to static steering, reducing "content effect" bias by several-fold (Valentino et al., 18 May 2025).
  • Minimal side effects: Faithfulness diagnostics (KL divergence or win-rates) indicate that steering can be highly localized with little impact on unrelated model capabilities (Stoehr et al., 2024, Lee et al., 2024, Hegazy et al., 22 May 2025).
  • Scalability and generalization: HyperSteer matches or outperforms fine-tuned prompt-based steering for both in-distribution and novel prompts, with per-prompt training cost decreasing as the number of target behaviors grows (Sun et al., 3 Jun 2025).

6. Implementation and Practical Considerations

Key implementation guidelines include:

  • Vector computation is limited to lightweight forward passes (e.g., $|D^+| + |D^-| \sim 100$) and requires no parameter update of the base model (Shnaidman et al., 30 Dec 2025).
  • Integration via tensor hooks in frameworks such as PyTorch suffices for most additive and multiplicative interventions.
  • Overhead is minimal: Per-token and per-layer computations scale linearly with batch size and dimension, remaining negligible relative to the base model’s forward pass (Shnaidman et al., 30 Dec 2025, Hegazy et al., 22 May 2025).
  • Hyperparameter selection (steering layers, α\alpha, thresholds) can be automated by grid search or data-driven tuning; practical ranges are documented per model and behavior (Lee et al., 2024, Hegazy et al., 22 May 2025).
  • Compositional steering (injecting multiple vectors at separate layers) is robust to destructive interference if layer assignments are sparse (Stolfo et al., 2024).
  • Deployment requires white-box access: CAST cannot be applied to closed-source APIs that conceal internal activations (Tang et al., 17 Jul 2025, Sun et al., 3 Jun 2025).
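The hook-based integration mentioned above follows the pattern of PyTorch's `module.register_forward_hook`. The sketch below mimics that mechanism in a framework-agnostic, self-contained way (the `HookedLayer` class and its identity weight are illustrative stand-ins for a real transformer block):

```python
import numpy as np

class HookedLayer:
    """Stand-in for a transformer block with forward hooks, mimicking
    PyTorch's module.register_forward_hook pattern (illustrative only)."""

    def __init__(self, weight: np.ndarray):
        self.weight = weight
        self._hooks = []

    def register_forward_hook(self, fn):
        self._hooks.append(fn)

    def forward(self, h: np.ndarray) -> np.ndarray:
        out = h @ self.weight
        for hook in self._hooks:             # a hook may replace the output
            result = hook(self, h, out)
            if result is not None:
                out = result
        return out

# Additive steering hook; a CAST deployment would wrap the addition in the
# conditional gate of Section 3 rather than firing unconditionally.
v_hat = np.array([0.0, 1.0])
alpha = 4.0
layer = HookedLayer(np.eye(2))               # identity weight for the toy example
layer.register_forward_hook(lambda mod, inp, out: out + alpha * v_hat)
steered = layer.forward(np.array([1.0, 2.0]))
```

Because the hook only adds a vector to the layer output, the intervention leaves model parameters untouched, consistent with the overhead figures above.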

7. Limitations and Future Directions

CAST's principal boundaries, noted in the sections above, include its reliance on white-box access to internal activations and the need for per-model, per-behavior tuning of steering layers, strengths, and thresholds.

In summary, Conditional Activation Steering constitutes a rigorously validated, modular, and efficient paradigm for behavioral and output control in foundation models. It provides a scalable alternative to global steering and model fine-tuning while enabling nuanced, rule-based, and contextually sensitive interventions across a variety of open problems in large-scale language generation (Lee et al., 2024, Hegazy et al., 22 May 2025, Sun et al., 3 Jun 2025, Shnaidman et al., 30 Dec 2025, Stolfo et al., 2024, Tang et al., 17 Jul 2025, Stoehr et al., 2024, Valentino et al., 18 May 2025).
