CAST: Conditional Activation Steering in LMs
- Conditional Activation Steering (CAST) is a method that applies context-sensitive, rule-based interventions in the hidden activation space of language models using programmatic gates and hypernetworks.
- It leverages techniques such as difference-in-means, PCA, and kNN-based gating to extract and apply steering vectors only when input conditions or hidden states match predefined criteria.
- CAST improves control over output behaviors like content refusal and instruction adherence, yielding higher compliance rates and minimal side effects compared to global activation modifications.
Conditional Activation Steering (CAST) is a family of inference-time interventions for LLMs and masked diffusion LLMs (MDLMs) that enables context-sensitive, rule-based, or instruction-specific control over generation by dynamically manipulating internal activations. Unlike unconditional (global) activation steering, which indiscriminately alters model behavior across all inputs, CAST selectively applies steering vectors (additive or multiplicative interventions in hidden activation space) only when the input's contextual features or hidden states match user-defined or learned criteria. This mechanism, realized via programmatic gates, conditional controllers, or architectural hypernetworks, offers precise control over refusal behaviors, adherence to output constraints, robust cross-task and low-resource transfer, and steering of logical reasoning, all with minimal computational overhead and without model parameter updates.
1. Motivation and Conceptual Foundations
The principal motivation for CAST arises from the limitations of standard activation steering, where a single steering vector applied globally at some layer and strength can induce desired behaviors (e.g., refusal, output format) but lacks specificity, leading to degraded performance (e.g., over-refusal, loss of compliance on benign inputs) (Lee et al., 2024, Hegazy et al., 22 May 2025). In content moderation, legal compliance, or instruction-following, interventions must be activation- or context-dependent, mapping user- or application-specified "if-then" rules onto model-internal control mechanisms (Lee et al., 2024, Stolfo et al., 2024). CAST achieves this by introducing learnable or rule-defined gates that evaluate activations for input category membership or task relevance, conditioning the subsequent steering operation on that detection.
2. Extraction of Steering and Condition Vectors
CAST operationalizes interventions through vectors in hidden space, extracted using contrastive datasets:
- For any target behavior or category, prompts are split into a positive set $D^+$ (exhibiting the target) and a negative set $D^-$ (not exhibiting the target) (Lee et al., 2024, Shnaidman et al., 30 Dec 2025, Stolfo et al., 2024, Tang et al., 17 Jul 2025).
- For each sample, layer-wise residual stream activations are recorded.
- Vectors may be derived by:
  - Difference in means: $v = \mu^+ - \mu^-$, where $\mu^+$ and $\mu^-$ are the empirical mean activations across $D^+$ and $D^-$ (Shnaidman et al., 30 Dec 2025, Stolfo et al., 2024).
  - Principal component analysis (PCA): On mean-centered pooled activations, the first principal component defines the condition direction $c$ or behavior direction $v$ (Lee et al., 2024).
  - Delta of in-context vs. zero-shot activations: Cast as $v = \bar{h}^{\text{ICL}} - \bar{h}^{\text{ZS}}$, the mean activation difference between few-shot and zero-shot runs, for cross-task transfer (Tang et al., 17 Jul 2025).
- Instruction vectors: Paired samples with and without target instructions yield difference vectors for inferring instruction-following (e.g., format, brevity, word inclusion) (Stolfo et al., 2024).
- Hypernetworks: A parametric network that directly maps an arbitrary steering prompt and context to a custom steering vector (Sun et al., 3 Jun 2025).
Table: Extraction Strategies (across representative works)
| Approach | Data Requirement | Extraction Method |
|---|---|---|
| Contrastive Means | Labeled $D^+$, $D^-$ | Difference-in-means |
| PCA / Condition Vector | Labeled $D^+$ / $D^-$ | 1st Principal Component |
| In-Context Delta | Task demos | Few-shot vs. zero-shot |
| Instructional Steering | Contrasted pairs | Prompt with/without instr |
| HyperSteer (Editor's term) | Diverse prompt set | Learned hypernetwork |
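The two most common extraction strategies above (difference-in-means and PCA) can be sketched in a few lines of NumPy. The shapes, variable names, and toy data below are illustrative assumptions, not taken from any specific implementation:

```python
import numpy as np

def diff_in_means(pos_acts, neg_acts):
    # Steering vector v = mean over D+ minus mean over D-, computed from
    # per-sample residual-stream activations of shape (n_samples, d_model).
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def pca_direction(pos_acts, neg_acts):
    # Condition/behavior direction as the first principal component of
    # the mean-centered pooled activations.
    pooled = np.concatenate([pos_acts, neg_acts], axis=0)
    centered = pooled - pooled.mean(axis=0, keepdims=True)
    # First right-singular vector of the centered data = first PC.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

# Toy contrastive data: two well-separated activation clusters.
rng = np.random.default_rng(0)
pos = rng.normal(+1.0, 0.1, size=(32, 8))   # D+ activations
neg = rng.normal(-1.0, 0.1, size=(32, 8))   # D- activations
v = diff_in_means(pos, neg)
c = pca_direction(pos, neg)
```

On such cleanly separated data, the first principal component and the difference-in-means vector point in essentially the same direction; in practice the two strategies can diverge when within-class variance dominates.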
3. Conditional Gating and Programmatic Rule Specification
At inference, application of the steering vector is gated by a learned or rule-based function:
- Condition similarity: A gate at layer $l$ compares the cosine similarity $\cos(h_l, c)$ between the hidden state and the condition vector to a threshold $\epsilon$, using a comparator ($>$ or $<$). The behavior vector $v$ is applied only if the condition holds (Lee et al., 2024).
- Multi-layer OR: Gating across multiple layers is combined via logical OR ($\lor$) for sensitivity (Lee et al., 2024).
- kNN-based gating: For ambiguity or "unresponsive" models, CAST retrieves nearest labeled neighbors in activation space, applying the steering direction based on majority/neighborhood label (Valentino et al., 18 May 2025).
- Controller architectures: A lightweight MLP or deeper controller observes concatenated activations across select layers and predicts a continuous steering scale $\alpha$ (and possibly per-layer weights $w_l$), enabling nuanced, input-dependent steering (Hegazy et al., 22 May 2025). The MLP is trained with a regression loss to match target behaviors on labeled data.
- Instruction-driven and compositional triggers: Multiple instruction vectors, each for a discrete output constraint, can be combined and applied at separate layers for modular control (Stolfo et al., 2024).
- Hypernetworks: Instead of relying on static vectors, a hypernetwork conditions on both steering prompt and base prompt, producing a per-instance steering vector (Sun et al., 3 Jun 2025). This is especially relevant for large prompt sets and task generalization scenarios.
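A minimal sketch of the simpler gating mechanisms above (cosine-threshold condition check with multi-layer OR, a kNN fallback, and the gated additive intervention). Function names, the fixed threshold, and the toy vectors are illustrative assumptions:

```python
import numpy as np

def _cos(a, b):
    return float(a @ b) / (float(np.linalg.norm(a)) * float(np.linalg.norm(b)) + 1e-8)

def or_gate(hidden_by_layer, cond_vec, threshold, layers):
    # Fire if cos(h_l, c) exceeds the threshold at ANY monitored layer.
    return any(_cos(hidden_by_layer[l], cond_vec) > threshold for l in layers)

def knn_gate(h, labeled_acts, labels, k=3):
    # Majority label among the k nearest labeled activations
    # (used when similarity gating is unreliable).
    dists = np.linalg.norm(labeled_acts - h, axis=1)
    nearest = np.argsort(dists)[:k]
    return bool(labels[nearest].sum() > k / 2)   # labels in {0, 1}

def conditional_steer(h, behavior_vec, fired, alpha=4.0):
    # Additive intervention applied only when the gate fires.
    return h + alpha * behavior_vec if fired else h

# Toy example: condition direction along the first axis.
c = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
hidden = {10: np.array([0.9, 0.1, 0.0]),   # matches condition
          20: np.array([0.0, 0.0, 1.0])}   # does not
fired = or_gate(hidden, c, threshold=0.5, layers=[10, 20])
steered = conditional_steer(hidden[10], v, fired)
```

Because the OR combiner fires on any matching layer, widening `layers` trades precision for sensitivity, which mirrors the multi-layer gating trade-off described above.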
4. Modes of Application: Steering Domains and Architectures
CAST is applicable to a range of architectures and control scenarios:
- Autoregressive LLMs: Typically, additive interventions are applied at selected residual-stream layers and specific token positions (e.g., last input token, autoregressive generation steps). Selective steering enables fine-grained instruction-following, content refusal, or bias mitigation (Lee et al., 2024, Stolfo et al., 2024, Valentino et al., 18 May 2025).
- Masked diffusion LLMs (MDLMs): In these, steering operates during each reverse-diffusion step by intercepting post-MLP residual activations and subtracting projections along the learned steering direction, with scope over prompt, response, or both token sets (Shnaidman et al., 30 Dec 2025).
- Activation scaling: Instead of vector addition, steering can be multiplicative—each selected activation is rescaled by a learned scalar $\alpha$, optionally conditioned dynamically via a probe to generalize across prompt lengths (Stoehr et al., 2024).
- Cross-task and cross-lingual transfer: Steering directions, extracted from high-resource source tasks using contrastive activation deltas, are injected during target-task inference to yield zero-shot-like transfer without context expansion or parameter update (Tang et al., 17 Jul 2025).
- Generalization to thousands of behaviors: HyperSteer (hypernetwork CAST variant) trains a transformer to produce steering vectors conditioned on steering prompts, showing nearly linear improvement for out-of-distribution prompt generalization as prompt coverage increases (Sun et al., 3 Jun 2025).
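The MDLM-style projection removal and the multiplicative scaling variant reduce to simple vector operations. A plain-NumPy sketch with illustrative names (actual implementations intercept the post-MLP residual stream at each reverse-diffusion step):

```python
import numpy as np

def remove_projection(h, v, scale=1.0):
    # Subtract a (scaled) component of h along the unit steering
    # direction v_hat, as in the MDLM intervention described above.
    v_hat = v / np.linalg.norm(v)
    return h - scale * (h @ v_hat) * v_hat

def activation_scale(h, alpha):
    # Multiplicative steering: rescale the selected activation.
    return alpha * h

h = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
h_ablated = remove_projection(h, v)        # component along v removed
h_scaled = activation_scale(h, alpha=0.5)  # uniformly rescaled
```

With `scale=1.0` the result is exactly orthogonal to the steering direction; fractional scales attenuate rather than ablate the behavior.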
5. Empirical Findings and Evaluative Benchmarks
Empirical results across these works establish:
- Refusal and compliance control: CAST raises harmful prompt refusal rates to 83–90% (from model averages of 40–80%) while keeping false refusals on benign prompts under 6%, outperforming unconditional steering in all tested models (Lee et al., 2024).
- Instruction compliance: Instruction-following metrics (e.g., format adherence, brevity, word inclusion) improve by 25pp or more, with CAST vectors enabling adherence without explicit instructions (Stolfo et al., 2024).
- Cross-task transfer gains: Average accuracy in low-resource tasks rises by 2–5 points over the strongest in-context prompting methods; cross-lingual accuracy gains range from 44.9% to 95.2% depending on the language (Tang et al., 17 Jul 2025).
- Reasoning debiasing: kNN-based CAST delivers up to 15% absolute improvement in logical accuracy for models unresponsive to static steering, reducing "content effect" bias by several-fold (Valentino et al., 18 May 2025).
- Minimal side effects: Faithfulness diagnostics (KL divergence or win-rates) indicate that steering can be highly localized with little impact on unrelated model capabilities (Stoehr et al., 2024, Lee et al., 2024, Hegazy et al., 22 May 2025).
- Scalability and generalization: HyperSteer matches or outperforms fine-tuned prompt-based steering for both in-distribution and novel prompts, with per-prompt training cost decreasing as the number of target behaviors grows (Sun et al., 3 Jun 2025).
6. Implementation and Practical Considerations
Key implementation guidelines include:
- Vector computation requires only lightweight forward passes over the contrastive prompt sets and no parameter updates of the base model (Shnaidman et al., 30 Dec 2025).
- Integration via tensor hooks in frameworks such as PyTorch suffices for most additive and multiplicative interventions.
- Overhead is minimal: Per-token and per-layer computations scale linearly with batch size and dimension, remaining negligible relative to the base model’s forward pass (Shnaidman et al., 30 Dec 2025, Hegazy et al., 22 May 2025).
- Hyperparameter selection (steering layers, scale $\alpha$, thresholds $\epsilon$) can be automated by grid search or data-driven tuning; practical ranges are documented per model and behavior (Lee et al., 2024, Hegazy et al., 22 May 2025).
- Compositional steering (injecting multiple vectors at separate layers) is robust to destructive interference if layer assignments are sparse (Stolfo et al., 2024).
- Deployment requires white-box access: CAST cannot be applied to closed-source APIs that conceal internal activations (Tang et al., 17 Jul 2025, Sun et al., 3 Jun 2025).
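The tensor-hook integration mentioned above can be sketched with a standard PyTorch forward hook. The toy linear layer, thresholds, and random vectors below are assumptions standing in for a transformer's residual stream; returning a tensor from a forward hook replaces the module's output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d = 16
layer = nn.Linear(d, d)          # stand-in for one residual-stream block
cond_vec = torch.randn(d)        # condition direction c (assumed given)
behavior_vec = torch.randn(d)    # behavior direction v (assumed given)
alpha = 4.0

def make_cast_hook(threshold):
    def hook(module, inputs, output):
        # Gate per sample on cos(h, c); add alpha * v where it fires.
        sim = F.cosine_similarity(output, cond_vec.expand_as(output), dim=-1)
        mask = (sim > threshold).unsqueeze(-1).to(output.dtype)
        return output + alpha * mask * behavior_vec
    return hook

x = torch.randn(2, d)
with torch.no_grad():
    baseline = layer(x)
    # A threshold of -2.0 guarantees the gate fires in this demo...
    handle = layer.register_forward_hook(make_cast_hook(-2.0))
    always_on = layer(x)
    handle.remove()
    # ...while a threshold above 1.0 guarantees it never fires.
    handle = layer.register_forward_hook(make_cast_hook(2.0))
    never_on = layer(x)
    handle.remove()
```

In a real deployment the hook would be registered on the chosen residual-stream submodule and the threshold calibrated on held-out contrastive data rather than fixed by hand.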
7. Limitations and Future Directions
CAST exhibits several notable boundaries:
- White-box activation access is essential; deployment on black-box APIs or for multimodal architectures is not addressed (Tang et al., 17 Jul 2025).
- Dependence on vector quality: If the refusal or instruction vector is mis-specified, steering efficacy and safety may degrade (Hegazy et al., 22 May 2025).
- Calibration and classifier drift: Gating functions and controller MLPs may be vulnerable to adversarial attacks or require periodic recalibration (Hegazy et al., 22 May 2025).
- Scope for hybridization: Extensions under discussion include multitask steering (aggregating vectors for safety, honesty, domain compliance), continual steering, and hybrid neuro-symbolic pipelines (Tang et al., 17 Jul 2025, Hegazy et al., 22 May 2025, Shnaidman et al., 30 Dec 2025).
- Model-specific layer and position localization: Empirical studies reveal that optimal intervention points cluster in post-MLP residuals and final third of layers for most LLMs (Shnaidman et al., 30 Dec 2025, Valentino et al., 18 May 2025).
In summary, Conditional Activation Steering constitutes a rigorously validated, modular, and efficient paradigm for behavioral and output control in foundation models, providing scalable alternatives to global steering and model fine-tuning while enabling nuanced, rules-based, and contextually sensitive interventions across a variety of open problems in large-scale language generation (Lee et al., 2024, Hegazy et al., 22 May 2025, Sun et al., 3 Jun 2025, Shnaidman et al., 30 Dec 2025, Stolfo et al., 2024, Tang et al., 17 Jul 2025, Stoehr et al., 2024, Valentino et al., 18 May 2025).