Activation Extraction & Injection Techniques
- Activation extraction and injection are methods for modifying system behavior by capturing and altering intermediate states in both digital and physical systems.
- The process involves extracting activation tensors or signals and re-injecting computed deltas to steer outcomes in domains like language models, accelerator physics, and quantum systems.
- Applications demonstrate improved model accuracy, property control, and robustness with reported metrics such as increased factual accuracy in LLMs and precise control in high-energy devices.
Activation extraction and injection are intervention techniques that modify system behavior by manipulating intermediate states ("activations") within a physical or computational process. In contemporary research, these concepts arise prominently in LLMs, accelerator physics, quantum optoelectronics, and adversarial machine learning. The methods span targeted control, system alignment, property steering, and adversarial manipulation, all implemented through the extraction, differential analysis, and re-injection of internal activations in complex systems.
1. Fundamentals and Diverse Contexts of Activation Extraction and Injection
Activation extraction is the process of selecting, recording, and transforming internal system states that encode information and mediate downstream responses. In artificial neural networks and transformers, activations are the high-dimensional representations embedded in neural units or residual streams at each layer for a given input. Extraction is typically executed by running the model on specific inputs and logging the corresponding activation tensors at designated layers or tokens. In physical systems, such as particle accelerators and quantum devices, “activation” refers to selected excitations or currents that can be absorbed, measured, or transferred by specialized components.
Activation injection refers to the process of deliberately modifying a system’s internal state, either by adding specific vectors (in software) or by applying engineered fields or pulses (in hardware). The injected modifications are typically computed as differences (deltas) between activations associated with desired (reference) and undesired (original or control) outcomes. Injection is performed during system operation, with the aim of shifting the output, alignment, or property of interest in a controlled, interpretable, and sometimes prompt- or property-specific manner.
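In the neural-network setting, both operations reduce to reading and overwriting tensors mid-forward-pass. A minimal sketch using PyTorch forward hooks, with a toy two-layer model and an illustrative scaling factor (both are assumptions for demonstration, not tied to any cited method):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))

captured = {}

def extract_hook(module, inputs, output):
    # Extraction: log the intermediate activation tensor.
    captured["act"] = output.detach().clone()

handle = model[0].register_forward_hook(extract_hook)
x_ref, x_orig = torch.randn(1, 8), torch.randn(1, 8)
model(x_ref)
act_ref = captured["act"]       # activations for the desired condition
out_orig = model(x_orig)
act_orig = captured["act"]      # activations for the original condition
handle.remove()

delta = act_ref - act_orig      # delta between desired and undesired states

def inject_hook(module, inputs, output):
    # Injection: returning a tensor from a forward hook replaces the output.
    return output + 0.5 * delta

model[0].register_forward_hook(inject_hook)
steered = model(x_orig)         # same input, shifted internal state
```

Real steering pipelines log activations from an LLM's residual stream rather than a toy MLP, but the hook mechanics are identical.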
2. Activation Extraction and Injection in LLMs
In modern LLMs, activation extraction and injection underpin a spectrum of property steering and alignment techniques, ranging from factual correctness interventions to safety attacks and adversarial prompt optimization.
Prompt-Specific Steering and Fusion Steering
Fusion Steering introduces a prompt-conditioned activation injection pipeline for improving factual accuracy in LLM-based QA (Chang et al., 28 May 2025). For a prompt p:
- A reference completion is formed by concatenating p with the ground-truth answer and a model-generated explanation.
- Activations are recorded at every layer and token for the enriched prompt; mean reference activations ā_ref are computed by averaging over answer+explanation tokens.
- The original (question-only) mean activation ā_orig is obtained analogously.
- The activation delta Δ = ā_ref − ā_orig quantifies the shift required for semantic enrichment.
- During inference, at each layer l and token, the model injects a scaled delta, h_l ← h_l + α_l·Δ_l, with injection weights α_l (either constant across layers, "full-layer", or group-specific, "segmented") optimized per-prompt via Optuna against a utility balancing token overlap with the ground truth and fluency (measured by normalized perplexity).
- Empirical results demonstrate that segmented steering achieves the highest accuracy on hard QA prompts, vastly outperforming both the unsteered baseline and full-layer steering; under the strict rubric, segmented steering also markedly increases the share of "fully correct" answers.
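The delta computation and segmented injection in the steps above can be sketched numerically; the array shapes, token slice, and group weights below are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
L, T_ref, T_orig, D = 4, 10, 6, 8            # layers, tokens, hidden size

acts_ref = rng.normal(size=(L, T_ref, D))    # prompt + answer + explanation
acts_orig = rng.normal(size=(L, T_orig, D))  # question-only prompt

answer_tokens = slice(6, 10)                 # answer+explanation positions (assumed)
mean_ref = acts_ref[:, answer_tokens].mean(axis=1)   # (L, D)
mean_orig = acts_orig.mean(axis=1)                   # (L, D)
delta = mean_ref - mean_orig                         # per-layer steering delta

# Segmented injection: one weight per layer group (e.g. early vs. late layers).
group_weights = {0: 0.2, 1: 0.2, 2: 0.6, 3: 0.6}
steered = acts_orig + np.stack(
    [group_weights[l] * delta[l] for l in range(L)]
)[:, None, :]  # broadcast each layer's delta over all token positions
```

In practice the per-group weights are what the per-prompt optimizer tunes.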
A plausible implication is that finely tunable, per-prompt, per-layer activation injection allows superior and interpretable control over LLM output quality—for both factuality and fluency—compared to global, single-layer, or static steering.
Multi-Property and Dynamic Activation Steering
Dynamic Activation Composition (Scalena et al., 25 Jun 2024) generalizes extraction and injection for conditioning LLMs on multiple properties (e.g., safety, formality, language). Extraction proceeds by constructing contrastive prompt sets for each property and computing per-layer, per-head activation differences δ_l,h = ā⁺_l,h − ā⁻_l,h, where ā⁺ and ā⁻ are averages over positive and negative exemplars at each generation step t. Injection is performed at each forward pass via h_l,h ← h_l,h + α_t·δ_l,h, with an intervention strength α_t determined dynamically from the KL-divergence between the strongly steered and original next-token distributions, truncated to the highest-probability tokens for stability. This automated modulation of steering strength avoids property-specific hyperparameter sweeps and maintains both property adherence (≥90% accuracy) and fluency (≤5% Δppl) even under simultaneous multi-property steering.
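A sketch of the KL-driven strength schedule, assuming next-token distributions are available as arrays; the truncation size and the KL-to-strength mapping here are illustrative choices rather than the paper's exact recipe:

```python
import numpy as np

def dynamic_alpha(p_steered, p_orig, top=50, alpha_max=2.0):
    """Scale intervention strength by how far full steering moves the
    next-token distribution, truncated for numerical stability."""
    idx = np.argsort(p_steered)[::-1][:top]   # keep highest-probability tokens
    q, p = p_steered[idx], p_orig[idx]
    q, p = q / q.sum(), p / p.sum()           # renormalise truncated dists
    kl = float(np.sum(q * np.log(q / p)))     # KL(q || p) >= 0
    return min(alpha_max, kl)                 # larger shift -> stronger steering
```

When the two distributions coincide (the property is already satisfied), the returned strength falls to zero, which is the self-regulating behaviour the method relies on.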
Trojan Activation Attacks and Security Implications
Trojan Activation Attack (TA²) (Wang et al., 2023) repurposes activation injection for adversarial model manipulation. Here, a fixed steering vector v is computed by contrasting activations of a clean (aligned) model and a teacher (misaligned) model at a selected layer l, averaged over a small set of prompts. At inference, this vector, scaled by a tuned factor λ, is added to the hidden activations, subverting safety alignment and causing large drops in refusal and truthfulness rates. The approach is highly effective, adds minimal overhead, and is invisible to prompt sanitization.
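The fixed steering vector reduces to a simple contrast of mean activations; in the sketch below the random arrays stand in for logged layer-l activations, and the scale is an assumed hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, D = 16, 32
acts_clean = rng.normal(size=(n_prompts, D))             # aligned model, layer l
acts_teacher = rng.normal(loc=0.3, size=(n_prompts, D))  # misaligned teacher

# Fixed steering vector: mean misaligned minus mean aligned activation.
v_steer = acts_teacher.mean(axis=0) - acts_clean.mean(axis=0)

def attack(hidden, scale=4.0):
    # Added to hidden activations at inference; the prompt itself is
    # untouched, which is why prompt sanitization cannot detect it.
    return hidden + scale * v_steer
```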
3. Extraction and Injection in Physical and Quantum Systems
Particle Accelerators: Septum and Kicker Magnets
In accelerator physics (Barnes et al., 2011), “activation extraction” and “injection” occur via a sophisticated combination of electromagnetic septa (for spatial selection) and kicker magnets (for rapid, time-selective deflection):
- The extraction/injection point is equipped with a DC/slow septum to separate circulating and transferred beams in space, followed by a high-speed kicker magnet producing an approximately rectangular field pulse. The rise/fall time of the kicker (10–300 ns) enables precise bunch-selective operation.
- The hardware involves ferrite-loaded, single-turn transmission-line C-core magnets, matched to pulse-forming networks or lines (PFN/PFL), and driven by high-power switches (thyratrons, MOSFETs, GTO thyristors).
- System-level performance metrics include peak current (up to 20 kA), field uniformity, rise/fall time, impedance matching, and operation at repetition rates up to hundreds of Hz.
- Design challenges include minimization of beam-coupling impedance, eddy currents, mechanical stresses, and insulation breakdown.
Quantum Cascade Devices: Intersubband Polaritons
In quantum optoelectronics (Lagrée et al., 2023), activation injection/extraction are embodied in the electrical control and measurement of intersubband (ISB) polaritons within semiconductor microcavities:
- The theoretical model employs a Hamiltonian describing the ISB plasmon mode, the cavity photon, and a voltage-tunable extraction/injection mode, all coupled coherently.
- The density matrix with Lindblad dissipation tracks population and decoherence among bright/dark states.
- Bright–dark selection rules strictly govern which tunneling transitions are allowed, constraining activation transport.
- Both electrical extraction (photocurrent) and possible injection are evaluated via population measurements in steady state, matching experimental mid-IR photocurrent spectra both quantitatively and in bias-dependent evolution of polariton peaks.
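As a toy stand-in for this density-matrix treatment, the following evolves a two-level system under a Lindblad master equation with a single decay channel and reads off the steady-state excited population; the real model's coupled plasmon/cavity/tunnelling Hamiltonian and selection rules are far richer:

```python
import numpy as np

H = np.array([[0.0, 0.1], [0.1, 1.0]], dtype=complex)   # toy Hamiltonian (hbar = 1)
L_op = np.sqrt(0.05) * np.array([[0, 1], [0, 0]], dtype=complex)  # decay |1> -> |0>

def lindblad_rhs(rho):
    # d(rho)/dt = -i[H, rho] + L rho L^dag - (1/2){L^dag L, rho}
    comm = -1j * (H @ rho - rho @ H)
    LrL = L_op @ rho @ L_op.conj().T
    anti = 0.5 * (L_op.conj().T @ L_op @ rho + rho @ L_op.conj().T @ L_op)
    return comm + LrL - anti

rho = np.array([[0, 0], [0, 1]], dtype=complex)  # start fully excited
dt = 0.01
for _ in range(2000):                            # RK4 integration to t = 20
    k1 = lindblad_rhs(rho)
    k2 = lindblad_rhs(rho + 0.5 * dt * k1)
    k3 = lindblad_rhs(rho + 0.5 * dt * k2)
    k4 = lindblad_rhs(rho + dt * k3)
    rho = rho + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

excited_pop = rho[1, 1].real  # population read out as the "extraction" signal
```

The trace of rho is conserved by the Lindblad form, which is a convenient sanity check on any such integrator.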
4. Adversarial Prompt Injection via Activation-Guided MCMC
Recent advances leverage activation extraction and injection to optimize adversarial prompts in LLMs using activation-guided sampling (Li et al., 9 Sep 2025):
- Internal activations at a late transformer layer (layer 25) are extracted from a white-box surrogate for each candidate prompt.
- A mean-pooled representation is used as input to a trained energy-based model (EBM) that predicts attack (prompt) success.
- Token-level proposals are generated using a masked language model, and MCMC sampling is guided by the EBM, accepting prompt modifications that lower the "energy" (i.e., move activations toward higher predicted success).
- The transferability of optimized prompts demonstrates a tight correlation between activation embedding and exploitability across models and tasks, achieving a 49.6% attack success rate across five LLMs, and outperforming both human-crafted and genetic algorithm baselines.
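The sampling loop above can be schematized as Metropolis acceptance on an energy score; the hash-based "embedding" and linear energy below are toy stand-ins for the surrogate activations and trained EBM, and random token swaps stand in for masked-LM proposals:

```python
import math
import random

random.seed(0)
VOCAB = ["alpha", "beta", "gamma", "delta"]

def embed(prompt):
    # Stand-in for a mean-pooled late-layer activation of a surrogate model.
    return [sum(ord(c) for c in tok) % 7 for tok in prompt]

def energy(prompt):
    # Lower energy ~ activation pattern predictive of attack success.
    return float(sum(embed(prompt)))

def mcmc_step(prompt, temperature=1.0):
    cand = list(prompt)
    cand[random.randrange(len(cand))] = random.choice(VOCAB)  # token proposal
    dE = energy(cand) - energy(prompt)
    # Metropolis rule: always take improvements, occasionally accept worse.
    if dE <= 0 or random.random() < math.exp(-dE / temperature):
        return cand
    return prompt

prompt = ["delta", "delta", "delta"]
best = list(prompt)
for _ in range(200):
    prompt = mcmc_step(prompt)
    if energy(prompt) < energy(best):
        best = list(prompt)
```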
5. Implementation Strategies, Scaling, and Interpretability
Layer and Token Resolution
Effective steering requires careful selection of extraction/injection loci: full-network (as in Fusion Steering), segmented (layer groups), or single-layer (as in TA²). Empirical evidence shows that distributed or segmented approaches outperform single-point interventions for complex objectives (Chang et al., 28 May 2025).
Sparse versus Dense Intervention
While most methods inject dense vectors across all neurons or tokens, recent insights point towards sparse, neuron-level representations using tools like Neuronpedia or crosscoders. This promises interpretability, compositionality, and reduced computation.
Optimization and Stability
Optimization of injection strengths is performed per-prompt, typically via Bayesian optimization (e.g., Optuna), and constrained to avoid over-steering and numerical instability. In dynamic settings, information-theoretic (KL-based) modulation allows per-step adaptation, balancing property control and fluency (Scalena et al., 25 Jun 2024).
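Schematically, the per-prompt search reduces to a loop like the following; a plain random search over a toy utility is shown here, whereas in practice Optuna's sampler would drive the search and the utility would combine token overlap and normalized perplexity. All values are illustrative:

```python
import random

random.seed(0)

def utility(alpha):
    # Toy stand-in for (token overlap) - (perplexity penalty); peaks at 0.4.
    return -(alpha - 0.4) ** 2

best_alpha, best_u = None, float("-inf")
for _ in range(50):
    alpha = random.uniform(0.0, 1.0)  # candidate injection strength (bounded
                                      # to avoid over-steering/instability)
    u = utility(alpha)
    if u > best_u:
        best_alpha, best_u = alpha, u
```

Bounding the search range is the simplest guard against the over-steering failure mode the text mentions.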
Security and Defenses
Activation-based attacks can bypass traditional prompt sanitization and require internal validation or randomized perturbations for mitigation, though with trade-offs in model quality or robustness (Wang et al., 2023).
6. Impact, Generalization, and Prospects
These activation-centric intervention methods have demonstrated substantial impact across domains:
- In LLMs, they enable prompt-specific, property-specific, and multi-property steering with measurable improvements in factual accuracy, stylistic control, and alignment or—adversarially—the ability to breach alignment with minimal resource overhead.
- In hardware systems, they underpin precise timing and control in high-energy and quantum devices, ensuring reliability and selectivity amidst physical complexity.
- Prospects for scalable interpretability rest on the evolution from dense, black-box vectors towards sparse, concept-level steering and modular intervention, potentially aligning control, efficiency, and transparency even in very large models (Chang et al., 28 May 2025).
A plausible implication is that activation extraction and injection serve as unifying primitives for both the intentional engineering and adversarial circumvention of complex system behavior, and their evolution will further influence the interpretability and security of both AI and hardware systems.