Cartridge Activation Space Transfer (CAST)
- CAST is a framework that transfers task-specific behaviors, safe steering, and knowledge between LLMs by mapping their internal activation spaces.
- It employs autoencoder and projection methodologies to align hidden states, enabling efficient safety interventions and cross-task knowledge transfer.
- Empirical evaluations show CAST retains 85–95% of adapter performance, offering scalable, architecture-agnostic transfer of functionalities.
Cartridge Activation Space Transfer (CAST) is a methodological framework for transferring task-specific behaviors, safety interventions, and cross-task knowledge between LLMs by learning explicit mappings between their internal activation spaces. Distinct from approaches that operate in parameter or input space, CAST leverages the representational geometry of neural activations—typically specific layers' hidden states—to enable robust, efficient, and architecture-agnostic transfer of learned behaviors. The framework has been instantiated for safety steering, cross-task knowledge transfer, and behavioral portability for adapters such as LoRA, consistently demonstrating superior retention of functionality and practical efficiency.
1. Conceptual Foundations and Rationale
CAST is predicated on the observation of universality and structural consistency in the representation manifolds of modern LLMs. As models converge toward similar latent semantics—even across divergent architectures and modalities—their internal activations become viable substrates for intervention transfer. This universality motivates the direct mapping of activation spaces in lieu of parameter alignment or prompt-level demonstrations, thus bypassing architectural “lock-in” and context window constraints.
In CAST, knowledge, safety features, or adapter-induced behaviors encoded in the hidden states of a "source" model are recast into the activation domain of a "target" model by means of learned projection or autoencoder mappings. This enables not only the direct transplantation of specific skills, but also the scalable and dynamic modulation of model behavior using small, reusable steering cartridges, without retraining or inflating input length (Oozeer et al., 6 Mar 2025, Tang et al., 17 Jul 2025, Kari, 19 Oct 2025).
2. Activation Space Mapping Methodologies
CAST operationalizes transfer via activation space mapping, employing either autoencoder architectures or linear projection heads depending on the task and heterogeneity of the involved models.
For safety interventions, an autoencoder is trained to approximate the mapping between source and target hidden states at a chosen "steerable" layer. The mapping takes the form
$$\hat{h}_t = W_{\mathrm{dec}}\,\sigma\!\left(W_{\mathrm{enc}}\,h_s + b_{\mathrm{enc}}\right) + b_{\mathrm{dec}},$$
where $h_s$ is the source activation and $W_{\mathrm{enc}}, W_{\mathrm{dec}}, b_{\mathrm{enc}}, b_{\mathrm{dec}}$ are learned weights and biases. Optimization is performed using a composite loss (sketched in code below the list):
- Mean squared error for direct activation reconstruction
- KL divergence between token distributions
- Cosine similarity for representation alignment
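A minimal PyTorch sketch of this mapping and composite loss. The names (ActivationMapper, the hidden width, the loss weights) are illustrative assumptions, not taken from the cited papers; the target-side logits would come from running the target model with the mapped activations patched in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActivationMapper(nn.Module):
    """Autoencoder mapping source-model hidden states into the target model's space."""
    def __init__(self, d_src: int, d_tgt: int, d_hidden: int = 1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_src, d_hidden), nn.GELU())
        self.decoder = nn.Linear(d_hidden, d_tgt)

    def forward(self, h_src: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(h_src))

def composite_loss(h_pred, h_tgt, logits_pred, logits_tgt,
                   w_mse=1.0, w_kl=0.5, w_cos=0.5):
    """MSE for activation reconstruction, KL over token distributions, cosine alignment."""
    mse = F.mse_loss(h_pred, h_tgt)
    kl = F.kl_div(F.log_softmax(logits_pred, dim=-1),
                  F.softmax(logits_tgt, dim=-1), reduction="batchmean")
    cos = 1.0 - F.cosine_similarity(h_pred, h_tgt, dim=-1).mean()
    return w_mse * mse + w_kl * kl + w_cos * cos
```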
In LoRA behavior transfer, lightweight bidirectional projection heads ($P_{t\to s}$, $P_{s\to t}$) are trained to achieve manifold translation (see the sketch after this list):
- Target activations $h_t$ are mapped into the source manifold: $\tilde{h}_s = P_{t\to s}(h_t)$
- The frozen LoRA behavioral kernel is applied in source space: $\Delta_s = B A\,\tilde{h}_s$
- The behavioral delta is projected back to the target: $\Delta_t = P_{s\to t}(\Delta_s)$
- The target activation is updated: $h_t' = h_t + \Delta_t$
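The steps above can be collected into a single module. This is a sketch under assumed names (CrossModelLoRA, P_t2s, P_s2t) and assumes the frozen LoRA factors A and B have been extracted from the source model; it is not the reference implementation.

```python
import torch
import torch.nn as nn

class CrossModelLoRA(nn.Module):
    """Applies a frozen source-model LoRA kernel to target-model activations."""
    def __init__(self, d_tgt: int, d_src: int, lora_A: torch.Tensor, lora_B: torch.Tensor):
        super().__init__()
        self.P_t2s = nn.Linear(d_tgt, d_src, bias=False)   # target -> source manifold
        self.P_s2t = nn.Linear(d_src, d_tgt, bias=False)   # source -> target manifold
        self.register_buffer("A", lora_A)                  # (r, d_src), frozen
        self.register_buffer("B", lora_B)                  # (d_src, r), frozen

    def forward(self, h_tgt: torch.Tensor) -> torch.Tensor:
        h_src = self.P_t2s(h_tgt)                  # map into the source space
        delta_src = h_src @ self.A.T @ self.B.T    # frozen behavioral kernel: B A h
        delta_tgt = self.P_s2t(delta_src)          # project the delta back
        return h_tgt + delta_tgt                   # update the target activation
```

Only the two projection heads are trainable; the LoRA factors stay frozen, which is what allows the same behavioral kernel to be reused across host models.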
The mapping is always trained without task-specific fine-tuning; general text corpora are used for the transfer objectives.
3. Steering Vectors and Safety Interventions
A central concept in CAST is the "steering vector" (Editor's term), an activation-space direction identified via contrastive analysis between desirable and undesirable behavior samples. Formally,
$$v = \bar{h}^{+} - \bar{h}^{-},$$
where $\bar{h}^{+}$ and $\bar{h}^{-}$ are the average hidden states over the desirable and undesirable labeled prompt sets, respectively. Such vectors can be injected into or ablated from model activations, producing predictable modifications in output (e.g., suppression of backdoor triggers or enhancement of refusal behaviors).
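Extraction and application reduce to a mean difference and an additive edit. The sketch below assumes the contrastive hidden states have already been collected into tensors; the function names are illustrative.

```python
import torch

def steering_vector(h_desirable: torch.Tensor, h_undesirable: torch.Tensor) -> torch.Tensor:
    """v = mean hidden state over desirable prompts minus mean over undesirable prompts."""
    return h_desirable.mean(dim=0) - h_undesirable.mean(dim=0)

def steer(h: torch.Tensor, v: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Inject (alpha > 0) or ablate (alpha < 0) the steering direction from activations."""
    return h + alpha * v
```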
CAST enables steering vector transfer by mapping these vectors across models' latent spaces. Minimal layer patching, often just replacing activations at one critical transformer block, can substantially alter a model's safety profile (e.g., reducing backdoor trigger rates or imparting refusal behavior). The method achieves alignment of output distributions between base and fine-tuned models, and allows toggling between safe and unsafe modes using a mapped vector as a "lightweight safety switch" (Oozeer et al., 6 Mar 2025).
4. Cross-task Transfer via Contrastive Activation Steering
In domain and linguistic transfer, CAST leverages activation steering to infuse low-resource queries with the semantic richness intrinsic to in-context learning. The process includes:
- Extraction of “contrastive representation–enhanced activations” from high-resource tasks:
$$\Delta a^{(l)} = a^{(l)}_{\mathrm{fs}} - a^{(l)}_{\mathrm{zs}},$$
where $a^{(l)}_{\mathrm{fs}}$ and $a^{(l)}_{\mathrm{zs}}$ are the layer-$l$ activations for few-shot (fs) and zero-shot (zs) prompts.
- Injection during inference:
$$\tilde{a}^{(l)} = a^{(l)} + \lambda\,\Delta a^{(l)},$$
with $\lambda$ controlling the injection strength (a hook-based sketch follows this list).
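At inference time the injection can be realized with a standard PyTorch forward hook on the chosen transformer block. The layer choice, `delta`, and `lam` below are assumptions for illustration, not values from the paper.

```python
import torch

def register_steering_hook(layer_module: torch.nn.Module, delta: torch.Tensor, lam: float = 0.1):
    """Adds lam * delta to the hidden states emitted by the hooked transformer block."""
    def hook(module, inputs, output):
        # Some blocks return a tuple (hidden_states, ...); patch only the first element.
        if isinstance(output, tuple):
            return (output[0] + lam * delta,) + output[1:]
        return output + lam * delta
    return layer_module.register_forward_hook(hook)

# Hypothetical usage: handle = register_steering_hook(model.model.layers[l], delta_a, lam=0.1)
# Run zero-shot inference, then handle.remove() to restore the unsteered model.
```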
Robust sample selection is performed via influence-diffusion graphs, in which each sample's impact and representation diversity are synthesized into a single selection score. Greedy selection then ensures that the injected steering activation covers influential and diverse characteristics of the high-resource source, maximizing downstream transfer performance (Tang et al., 17 Jul 2025).
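The selection step can be approximated by a generic greedy procedure that trades off a per-sample influence score against distance to already-selected samples. This is a sketch of the pattern only; the actual influence-diffusion scoring in Tang et al. may differ.

```python
import torch

def greedy_select(reps: torch.Tensor, influence: torch.Tensor, k: int, beta: float = 0.5):
    """reps: (n, d) sample representations; influence: (n,) per-sample influence scores."""
    chosen: list[int] = []
    for _ in range(k):
        if chosen:
            # Diversity: distance from each candidate to its nearest already-chosen sample.
            diversity = torch.cdist(reps, reps[chosen]).min(dim=1).values
        else:
            diversity = torch.ones_like(influence)
        score = influence + beta * diversity
        score[chosen] = float("-inf")      # never re-select a chosen sample
        chosen.append(int(score.argmax()))
    return chosen
```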
5. Adapter Portability and Activation Manifold Projection
CAST enables adapter portability (for adapters such as LoRA) between dissimilar LLM architectures, circumventing brittle weight-space alignment. The process realizes direct activation manifold projection, with the LoRA adapter treated as a behavioral kernel operationalized via learned projections:
- Target activation mapped into source manifold
- Behavioral kernel applied frozen
- Output mapped back and injected into the target stream
The system is trained for functional equivalence (output logits alignment via KL divergence) and geometric alignment (hidden state matching via mean squared error), yielding transfer effectiveness between heterogeneous models such as Llama-2 and Mistral (Kari, 19 Oct 2025). Empirical results report retention of 85–95% of retrained adapter performance, outpacing conventional weight-space transfer (60–80%).
6. Empirical Performance, Scalability, and Efficiency
CAST frameworks have been extensively evaluated over multiple axes:
- Safety steering transfer across model families (Llama, Qwen, Gemma) and sizes (1B ↔ 3B)
- Adapter transfer between architectures (e.g., Llama-2 ↔ Mistral, GPT-2 ↔ GPT-2-medium)
- Cross-task and cross-lingual transfer using structured benchmarks (ARC, AG-news, BoolQ, MNLI, MARC)
Across these experiments, metrics such as ROUGE, BLEURT, BERTScore, KL-divergence, and accuracy consistently confirm that activation-space transfer preserves the semantic integrity of outputs while efficiently imparting new behaviors. The overhead is negligible (the mapping autoencoder sometimes constitutes only 0.32% of the model's parameters), inference remains fast, and context-length constraints are eliminated because the method operates in latent space rather than the input window.
7. Applications, Limitations, and Future Directions
CAST holds direct implications for AI safety, model interoperability, and scalable knowledge transfer:
- Provides modular, on-the-fly safety controllers by swapping activation cartridges without retraining or extensive weight patching.
- Enables true zero-shot portability of adapter behaviors—even across families and granularities of architectures—supporting rapid deployment and task specialization.
- Facilitates rigorous mechanistic interpretability by exposing functional activation structures amenable to direct manipulation.
Limitations include dependency on access to internal activations—which may restrict use in closed-source or black-box models. Prospective research directions involve activation approximation, expansion to multimodal models, refined sample selection metrics, and dynamic adaptation mechanisms (e.g., layerwise or strength-tuned injection). This suggests that activation-space methods represent a foundational shift in efficient and generalizable machine learning transfer paradigms, promising broader impacts across AI safety, transfer learning, and deployment strategies.