Assistant Axis in LLM Activation Space
- The Assistant Axis is the principal direction in LLM activation space that encodes the model's default Assistant persona, yielding a quantifiable persona metric.
- Its extraction leverages PCA on activations from diverse archetypes, demonstrating up to 70% variance explained with only a few components.
- Steering along this axis modulates persona drift and reduces harmful responses, with activation capping lowering jailbreak success by 60%.
The term "Assistant Axis" denotes the principal direction in the internal activation space of LLMs that encodes the degree to which the model exhibits its default Assistant persona. This axis provides a quantitative and empirically validated framework for understanding, measuring, and steering the alignment of LLM responses with the intended helpful, harmless, and generally human-like Assistant character. The concept is grounded in the analysis of model activations prompted by a broad range of character archetypes, and is shown to predict, modulate, and stabilize behavioral properties such as persona drift and vulnerability to persona-based jailbreaks (Lu et al., 15 Jan 2026).
1. Persona Space Construction and Mathematical Formalism
The persona space is a low-dimensional representation extracted from the high-dimensional activation vectors of LLMs. To construct this space, activations are collected at a specific residual-stream layer (typically mid-layer) for a diverse set of character archetypes (e.g., “bard,” “analyst,” “oracle”), each elicited using curated system prompts and broad sets of extraction queries. For each role, average post-MLP activations are computed across rollouts and tokens, yielding one representative vector per role. In parallel, activations corresponding to the default Assistant persona are sampled at scale using standard chat datasets (e.g., n₀ ≈ 18,777 activations).
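The per-role averaging step can be sketched with synthetic activations (a minimal sketch: the model-hook plumbing that would actually collect post-MLP residual-stream activations is omitted, and shapes and names are illustrative):

```python
import numpy as np

def role_mean_activation(rollout_activations):
    """Average post-MLP activations across rollouts and tokens for one role.

    rollout_activations: list of arrays, each (n_tokens, d_model), one per rollout.
    Returns a single (d_model,) representative vector for the role.
    """
    all_tokens = np.concatenate(rollout_activations, axis=0)  # (total_tokens, d)
    return all_tokens.mean(axis=0)

# Toy example: two rollouts over a hypothetical 4-dim residual stream
rollouts = [np.ones((3, 4)), 3.0 * np.ones((3, 4))]
vec = role_mean_activation(rollouts)  # mean of six token vectors
```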
Stacking these vectors forms a data matrix $X \in \mathbb{R}^{n \times d}$, with mean $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$. Principal Component Analysis (PCA) is performed on the empirical covariance

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{\top},$$

yielding principal directions $v_1, v_2, \ldots$; the first principal component ($v_1$) is found to consistently align with “Assistant-ness.” Notably, as few as 4–19 PCs explain 70% of the variance across state-of-the-art models.
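This construction can be sketched with PCA via SVD on synthetic persona vectors (stand-ins for real model activations; the dominance of one direction mimics the Assistant Axis):

```python
import numpy as np

def pca_directions(X, k):
    """PCA on stacked persona vectors (rows of X): returns the top-k principal
    directions and the fraction of total variance they explain."""
    mu = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return Vt[:k], float(explained[:k].sum())

# Synthetic persona vectors dominated by one direction (illustrative only)
rng = np.random.default_rng(0)
axis = np.zeros(8)
axis[0] = 1.0
X = np.outer(rng.normal(size=40) * 5.0, axis) + 0.1 * rng.normal(size=(40, 8))
dirs, frac = pca_directions(X, 1)  # first PC recovers the dominant direction
```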
To robustly define the Assistant Axis across architectures, the default Assistant contrast vector is used:

$$v_{\text{asst}} = \frac{1}{n_0} \sum_{j=1}^{n_0} a_j - \mu,$$

where the $a_j$ are Assistant activations. This vector, normalized ($\hat{v} = v_{\text{asst}} / \lVert v_{\text{asst}} \rVert$), exhibits cosine similarity > 0.7 to PC₁ in models such as Gemma, Qwen, and Llama.
For an activation $h$, the scalar projection (the “Assistant score”) is

$$\alpha(h) = \langle h, \hat{v} \rangle,$$

with higher $\alpha$ indicating more Assistant-like behavior.
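The contrast vector and the Assistant score can be sketched as follows (the exact centering term used in the paper is an assumption here; the toy data is purely illustrative):

```python
import numpy as np

def assistant_axis(assistant_acts, mu):
    """Unit contrast vector: mean Assistant activation minus the persona-space
    mean mu. (Sketch; the paper's precise centering may differ.)"""
    v = assistant_acts.mean(axis=0) - mu
    return v / np.linalg.norm(v)

def assistant_score(h, v_hat):
    """Assistant score: scalar projection of activation h onto the unit axis."""
    return float(np.dot(h, v_hat))

# Toy 4-dim example: Assistant activations shifted along the first coordinate
assistant_acts = np.tile(np.array([2.0, 0.0, 0.0, 0.0]), (5, 1))
v_hat = assistant_axis(assistant_acts, np.zeros(4))
score = assistant_score(np.array([3.0, 1.0, 0.0, 0.0]), v_hat)
```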
2. Behavioral Role of the Assistant Axis
Steering LLM activations along the Assistant Axis deterministically modulates the persona adopted by the model. Direct intervention is performed by adding $c \, \lVert \bar{h} \rVert \, \hat{v}$ to every token activation $h$ at a selected layer, where the coefficient is scaled by the mean norm $\lVert \bar{h} \rVert$ of activations at that layer:

$$h \leftarrow h + c \, \lVert \bar{h} \rVert \, \hat{v}.$$

As $c$ decreases (i.e., steering away from the Assistant), models increasingly express alternative personas, with introspective queries revealing a transition from default Assistant answers to those reflecting novel or even mystical personae.
Empirical evaluation across 50 archetypes shows the adoption of new personas rises from ≈20% under mild negative steering to ≈70% under strong negative steering, with mystical styles prevalent (≈40%) at strongly negative $c$. Conversely, steering toward the Assistant sharply reduces persona flexibility but enhances response harmlessness, as observed in quantitative jailbreak and classification studies.
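The steering intervention above can be sketched as a pure array operation (in practice it would run inside a forward hook; the mean-norm scaling convention is inferred from the description and is an assumption):

```python
import numpy as np

def steer(h_tokens, v_hat, c, mean_norm):
    """Add c * mean_norm * v_hat to every token activation at one layer.
    Negative c steers away from the Assistant persona."""
    return h_tokens + c * mean_norm * v_hat

# Two toy token activations of dimension 3, steered away from the axis
tokens = np.zeros((2, 3))
v_hat = np.array([1.0, 0.0, 0.0])
steered = steer(tokens, v_hat, c=-1.0, mean_norm=2.0)
```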
3. Persona Drift and Its Measurement
Persona drift quantifies deviations from the intended Assistant persona during multi-turn interaction. For each turn $t$, the Assistant score $\alpha_t$ is measured; drift is

$$\delta_t = \alpha_t - \bar{\alpha}_{\text{asst}},$$

with $\bar{\alpha}_{\text{asst}}$ the mean Assistant projection on standard queries.
Empirical linkages are established between low $\alpha$ (i.e., drift away from the Assistant persona) and the incidence of harmful or bizarre responses. In two-turn dialogues, correlation coefficients of roughly −0.5 are observed between $\alpha_1$ and turn-2 harmful rates, with low $\alpha_1$ yielding substantially higher harmful rates than high $\alpha_1$. Regression analyses confirm that persona drift is driven by user message content rather than cumulative interaction.
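Turn-level drift measurement can be sketched with toy activations (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def turn_scores(turn_activations, v_hat):
    """Mean Assistant score per turn: average token projection for each turn.
    turn_activations: one (n_tokens, d) array per turn."""
    return np.array([float(np.mean(acts @ v_hat)) for acts in turn_activations])

def persona_drift(scores, baseline):
    """Drift per turn relative to the mean Assistant projection on standard queries."""
    return scores - baseline

# Two toy turns in a 2-dim activation space; the second turn drifts off-axis
turns = [np.array([[1.0, 0.0], [3.0, 0.0]]), np.array([[0.0, 4.0]])]
v_hat = np.array([1.0, 0.0])
drift = persona_drift(turn_scores(turns, v_hat), baseline=2.0)
```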
4. Stabilization: Activation Capping Techniques
To enforce persona stability, activation capping constrains activations to remain above a threshold $\tau$ along the Assistant Axis. In each intervened layer and for each token, the procedure

$$h \leftarrow h + (\tau - \alpha(h)) \, \hat{v} \quad \text{if } \alpha(h) < \tau$$

effectively clamps the projection to $\alpha(h) \geq \tau$. This guarantees responses stay in the intended persona regime and prevents “falling off” the axis into harmful or idiosyncratic behavior. The method can be implemented with the following code:
```python
import numpy as np

def activation_capping(h, v, tau):
    """Clamp the projection of activation h onto the unit axis v to at least tau."""
    alpha = np.dot(h, v)
    if alpha < tau:
        # Lift h along v until its projection reaches the threshold
        h = h + (tau - alpha) * v
    return h
```
This intervention, applied to selected layers (e.g., 46–53 of 64 in Qwen or 56–71 of 80 in Llama) and using $\tau$ set at the 25th percentile of Assistant projections, yields a roughly 60% reduction in persona-jailbreak success rates, with <5% drop in standard capabilities across multiple benchmarks (IFEval, MMLU Pro, GSM8K, EQ-Bench).
| Cap Setting | Harmful Rate Δ | IFEval Δ | MMLU Pro Δ | GSM8K Δ | EQ-Bench Δ |
|---|---|---|---|---|---|
| Unsteered | 0% | 0% | 0% | 0% | 0% |
| Activation capping | −58% | −2% | −3% | −1% | −4% |
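Threshold calibration and a vectorized form of the capping rule can be sketched as follows (assuming the 25th-percentile convention described above; names are illustrative):

```python
import numpy as np

def calibrate_tau(assistant_scores, pct=25):
    """Threshold at a percentile of baseline Assistant projections
    (the 25th percentile, per the setting described above)."""
    return float(np.percentile(assistant_scores, pct))

def cap_layer(h_tokens, v_hat, tau):
    """Vectorized capping over all tokens at one layer: projections below tau
    are lifted back up to the tau level set along v_hat."""
    alphas = h_tokens @ v_hat                  # (n_tokens,) projections
    deficit = np.maximum(tau - alphas, 0.0)    # zero for tokens already above tau
    return h_tokens + deficit[:, None] * v_hat

tau = calibrate_tau([0.0, 1.0, 2.0, 3.0, 4.0])   # -> 1.0
v_hat = np.array([1.0, 0.0])
capped = cap_layer(np.array([[0.5, 5.0], [2.0, 0.0]]), v_hat, tau)
```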
5. Empirical Pipeline and Model Evaluation
The methodology is validated across multiple dense transformer architectures (e.g., Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B), with persona extraction spanning 275 character roles and 240 traits. PCA on the role space consistently reveals a small number of dominant components, with the Assistant Axis always prominent.
Persona-based jailbreak scenarios are quantitatively benchmarked using 1,100 prompts spanning 44 harm categories, with external LLM judging and human agreement at 91.6%. Standard capabilities are evaluated via instruction-following, general knowledge, mathematical, and emotional intelligence tasks.
Qualitative stabilization is also demonstrated in challenging conversational settings, e.g., cases involving suicidal ideation or AI delusions, where activation capping both stabilizes Assistant-axis projections and prevents harmful advice.
6. Theoretical and Practical Implications
The existence of a robust, low-dimensional Assistant Axis suggests that the default helpful persona of LLMs is realized principally through coordinated activation patterns at certain network layers. Post-training moves models toward a “safe region” in persona space but does not permanently anchor them there, explaining the tendency for drift under adversarial or emotionally charged interaction.
A plausible implication is that improved training or more granular steering might further reduce model susceptibility to persona drift and jailbreaks. The Assistant Axis formalism provides both diagnostic and interventional tools for future research seeking to ensure behavioral stability and safety in LLM deployments (Lu et al., 15 Jan 2026).