Personality Activation Search (PAS)
- Personality Activation Search (PAS) is a framework that identifies, validates, and manipulates latent personality trait representations in language models using geometrically meaningful activation directions.
- PAS employs contrastive activation analysis, vector normalization, and dynamic trait steering to achieve fine-grained control over personality expression in both static and multi-turn scenarios.
- PAS offers state-of-the-art performance by enabling interpretable, efficient, and context-aware personality modifications without requiring gradient updates or retraining.
Personality Activation Search (PAS) refers to a framework for identifying, validating, and manipulating latent personality trait representations in the internal activations of LLMs. Rather than relying on prompt engineering or fine-tuning, PAS exploits the existence of approximately linear, semantically meaningful directions within a model’s residual stream corresponding to psychological constructs such as the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism). Key techniques include contrastive activation analysis, vector algebra over personality axes, and dynamic trait steering at inference, enabling fine-grained and compositional personality control with no gradient updates or model retraining. PAS has been implemented across diverse architectures, yielding state-of-the-art results in static and dynamic personality benchmarks, and providing a foundation for interpretable, efficient, and context-aware personality control in LLMs (Feng et al., 17 Feb 2026, Bhandari et al., 29 Oct 2025, Ma et al., 14 Jan 2026, Zhu et al., 2024).
1. Theoretical Foundations and Motivation
PAS is motivated by the observation that LLM behaviors relating to personality are not merely superficial prompt artifacts but are encoded as geometrically structured, approximately orthogonal directions in activation space. These directions can be interpreted, manipulated, and composed algebraically, supporting rigorous, mathematically grounded control over model behaviors associated with distinct personality traits.
The PAS paradigm addresses limitations of prompt-based or finetuning-based personality control, which are static, non-compositional, and cannot adapt traits at fine granularity or in context. By shifting the control locus to activation space, PAS enables:
- Discovery of trait-specific directions in the model’s hidden representations (typically via residual stream activations).
- Intensity and compositional control of traits via scalar multiplication, addition, and subtraction in activation space.
- Dynamic, context-sensitive adaptation of personality at inference time, reflecting the multi-faceted and non-static nature of human traits (Feng et al., 17 Feb 2026, Ma et al., 14 Jan 2026, Tang et al., 9 Sep 2025).
2. Methodological Frameworks for PAS
The PAS pipeline, as established in the literature, consists of the following stages:
2.1 Contrastive Trait Direction Extraction
- For each trait pole (e.g., high/low Openness), prompts are generated to selectively elicit or suppress the trait.
- The model is run on trait-relevant questions under both pole prompts. Layerwise hidden activations are collected, typically at a specific transformer layer where steerability is maximized (e.g., layer 18–20 in Llama/Qwen architectures).
- The trait vector is defined by the mean difference of pooled activations:

  $v_t = \frac{1}{N^{+}} \sum_{i=1}^{N^{+}} h_i^{+} \;-\; \frac{1}{N^{-}} \sum_{j=1}^{N^{-}} h_j^{-}$

  where $h_i^{+}$ and $h_j^{-}$ denote residual activations under positive/negative trait prompts (Feng et al., 17 Feb 2026, Allbert et al., 2024, Bhandari et al., 29 Oct 2025, Ma et al., 14 Jan 2026).
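The mean-difference step can be sketched in a few lines of NumPy (a minimal illustration, not the papers' released code; the function name and array shapes are assumptions):

```python
import numpy as np

def extract_trait_vector(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Contrastive mean-difference trait direction.

    h_pos, h_neg: (n_prompts, d_model) residual-stream activations collected
    at the chosen layer under positive / negative trait prompts.
    Returns the unnormalized trait vector v_t = mean(h_pos) - mean(h_neg).
    """
    return h_pos.mean(axis=0) - h_neg.mean(axis=0)
```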
2.2 Trait Vector Normalization and Orthogonalization
- Trait vectors are $\ell_2$-normalized to unit length: $\hat{v}_t = v_t / \lVert v_t \rVert_2$.
- (Optionally) Gram–Schmidt orthogonalization enforces approximate independence, though cosine similarities between distinct trait vectors are empirically low (near-orthogonal) even without it.
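Both steps are standard linear algebra; a minimal sketch (illustrative helper names, not library code from the cited papers):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a trait vector to unit length."""
    return v / np.linalg.norm(v)

def gram_schmidt(trait_vectors):
    """Optionally orthonormalize a set of trait vectors so that each
    direction is independent of the ones processed before it."""
    basis = []
    for v in trait_vectors:
        for b in basis:
            v = v - np.dot(v, b) * b   # remove the component along b
        basis.append(l2_normalize(v))
    return basis
```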
2.3 Layer and Head Selection
- Empirical sweeps identify trait-sensitive layers (e.g., layer 20).
- Alternatives leverage hybrid layer selection, aggregating across layers to maximize behavioral separability while minimizing interference and variance (Bhandari et al., 29 Oct 2025).
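A layer sweep can be sketched with a simple signal-to-noise criterion over projected activations (an illustrative scoring rule; the cited works use behavioral separability measures, and the dict-based interface here is an assumption):

```python
import numpy as np

def select_steering_layer(acts_pos, acts_neg):
    """Pick the layer whose trait direction best separates the two prompt
    pools, scored as mean gap over pooled spread of the projections.

    acts_pos, acts_neg: dicts mapping layer index -> (n, d_model) arrays.
    """
    best_layer, best_score = None, -np.inf
    for layer, h_pos in acts_pos.items():
        h_neg = acts_neg[layer]
        v = h_pos.mean(axis=0) - h_neg.mean(axis=0)
        v = v / (np.linalg.norm(v) + 1e-8)
        p, n = h_pos @ v, h_neg @ v          # project onto the direction
        score = (p.mean() - n.mean()) / (p.std() + n.std() + 1e-8)
        if score > best_score:
            best_layer, best_score = layer, score
    return best_layer
```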
2.4 Subspace and Low-Rank Structure
- Trait vectors are stacked into a low-rank matrix and subjected to dimensionality reduction (SVD/PCA), revealing that personality expresses as a low-rank, interpretable subspace in hidden state space.
- Subspace projections regularize trait vectors and mitigate overfitting or spurious correlations (Bhandari et al., 29 Oct 2025).
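The stacking-and-projection step can be sketched via SVD (a simple regularization sketch; the choice of `rank` and the helper name are assumptions):

```python
import numpy as np

def trait_subspace(trait_vectors, rank):
    """Low-rank personality subspace via SVD.

    Stacks trait vectors into a (n_traits, d_model) matrix, keeps the top
    `rank` right-singular vectors, and re-projects each trait vector onto
    that subspace.
    """
    M = np.stack(trait_vectors)
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    basis = Vt[:rank]                       # (rank, d_model) orthonormal rows
    projected = (M @ basis.T) @ basis       # traits expressed in the subspace
    return basis, projected
```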
3. Personality Algebra and Dynamic Steering
PAS supports a mathematically tractable "persona algebra" expressed as follows:
- Scalar Multiplication: modulates trait intensity by scaling the vector: $v' = \alpha \hat{v}_t$.
- Vector Addition: creates composite personas by summing trait directions: $v_{\mathrm{persona}} = \sum_k \alpha_k \hat{v}_{t_k}$.
- Vector Subtraction: suppresses or negates selected traits: $v' = v_{\mathrm{persona}} - \alpha \hat{v}_t$.
For dynamic, context-aware control (Persona-Flow), trait coefficients are predicted for each turn based on dialog context and persona specification. The final composite direction is injected at the chosen layer, yielding contextually adapted personality expression (Feng et al., 17 Feb 2026).
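The composition-and-injection step can be sketched as follows (a NumPy stand-in for a framework forward hook, assuming the per-turn coefficients have already been predicted from context; names are illustrative):

```python
import numpy as np

def compose_and_inject(hidden, trait_vectors, coeffs):
    """Persona algebra plus injection for one dialog turn.

    hidden:         (seq_len, d_model) residual states at the chosen layer.
    trait_vectors:  unit trait directions, one per personality axis.
    coeffs:         per-turn coefficients; positive amplifies a trait,
                    negative suppresses it, magnitude sets intensity.

    Returns the steered hidden states h + sum_k alpha_k * v_k.
    """
    steer = sum(a * v for a, v in zip(coeffs, trait_vectors))
    return hidden + steer
```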
Pseudocode for dynamic steering appears in published PAS implementations; the table below summarizes the main stages:
| Stage | Operation | Typical Formula / Procedure |
|---|---|---|
| Trait extraction | Contrastive mean difference | $v_t = \bar{h}^{+} - \bar{h}^{-}$ |
| Normalization | $\ell_2$ unit scaling | $\hat{v}_t = v_t / \lVert v_t \rVert_2$ |
| Orthogonalization | Gram–Schmidt (optional) | $\hat{v}_t \leftarrow \hat{v}_t - \sum_{j<t} (\hat{v}_t \cdot \hat{v}_j)\, \hat{v}_j$ |
| Dynamic composition | Per-turn coefficient prediction | $v_{\mathrm{steer}} = \sum_k \alpha_k(\mathrm{context})\, \hat{v}_{t_k}$ |
| Injection | Residual-stream addition at chosen layer | $h \leftarrow h + v_{\mathrm{steer}}$ (Feng et al., 17 Feb 2026, Bhandari et al., 29 Oct 2025) |
4. Evaluation Protocols and Empirical Results
PAS efficacy is established on a variety of benchmarks emphasizing both static trait control and dynamic, multi-turn persona adaptation:
- PersonalityBench: Benchmarks static trait editing with situational questions and judge model scoring. PAS achieves mean scores matching or surpassing supervised fine-tuning (e.g., 9.60 vs. 9.61 for PERSONA-BASE vs. SFT upper bound) (Feng et al., 17 Feb 2026).
- Persona-Evolve and Multi-Turn Adaptation: Measures adherence, consistency, authenticity, and information fidelity across dynamic personas; PAS secures up to 91% pairwise win rates, with robustness across model scales.
- Fluency and General Capability: Trait steering preserves or improves fluency and leaves general reasoning benchmarks (e.g., MMLU/ARC) within ±2 points of base accuracy, with no catastrophic degradation (Bhandari et al., 29 Oct 2025).
- Prospective Social-Emotional Metrics: Mixture-of-Experts PAS (PersonaFuse) generates strong gains on emotion and Theory-of-Mind benchmarks, with empirical ablations confirming dynamic routing superiority over random or static alignment (Tang et al., 9 Sep 2025).
5. Comparative Techniques and Extensions
PAS is distinguished from:
- Prompt-based and ICL Methods: PAS outperforms few-shot prompt baselines in both trait-alignment and subjective quality, while incurring lower sensitivity to prompt variants and paraphrasing (Ma et al., 14 Jan 2026).
- Training-based Interventions (DPO/PPO/LoRA/QLoRA): PAS requires no gradient updates, is highly data-efficient (roughly one-sixth the training time of PPO), and achieves lower composite alignment error on PAPI and related inventories (Zhu et al., 2024).
- Probe-Based and Head-Targeted Interventions: PAS applied at top-K heads and selected layers provides fine-grained control, scalable alignment, and minimal memory/computational footprint (Zhu et al., 2024).
Hybrid methods combine PAS with fine-tuning for tasks requiring both subjective alignment and high-factuality, and can be extended to new personality batteries (e.g., HEXACO, Dark Triad) by updating the prompt/dataset schema (Zhu et al., 2024). All code, vectors, and data artifacts are generally released for reproducibility (Feng et al., 17 Feb 2026).
6. Stability, Explainability, and Limitations
PAS-based personality diagnoses and interventions demonstrate higher stability and lower variance than questionnaire-based or generation-based trait evaluations, as evidenced by low score variance across 10 prompt variants (PVNI protocol; Ma et al., 14 Jan 2026). Theoretical analysis justifies the near-linearity and compositionality of PAS operations via local linearity and rank-one adaptation assumptions.
Limitations include:
- Dependence on white-box access for internal activations (problematic for closed-source or API-only LLMs).
- Potential partial trait correlation, necessitating subspace projection or regularization.
- Case-specific risk of unintended semantic entanglement or trait “bleed-through” at high intervention intensities.
- Personality alignment is currently evaluated predominantly on Big Five axes and English-language datasets; extension to richer trait inventories, longer dialogic interactions, and multilingual settings remains ongoing.
7. Ethical and Practical Considerations
PAS enables both beneficial enhancement of user experience—by calibrating assistant personas to user preferences—and introduces new risks. Misuse scenarios include:
- Induction of extreme, toxic, or manipulative personalities if unregulated.
- Data and demographic bias propagation, if personality axes are misaligned with user populations.
- Attenuation of model factuality/content fidelity when personality traits drive significant stylistic alterations.
Recommended mitigations include intervention strength constraints, activation direction regularization (against known problematic axes), human-in-the-loop audits, transparency about manipulations, and active limitation of “high-risk” trait steering for critical applications (Allbert et al., 2024, Feng et al., 17 Feb 2026). Responsible PAS adoption aligns with broader AI safety and interpretability objectives.