Mechanistic basis of persona vector generalization
Determine the mechanism by which persona vectors causally influence expression of their associated trait under activation steering, and by which they predict finetuning-induced behavioral shifts, in transformer-based chat models. Persona vectors are defined as difference-in-means directions in the residual stream, computed from trait-expressing versus trait-suppressing responses.
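To make the objects in the problem statement concrete, here is a minimal sketch of a difference-in-means direction, additive steering along it, and projection onto it. It runs on synthetic NumPy arrays that stand in for residual-stream activations. The function names (persona_vector, steer, projection), the dimensions, and the steering coefficient alpha are illustrative assumptions, not the paper's code. The choice of layer and the way activations are pooled over tokens are not specified here.

```python
import numpy as np

def persona_vector(pos_acts, neg_acts):
    """Difference-in-means direction: mean of trait-expressing activations
    minus mean of trait-suppressing activations (rows are samples)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, alpha):
    """Additive activation steering: shift every position by alpha * vector."""
    return hidden + alpha * vector

def projection(hidden, vector):
    """Scalar projection of each activation onto the unit-normalized vector."""
    unit = vector / np.linalg.norm(vector)
    return hidden @ unit

# Synthetic stand-in data (assumption): a single planted "trait" axis plus noise.
rng = np.random.default_rng(0)
d_model = 16
trait_dir = np.zeros(d_model)
trait_dir[0] = 1.0
pos = rng.normal(size=(200, d_model)) + 2.0 * trait_dir  # trait-expressing
neg = rng.normal(size=(200, d_model)) - 2.0 * trait_dir  # trait-suppressing

v = persona_vector(pos, neg)

# Steering moves each activation's projection by alpha * ||v||.
hidden = rng.normal(size=(5, d_model))
alpha = 1.5
steered = steer(hidden, v, alpha)
shift = projection(steered, v) - projection(hidden, v)

# Cosine similarity between the recovered direction and the planted axis.
cosine = float(v @ trait_dir / np.linalg.norm(v))
```

In this toy setting the direction is recovered because the trait was planted as a mean shift along one axis. The open question is why a direction obtained this way in a real chat model both controls the trait when added to the residual stream and predicts behavioral shifts after finetuning.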
References
The mechanistic basis for this generalization is unclear, though we suspect it has to do with personas being latent factors that persist for many tokens; thus, recent expression of a persona should predict its near-future expression.
— Persona Vectors: Monitoring and Controlling Character Traits in Language Models
(2507.21509 - Chen et al., 29 Jul 2025) in Conclusion