SoulBench: Deterministic LLM Persona Profiling
- SoulBench is a dataset protocol that underpins the Soul Engine by enabling deterministic personality control in LLMs using orthogonal latent trait representations.
- It employs a dual-head architecture with frozen reasoning layers and trainable psychometric heads to segregate core intelligence from dynamic persona injection.
- Dynamic contextual sampling in SoulBench leads to robust generalization and profiling precision, achieving a mean squared error as low as 0.0113.
The Soul Engine is a framework for personalized LLMs that formulates personality traits as orthogonal linear subspaces within model latent spaces, enabling deterministic control and profiling of persona without performing global weight updates. Developed to address the stability–plasticity dilemma in LLM personalization, the Soul Engine introduces both a new mathematical foundation—the Linear Representation Hypothesis—and a practical method using a dual-head architecture on a frozen transformer backbone, in conjunction with a dynamically sampled persona dataset, SoulBench. The framework achieves high-precision trait profiling and robust behavioral steering, presenting a mathematically grounded alternative to probabilistic prompt engineering and conventional fine-tuning (Wang, 8 Dec 2025).
1. Rationale: The Stability–Plasticity Dilemma in LLM Personalization
Prevailing methods for aligning LLMs to specific personas, such as Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA), conceptualize persona as a narrow stylistic distribution learned via stochastic weight updates. This process is subject to an “alignment tax,” where reasoning and general knowledge degrade due to catastrophic forgetting, often observable as declines on standard evaluation benchmarks (e.g., MMLU). In-Context Learning (ICL) and prompt-based methods avoid weight updates but suffer from instruction “drift,” resulting in persona dilution or “catastrophic amnesia” during extended dialogues.
The Soul Engine circumvents these issues by eschewing global weight updates on the reasoning circuits. Instead, it posits that personality can be cleanly disentangled from reasoning as orthogonal components within the latent space, activated or deactivated deterministically via vector overlays. This architecture preserves the foundational intelligence of the model while affording stable and controllable style modulation.
2. The Linear Representation Hypothesis
The framework postulates that each of the Big Five OCEAN psychometric traits—Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism—corresponds to an orthogonal linear subspace in the final hidden activations of a transformer model. Specifically:
- Let $h \in \mathbb{R}^d$ denote the final hidden embedding for a given anchor input $x$.
- The psychometric dimension is captured by a projection matrix $W_p \in \mathbb{R}^{5 \times d}$, whose rows $w_1, \dots, w_5$ each define a trait subspace.
- The orthogonality constraint is enforced as $W_p W_p^\top = I_5$.
- Given a ground-truth personality trait vector $y \in \mathbb{R}^5$, the model’s prediction is $\hat{y} = W_p h$.
To regularize the separation of trait subspaces, the loss function incorporates an orthogonality term
$$\mathcal{L}_{\perp} = \left\lVert W_p W_p^\top - I_5 \right\rVert_F^2,$$
which penalizes non-orthogonality between personality directions.
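A minimal PyTorch sketch of this linear readout and the orthogonality regularizer; the hidden width `d_model` and the loss weighting `lam` are illustrative assumptions, not values from the paper:

```python
import torch

d_model = 896          # hidden width; illustrative value, not from the paper
n_traits = 5           # the five OCEAN dimensions

# Trait projection: each row of W_p is one trait direction in latent space.
W_p = torch.nn.Parameter(torch.randn(n_traits, d_model) / d_model ** 0.5)

def predict_traits(h: torch.Tensor) -> torch.Tensor:
    """Linear psychometric readout: y_hat = W_p h."""
    return h @ W_p.T                       # (batch, 5)

def orthogonality_loss(W: torch.Tensor) -> torch.Tensor:
    """Penalize non-orthogonality: || W W^T - I ||_F^2."""
    gram = W @ W.T                         # (5, 5) Gram matrix of trait directions
    eye = torch.eye(W.shape[0], device=W.device)
    return ((gram - eye) ** 2).sum()

def total_loss(h, y_true, lam=0.1):
    """Trait-regression MSE plus the orthogonality regularizer (weight assumed)."""
    mse = torch.nn.functional.mse_loss(predict_traits(h), y_true)
    return mse + lam * orthogonality_loss(W_p)
```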
3. Dual-Head Architecture and Latent Persona Manipulation
The architecture deploys a stratified freezing strategy atop the Qwen-2.5 transformer backbone:
- The lower layers (the first $20$ of $24$, per the ablations in Section 6) are frozen to preserve syntax and core reasoning manifolds.
- The upper layers and the final normalization are trainable, allowing psychometric probing.
- For an anchor input $x$, the final hidden state is $h = f_\theta(x) \in \mathbb{R}^d$.
- A two-headed output structure is used:
- Identity Head (contrastive): $z = g_\phi(h)$, a 2-layer MLP trained with a contrastive objective.
- Psychometric Head (linear): $\hat{y} = W_p h$.
- For trait injection, a desired score vector $y^*$ is mapped into the latent space as $v = W_p^\top y^*$, which can then be added to intermediate activations without perturbing the frozen reasoning circuits.
| Component | Operation | Dimensionality |
|---|---|---|
| Final hidden state | $h = f_\theta(x)$ | $\mathbb{R}^d$ |
| Identity head | $z = g_\phi(h)$ (2-layer MLP) | $\mathbb{R}^{d_z}$ |
| Psychometric head | $\hat{y} = W_p h$ | $\mathbb{R}^5$ |
| Persona vector | $v = W_p^\top y^*$ | $\mathbb{R}^d$ |
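A minimal sketch of the stratified freezing and dual-head readout, assuming a Hugging Face Qwen-2.5 checkpoint (`Qwen/Qwen2.5-0.5B`), the $20/24$ layer split reported in the ablations, and an assumed identity-head width; the paper's exact configuration may differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper's exact Qwen-2.5 variant is not specified here.
backbone = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Stratified freezing: freeze the first 20 of 24 decoder layers,
# leaving the top layers and final normalization trainable.
for p in backbone.parameters():
    p.requires_grad = False
for layer in backbone.layers[20:]:
    for p in layer.parameters():
        p.requires_grad = True
for p in backbone.norm.parameters():
    p.requires_grad = True

d = backbone.config.hidden_size

# Identity head: 2-layer MLP for the contrastive objective (output width assumed).
identity_head = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, 128)
)
# Psychometric head: linear map onto the five OCEAN trait directions.
psychometric_head = torch.nn.Linear(d, 5, bias=False)

def forward(anchor_text: str):
    """Return (z_id, y_hat) for one anchor: identity and trait readouts."""
    inputs = tokenizer(anchor_text, return_tensors="pt")
    h = backbone(**inputs).last_hidden_state[:, -1]   # final hidden state h
    return identity_head(h), psychometric_head(h)
```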
This architecture enables “deterministic steering”: during inference, desired personalities are injected via vector arithmetic, specifically at residual streams of selected layers. The resulting hidden state modification is $h' = h + \alpha v$, with $\alpha$ tuning personality strength.
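A hedged sketch of this injection as a forward hook on a mid-layer residual stream, reusing `backbone` and `d` from the sketch above; the target layer index and strength `alpha` are illustrative, not tuned values from the paper:

```python
import torch

alpha = 4.0            # steering strength alpha; illustrative value
target_layer = 12      # mid-layer injection point; illustrative

def make_steering_hook(v: torch.Tensor, alpha: float):
    """Add alpha * v to the residual stream output of one decoder layer."""
    def hook(module, inputs, output):
        # Decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v           # h' = h + alpha * v
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# v is a persona vector, e.g. v = W_p.T @ y_star for a desired trait vector y_star;
# a random placeholder is used here.
v = torch.randn(d)
handle = backbone.layers[target_layer].register_forward_hook(
    make_steering_hook(v, alpha)
)
# ... run generation with the persona applied ...
handle.remove()        # deterministically restore the unsteered model
```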
4. SoulBench Dataset and Dynamic Contextual Sampling
The SoulBench protocol employs dynamic contextual sampling to avoid overfitting on exogenous linguistic content:
- For each character $c$, the sentence pool $S_c$ comprises all sentences attributed to $c$.
- At each training iteration $t$, $k$ randomly chosen sentences from $S_c$ are concatenated to form the anchor $x_t$, with $k$ drawn afresh per iteration.
- This approach induces a combinatorially large virtual dataset (on the order of $\binom{|S_c|}{k}$ distinct anchors per character), compelling the model to capture stylistic invariants.
- Ground-truth OCEAN scores are assigned using the Doubao-Seed-1.6 teacher model, prompted with complete character profiles to ensure psychological consistency.
This sampling scheme is fundamental for robust generalization, focusing the psychometric head on structural persona features independent of content memorization; a minimal sketch of the sampling loop follows.
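The sketch below assumes a plain mapping from characters to their attributed sentences; the range for $k$ is an assumption, as the paper's exact bounds are not given here:

```python
import random

def sample_anchor(sentences_by_character: dict[str, list[str]],
                  k_min: int = 2, k_max: int = 8) -> tuple[str, str]:
    """Draw one training anchor: k random sentences from one character,
    concatenated so the model must rely on stylistic invariants rather
    than memorized content."""
    character = random.choice(list(sentences_by_character))
    pool = sentences_by_character[character]
    k_hi = min(k_max, len(pool))
    k = random.randint(min(k_min, k_hi), k_hi)
    anchor = " ".join(random.sample(pool, k))   # order-randomized concatenation
    return character, anchor

# Each character with |S_c| sentences yields on the order of C(|S_c|, k)
# distinct anchors: a combinatorially large virtual dataset.
```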
5. Profiling Accuracy, Disentanglement, and Persona Injection
The framework achieves high-precision personality profiling and orthogonal disentanglement:
- The model realized a Mean Squared Error (MSE) of $0.0113$ on validation, indicating high profiling precision relative to the ground-truth teacher scores.
- t-SNE visualizations of 1,000 character embeddings revealed distinct, continuous gradients in each OCEAN trait and minimal cluster overlap, confirming the geometric orthogonality of the trait subspaces in $W_p$.
- Zero-shot personality injection is enabled by defining a neutral mean embedding $\mu_{\text{neutral}}$ and a target persona mean $\mu_{\text{target}}$, with steering vector $v = \mu_{\text{target}} - \mu_{\text{neutral}}$. For deterministic style shifts, the model modifies activation streams as $h' = h + \alpha v$ (e.g., Neutral → Villain); a sketch of this construction follows the list.
- Parameter sweeps showed that injection at middle layers with moderate strength $\alpha$ yields maximal persona adherence (as measured by a Villainy score) and high compositional coherence relative to baselines.
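A sketch of the mean-difference construction, reusing `backbone` and `tokenizer` from the Section 3 sketch; the exemplar corpora here are hypothetical stand-ins for the neutral and target persona text sets:

```python
import torch

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Final hidden state for one input (backbone/tokenizer from Section 3)."""
    inputs = tokenizer(text, return_tensors="pt")
    return backbone(**inputs).last_hidden_state[0, -1]

def mean_embedding(texts: list[str]) -> torch.Tensor:
    """Average the final hidden states over a set of exemplar texts."""
    return torch.stack([embed(t) for t in texts]).mean(dim=0)

# Hypothetical exemplar corpora for the two styles.
neutral_texts = ["The weather is mild today.", "Please pass the salt."]
villain_texts = ["Soon the whole city will kneel before me.",
                 "Your heroics only delay the inevitable."]

mu_neutral = mean_embedding(neutral_texts)
mu_target = mean_embedding(villain_texts)
v = mu_target - mu_neutral   # zero-shot persona direction: Neutral -> Villain
# Injected at a middle layer as h' = h + alpha * v (see the hook sketch above).
```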
6. Ablation Results and Regularization Effects
Key findings from ablation studies include:
- Freezing at least $20/24$ layers sustains low trait-profiling MSE without loss to core reasoning, while unfreezing further layers increases the risk of knowledge degradation.
- Omitting the orthogonality regularizer $\mathcal{L}_{\perp}$ escalates trait manifold entanglement and measurably raises MSE.
- Layer and strength grid searches identified injection regimes for maximal persona expressivity with minimal off-target degradation.
These studies delineate the hyperparameter regimes required for stable, disentangled persona control and validate the structural separation mandated by the Linear Representation Hypothesis.
7. Mathematical Justification and Safety Considerations
By formulating personality as a linear manifold orthogonal to core reasoning circuits, the Soul Engine avoids destructive model updates. Deterministic latent interventions via vector arithmetic confer precision and reversibility unattainable with stochastic prompt reformulation.
From a safety perspective:
- Malicious behavioral directions, such as the “Dark Triad,” can be identified within the persona manifold and subtracted at runtime (“Safety Interceptor”), preemptively neutralizing harmful predispositions in generated content; a minimal sketch follows this list.
- Latent-level guardrails on the persona manifold are asserted to offer superior robustness and coverage compared to surface-level token or rule-based filtering, as interventions operate directly at the level of semantic intent.
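A hedged sketch of such an interceptor, projecting out a harmful direction at runtime; the direction `v_dark` is hypothetical and would in practice be identified within the learned persona manifold:

```python
import torch

def intercept(h: torch.Tensor, v_dark: torch.Tensor) -> torch.Tensor:
    """Remove the component of the hidden state along a known harmful
    direction (e.g., a hypothetical 'Dark Triad' axis)."""
    u = v_dark / v_dark.norm()                   # unit harmful direction
    coeff = (h * u).sum(dim=-1, keepdim=True)    # projection coefficient
    return h - coeff * u                         # project onto u's orthogonal complement

# Applied per position inside a forward hook (as in Section 3), this
# neutralizes the harmful predisposition before it shapes generated text.
```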
A plausible implication is that treating personality as a geometric latent construct enables both rigorous behavioral guarantees and flexible personalization, challenging the primacy of fine-tuning for secure and expressive agent alignment (Wang, 8 Dec 2025).