- The paper demonstrates that LLMs encode emotion early in the network, with probe accuracy peaking at mid-to-late layers and reflecting a structured internal representation of emotion.
- It employs a novel 400k-utterance Reddit corpus and layer-wise MLP probes to map latent emotion clusters, revealing clear separability among emotion categories.
- Findings show that emotional tone is steerable via simple system prompts and that emotional signals persist in token-level activations for hundreds of tokens, offering actionable insights for AI interpretability and safety.
Systematic Analysis of Emotional Representation in LLMs
Introduction
This paper presents a comprehensive investigation into the internal mechanisms by which LLMs represent, retain, and express emotion. The authors address a critical gap in affective computing: while LLMs are known to simulate emotional intelligence, the structure and dynamics of their latent emotional representations remain poorly understood. By constructing a large-scale, emotion-balanced Reddit corpus and employing layer-wise probing techniques on Qwen3 and LLaMA models, the paper elucidates the geometry, emergence, malleability, and persistence of emotional signals within transformer architectures.
Dataset Construction and Methodology
The authors introduce a novel dataset of approximately 400,000 utterances, balanced across seven emotion categories (six Ekman emotions plus neutral). The dataset is generated through a multi-stage pipeline: initial classification of Reddit comments, rewriting neutral utterances to infuse target emotions, and synthetic generation of prototypical examples. This approach yields a corpus with high diversity in length and style, suitable for probing the emotional capabilities of LLMs.
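The paper describes this pipeline in prose rather than code. As a rough illustration, the sketch below shows how such a three-stage balancing pipeline might be organized; the callables `classify_emotion`, `rewrite_with_emotion`, and `generate_prototype` are hypothetical placeholders, not the authors' implementation.

```python
import random
from collections import defaultdict

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]
TARGET_PER_CLASS = 400_000 // len(EMOTIONS)  # ~57k utterances per category

def build_corpus(raw_comments, classify_emotion, rewrite_with_emotion, generate_prototype):
    """Hypothetical three-stage pipeline: classify, rewrite, synthesize."""
    buckets = defaultdict(list)

    # Stage 1: label raw Reddit comments with an emotion classifier.
    for text in raw_comments:
        buckets[classify_emotion(text)].append(text)

    # Stage 2: rewrite surplus neutral utterances to fill scarce emotion classes.
    for emotion in EMOTIONS:
        if emotion == "neutral":
            continue
        while (len(buckets[emotion]) < TARGET_PER_CLASS
               and len(buckets["neutral"]) > TARGET_PER_CLASS):
            seed = buckets["neutral"].pop(random.randrange(len(buckets["neutral"])))
            buckets[emotion].append(rewrite_with_emotion(seed, emotion))

    # Stage 3: top up any remaining gaps with synthetic prototypical examples.
    for emotion in EMOTIONS:
        while len(buckets[emotion]) < TARGET_PER_CLASS:
            buckets[emotion].append(generate_prototype(emotion))

    return {e: b[:TARGET_PER_CLASS] for e, b in buckets.items()}
```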
For model analysis, the paper employs lightweight, supervised probes—two-layer MLP classifiers—attached to the hidden states of frozen LLMs at various depths. This methodology enables direct readout of the emotional information encoded in the activations, without fine-tuning the base model. The probes are trained and evaluated on balanced splits of the dataset, with careful handling of class imbalance via oversampling and undersampling strategies.
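A minimal sketch of such a probe in PyTorch, assuming hidden states are pre-extracted and pooled to one vector per utterance; the hidden width, dropout rate, and optimizer settings here are illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn

class EmotionProbe(nn.Module):
    """Two-layer MLP read-out over frozen LLM hidden states."""
    def __init__(self, d_model: int, n_classes: int = 7, d_hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(d_hidden, n_classes),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_model) pooled hidden state from one layer
        return self.net(h)

probe = EmotionProbe(d_model=3072)  # e.g. LLaMA 3.2-3B hidden size (assumed)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(hidden_batch: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step over cached activations; the base LLM stays frozen."""
    opt.zero_grad()
    loss = loss_fn(probe(hidden_batch), labels)
    loss.backward()
    opt.step()
    return loss.item()
```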
Layer-Wise Emotional Signal Emergence
A central finding is that emotional representations in LLMs are not confined to the final layer. Instead, the emotional signal emerges early and peaks in the middle layers of the transformer. Probing accuracy at the input embedding layer is at chance, but rises sharply in the first quarter of the network and saturates at mid-to-late layers. For example, in Qwen3-4B and LLaMA 3.2-3B, peak probe accuracy is observed at 75% and 50% depth, respectively, with a slight decline at the final layer. This suggests that the network constructs high-level semantic abstractions, including emotion, in its intermediate representations, while the final layers are more task-specific and less emotionally distinct.
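A sketch of how per-layer activations can be extracted with Hugging Face `transformers` for this kind of depth sweep; mean pooling over tokens is an assumption, as the paper's pooling strategy is not specified here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def layer_activations(text: str) -> list[torch.Tensor]:
    """Return one mean-pooled vector per layer (embeddings plus each block)."""
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (1, seq_len, d_model), length n_layers + 1
    return [h.mean(dim=1).squeeze(0).float() for h in out.hidden_states]

# Training a separate probe on each element of this list, then plotting
# accuracy against depth, traces the emergence curve described above:
# chance at the embedding layer, a sharp rise, and a mid-to-late peak.
```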
Internal Geometry and Separability of Emotion
Visualization of probe outputs via PCA and KDE reveals that LLMs develop a well-defined internal geometry for emotion. Larger models exhibit tighter, more separable clusters for each emotion category. The spatial arrangement of these clusters reflects semantic relationships: anger and disgust are nearly inseparable, joy and surprise form a positive group, and fear and sadness cluster as downcast emotions, with neutral at the center. This structure aligns with psychological models and demonstrates that LLMs encode emotion along meaningful dimensions in their latent space.
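A sketch of the visualization step with scikit-learn and seaborn, assuming activations are stacked in an array `X` with integer labels `y`; whether the authors project probe features or raw hidden states is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

def plot_emotion_geometry(X: np.ndarray, y: np.ndarray, names: list[str]) -> None:
    """Project activations to 2D with PCA, then draw per-emotion KDE contours."""
    coords = PCA(n_components=2).fit_transform(X)
    palette = sns.color_palette("tab10", n_colors=len(names))
    for i, name in enumerate(names):
        pts = coords[y == i]
        sns.kdeplot(x=pts[:, 0], y=pts[:, 1], levels=4, color=palette[i])
        plt.scatter([], [], color=palette[i], label=name)  # legend proxy
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```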
Malleability and Prompt-Based Control
The paper demonstrates that the internal emotional state of LLMs is highly malleable. Simple system prompts (e.g., "You are very emotional" or "You always remain calm and composed") can significantly shift the model's expressed emotional tone. Under default conditions, models tend to suppress negative emotions and favor neutral or positive responses. Emotional prompts increase the recall for sadness but decrease precision, indicating a tendency to over-apply sympathetic tones. The "calm" prompt has minimal effect, reinforcing the default professional posture. These results highlight the potential for prompt-based control of emotional expression, but also expose asymmetries in the generative policy for different emotions.
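A sketch of how these prompt conditions might be applied via a chat template; the system-prompt strings come from the paper's examples, while the checkpoint name and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

CONDITIONS = {
    "default": None,
    "emotional": "You are very emotional",
    "calm": "You always remain calm and composed",
}

@torch.no_grad()
def generate_reply(user_text: str, condition: str) -> str:
    messages = []
    if CONDITIONS[condition] is not None:
        messages.append({"role": "system", "content": CONDITIONS[condition]})
    messages.append({"role": "user", "content": user_text})
    ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                  return_tensors="pt")
    out = model.generate(ids, max_new_tokens=128)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# Labeling replies under each condition (e.g. with the trained probe) and
# comparing per-emotion precision and recall quantifies the induced shift.
```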
Temporal Persistence of Emotional Tone
Analysis of token-level hidden states in generated replies reveals that the initial emotional stimulus from the user's input remains detectable in the model's activations for hundreds of subsequent tokens. Negative emotions such as anger and fear exhibit the longest persistence, while positive emotions like joy and surprise decay rapidly. This asymmetry reflects the models' tendency to maintain a calming or explanatory tone in response to negative input, while quickly reverting to neutrality after positive input. The persistence curves quantify the "half-life" of emotional signals in LLMs and provide insight into the dynamics of affective state propagation.
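A sketch of how such persistence curves might be computed, assuming a trained probe and greedy decoding; the probed layer and the half-life definition (tokens from the peak until probability falls below half the peak) are illustrative choices, not the paper's.

```python
import torch

@torch.no_grad()
def persistence_curve(model, tok, probe, prompt: str, emotion_idx: int,
                      layer: int = -8, max_new_tokens: int = 512) -> list[float]:
    """Probe each generated token's hidden state for the stimulus emotion."""
    ids = tok(prompt, return_tensors="pt").input_ids
    probs = []
    for _ in range(max_new_tokens):
        # Full re-forward per step for clarity; a KV cache would be faster.
        out = model(ids, output_hidden_states=True)
        h = out.hidden_states[layer][:, -1, :].float()  # last token, chosen layer
        p = torch.softmax(probe(h), dim=-1)[0, emotion_idx].item()
        probs.append(p)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decode
        ids = torch.cat([ids, next_id], dim=-1)
    return probs

def half_life(probs: list[float], frac: float = 0.5) -> int:
    """Tokens from the peak until probability first falls below frac * peak."""
    t_peak = max(range(len(probs)), key=probs.__getitem__)
    for t in range(t_peak, len(probs)):
        if probs[t] < frac * probs[t_peak]:
            return t - t_peak
    return len(probs) - t_peak
```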
Probe Performance and Efficiency
Probe-based emotion classification achieves high accuracy, consistently outperforming zero-shot prompting. The performance gap between probe and zero-shot classification narrows as model size increases, indicating that larger models' generative outputs more fully reflect their latent emotional representations. Qwen models show higher zero-shot coverage than LLaMA models, likely due to dataset filtering biases. All experiments are conducted with efficient resource utilization (mixed-precision inference and training on RTX 5090 GPUs), and the open-source toolkit enables scalable analysis across model families and sizes.
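The exact zero-shot prompt is not given here, so the wording below is an assumption; the sketch only shows the general shape of such a baseline, which asks the model directly for a label rather than reading its activations.

```python
ZERO_SHOT_TEMPLATE = (
    "Classify the emotion of the following utterance as one of: "
    "anger, disgust, fear, joy, sadness, surprise, neutral.\n"
    "Utterance: {text}\nEmotion:"
)

def zero_shot_label(model, tok, text: str) -> str:
    """Generate a label token directly instead of probing hidden states."""
    ids = tok(ZERO_SHOT_TEMPLATE.format(text=text), return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=3)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True).strip().lower()
```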
Implications and Future Directions
The findings have significant implications for interpretability, safety, and alignment in AI systems. The clear separability and malleability of emotional states suggest potential for transparent, post-hoc safety mechanisms, but also raise concerns about manipulation and misuse. The paper's limitations include reliance on English Reddit-style text, use of the simplified Ekman taxonomy, potential classifier circularity, and focus on single-turn, text-only prompts. Future work should address cross-lingual robustness, more nuanced emotion taxonomies, multi-turn and multimodal interactions, and the development of real-time "emotion governors" for dynamic control of affective output.
Conclusion
This paper provides a rigorous, large-scale analysis of how LLMs encode, retain, and express emotion. By releasing a 400k utterance corpus and an open-source probing toolkit, the authors establish a foundation for future research in affective computing, interpretability, and AI alignment. The demonstration that emotion emerges early, peaks mid-network, and remains steerable and persistent across tokens advances our understanding of the internal mechanisms underlying emotional intelligence in LLMs.