Emotion Intensity & Authorship Robustness
- The paper demonstrates that speaker-agnostic emotion vectors enable precise intensity control in TTS while preserving speaker identity.
- It shows that LLM prompt conditioning achieves consistent emotional expression, though paraphrasing can weaken authorship attribution.
- Empirical evaluations reveal robust performance in both modalities, with high naturalness (MOS) and strong F1 scores for emotion classification.
Emotion intensity and authorship robustness are intersecting domains that probe how models and humans encode, control, and attribute affective information across modalities and sources. Advances in speech synthesis, notably parameter-space "emotion vectors," and evaluations of the emotional coherence of large language models (LLMs) together define current understanding in this area. These lines of research address whether high affective expressivity facilitates, inhibits, or confounds accurate authorship recognition, and how state-of-the-art generative systems control intensity without compromising speaker or author identity.
1. Mathematical Formalisms for Emotion Intensity
Controlling emotion intensity in generative models is achieved either by explicit arithmetic in parameter space (for TTS) or through prompt specification (for LLMs).
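In speech synthesis, emotion intensity control is operationalized by scaling an "emotion vector" $\tau$. For a single-speaker baseline, letting $\theta_{\text{neu}}^{A}$ be the parameters of a neutral TTS model for speaker A and $\theta_{\text{emo}}^{A}$ the emotional version (both in $\mathbb{R}^{d}$), the emotion vector is $\tau^{A} = \theta_{\text{emo}}^{A} - \theta_{\text{neu}}^{A}$. Applying emotion transfer to a target speaker B uses
$$\theta_{\text{emo}}^{B} = \theta_{\text{neu}}^{B} + \alpha\,\tau^{A},$$
where $\alpha$ sets intensity. However, this approach is inherently speaker-specific and encodes idiosyncratic timbre (Murata et al., 4 Jul 2025).
To mitigate this, the speaker-agnostic formulation constructs a vector via multi-speaker fine-tuning:
$$\tau^{\text{SA}} = \theta_{\text{emo}}^{\text{multi}} - \theta_{\text{neu}}^{\text{multi}},$$
which enables transfer and intensity control across arbitrary target speakers with minimal identity leakage.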
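The parameter-space arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes model parameters are available as flattened lists keyed by name, and the names `emotion_vector`, `apply_emotion`, and the toy `decoder.w` weights are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened weights.
Params = Dict[str, List[float]]


def emotion_vector(theta_emo: Params, theta_neu: Params) -> Params:
    """Return tau = theta_emo - theta_neu, computed per parameter tensor."""
    return {
        name: [e - n for e, n in zip(theta_emo[name], theta_neu[name])]
        for name in theta_neu
    }


def apply_emotion(theta_neu: Params, tau: Params, alpha: float) -> Params:
    """Return theta_neu + alpha * tau; alpha scales the emotion intensity."""
    return {
        name: [n + alpha * t for n, t in zip(theta_neu[name], tau[name])]
        for name in theta_neu
    }


# Toy two-weight "models"; the values are illustrative only.
theta_neu = {"decoder.w": [0.0, 1.0]}
theta_emo = {"decoder.w": [1.0, 3.0]}

tau = emotion_vector(theta_emo, theta_neu)
half = apply_emotion(theta_neu, tau, alpha=0.5)  # halfway between neutral and emotional
full = apply_emotion(theta_neu, tau, alpha=1.0)  # recovers the emotional parameters
print(tau, half, full)
```

With `alpha = 0` the neutral model is returned unchanged, and with `alpha = 1` the fine-tuned emotional parameters are recovered. In the speaker-agnostic setting, both parameter sets would come from the multi-speaker model rather than from a single speaker.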
In LLM-based text, elicitation of emotional content relies on prompt conditioning, but does not afford scalar intensity control. Categories (e.g., "anger," "trust") span a de facto spectrum of arousal, with high-arousal states providing more pronounced textual cues (Alsadhan, 24 Mar 2026).
2. Model Architectures and Conditioning Schemes
Emotion intensity control and authorship robustness are directly influenced by underlying model architectures and their conditioning methods.
In TTS, the core backbone is Conformer-FastSpeech2 (CFS2), conditioned on 256-dimensional x-vectors extracted from averaged neutral utterances per speaker via a pre-trained SpeechBrain encoder. The initial multi-speaker neutral model ($\theta_{\text{neu}}^{\text{multi}}$) is trained on neutral data from both the ESD and VCTK corpora; no disentanglement losses are used. Fine-tuning for emotion employs only the emotional subset, maintaining x-vector conditioning for identity preservation (Murata et al., 4 Jul 2025).
LLMs, such as GPT-4o, Gemini, DeepSeek, and others, are conditioned via uniform prompts to reflect categories of emotion/personality, requiring models to generate text lacking explicit emotion lexeme mentions. Temperature is set to 1.0 with default nucleus/top-k sampling (Alsadhan, 24 Mar 2026).
3. Resolution of Cross-Speaker and Cross-Author Mismatch
Applying speaker-specific emotion vectors across speakers corrupts target identity: adding speaker A's scaled emotion vector $\alpha\,\tau^{A}$ to speaker B's parameters shifts speaker B's timbre toward speaker A, violating authorship robustness (Murata et al., 4 Jul 2025). The aggregation of multi-speaker emotion data in the speaker-agnostic vector ($\tau^{\text{SA}}$) factors out idiosyncratic characteristics, resulting in a prosody-dominated transformation that preserves speaker identity and generalizes to unseen targets, including zero-shot settings.
For text, authorship attribution remains highly robust when emotional content is explicit; surface stylistic cues dominate classifier performance (F1 > 0.95). However, authorship robustness is fragile to paraphrasing (AI recall falls from 0.95 to 0.34 and F1 to 0.53 after rewording), indicating that classification depends on shallow features. Emotion intensity acts as a double-edged sword: high-intensity (high-arousal) emotions (e.g., "disgust") render AI authors more detectable, whereas moderate or subtle emotion categories confound AI-vs-human classifiers (Alsadhan, 24 Mar 2026).
4. Evaluation Protocols and Metrics
Speech synthesis studies employ three complementary axes: speech quality, identity consistency, and controllability.
- Naturalness is assessed via Mean Opinion Score (MOS) on a 1–5 scale, with 200 judgments per emotion and method.
- Speaker Consistency employs Speaker Encoder Cosine Similarity (SECS) between x-vectors of synthesized neutral and emotional utterances, where higher SECS reflects greater speaker identity preservation.
- Emotion Intensity Controllability is checked via a rearrangement test: listeners rank randomly ordered samples synthesized at varying intensity weights $\alpha$ (0.1, 0.5, 0.9) by perceived intensity, and accuracy is the fraction of correctly ordered sequences (Murata et al., 4 Jul 2025).
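The two objective quantities in the list above can be sketched as follows. This is an illustrative sketch only: the embeddings are assumed to be plain float vectors, and the function names (`cosine_similarity`, `rearrangement_accuracy`, `chance_level`) are hypothetical. The chance computation assumes a listener guessing uniformly among all orderings of the three samples, which is consistent with the 0.17 chance level reported in the results.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """SECS-style score: cosine of the angle between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rearrangement_accuracy(
    responses: List[Sequence[float]], truth: Sequence[float]
) -> float:
    """Fraction of listener orderings that exactly match the true ordering."""
    correct = sum(1 for r in responses if list(r) == list(truth))
    return correct / len(responses)


def chance_level(n_samples: int) -> float:
    """Probability of a correct ordering when guessing uniformly at random."""
    return 1 / math.factorial(n_samples)


# Toy listener responses; each tuple is the order a listener assigned.
truth = (0.1, 0.5, 0.9)
responses = [
    (0.1, 0.5, 0.9),
    (0.1, 0.9, 0.5),
    (0.1, 0.5, 0.9),
    (0.5, 0.1, 0.9),
]
print(rearrangement_accuracy(responses, truth))
print(round(chance_level(3), 2))
```

Only fully correct orderings count here, so a response with two samples swapped scores zero; a partial-credit variant would need a rank correlation instead.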
Text-based evaluations combine BERT-based binary (human vs. AI) and multiclass (emotion or personality categories) classifiers. Five-fold cross-validation provides macro-averaged F1 scores. Secondary linguistic analysis includes LIWC features (tone, authenticity, analytic versus narrative) and established readability indices (Flesch, Gunning Fog, ARI, Dale-Chall) (Alsadhan, 24 Mar 2026).
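For reference, macro-averaged F1 weights every class equally regardless of its frequency. The sketch below computes it from scratch under that standard definition; the function name `macro_f1` and the toy labels are hypothetical, and the cross-validation loop and BERT classifier from the study are not reproduced.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred), key=str)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy human-vs-AI predictions (illustrative only).
y_true = ["human", "human", "ai", "ai"]
y_pred = ["human", "ai", "ai", "ai"]
print(macro_f1(y_true, y_pred))
```

In the toy example, the "human" class has F1 = 2/3 and the "ai" class has F1 = 4/5, so the macro average is their mean.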
5. Empirical Results
Speech Synthesis (TTS)
The speaker-agnostic approach achieves high naturalness and authorship robustness across all transfer scenarios:
| Target Setting | MOS (Proposed) | SECS (Proposed) |
|---|---|---|
| Same-speaker | 3.78–3.83 | 0.85–0.88 |
| Cross-speaker (seen) | 3.69–3.92 | 0.78–0.80 |
| Cross-speaker (unseen) | 3.73–3.85 | 0.79–0.84 |
Emotion intensity controllability yields 0.74 rearrangement accuracy for seen and 0.67 for unseen speakers, well above chance (0.17) (Murata et al., 4 Jul 2025).
Text (LLMs)
| Condition | English F1 | Arabic F1 |
|---|---|---|
| Human vs. AI (orig) | 0.97 | 0.95 |
| No punctuation | 0.92 | 0.93 |
| Paraphrased | 0.53 | – |
Cross-domain emotion classification fails in both directions: a human-trained model tested on AI data, and an AI-trained model tested on human data, both yield low F1. Within-model emotion classification for LLM outputs is highly consistent (e.g., GPT-4o: 0.95), indicating an internal affective coherence that diverges from human norms. High-arousal emotions provide a stronger cross-domain signal (e.g., "disgust," GPT-4o: F1 = 0.75) (Alsadhan, 24 Mar 2026).
6. Linguistic Divergence and Psycholinguistic Characterization
LLM outputs systematically diverge from human writing in multiple dimensions. Authenticity scores exceed 90 for AI versus 40–60 for humans, reflecting an "overly self-disclosing" artificial style. Valence drift is observed in negative emotion outputs: for "anger/fear," human Tone ≈ 0, GPT-4o/DeepSeek Tone 30–90. AI-generated texts are substantially simpler, as shown by Flesch (GPT-4o: 87.6 vs. Human: 72.1) and Gunning Fog (GPT-4o: 3.9 vs. Human: 9.8) indices. Analytic style dominates AI outputs (Gemini/Mistral > 80) relative to human text (≈50). These stylistic cues are critical for current authorship attribution schemes (Alsadhan, 24 Mar 2026).
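The direction of the two readability indices cited above follows from their standard formulas: Flesch Reading Ease rises as sentences and words get shorter, while Gunning Fog falls. The sketch below applies those published formulas to raw counts; the function names and the example counts are hypothetical, and counting syllables or "complex" words (three or more syllables) is left to the caller.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog index: lower scores indicate easier text."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# Hypothetical counts for a short, simple passage.
words, sentences, syllables, complex_words = 100, 10, 130, 5
print(flesch_reading_ease(words, sentences, syllables))
print(gunning_fog(words, sentences, complex_words))
```

Under these definitions, the higher Flesch and lower Gunning Fog values reported for GPT-4o both point the same way: shorter sentences and fewer polysyllabic words than the human texts.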
7. Implications, Limitations, and Directions
Both modalities demonstrate the importance of controlling emotion intensity without degrading authorship robustness. In TTS, parameter-space, speaker-agnostic vectors afford intensity scaling and identity preservation, including zero-shot transfer. In text, authorship classification is highly effective for non-paraphrased samples but degrades under paraphrasing and for subtle emotion categories. High-intensity emotional signals enhance classification separability due to their pronounced stylistic imprint, which may be leveraged or mitigated in future adversarial fine-tuning and data augmentation schemes.
In low-resource languages, augmenting datasets with synthetic AI-generated samples improves affective trait recognition, as shown by a 0.21 gain in F1 for Arabic personality recognition when combining human and AI data (Alsadhan, 24 Mar 2026). A plausible implication is that in domains where task data is scarce, combining real and synthetic sources under careful cross-domain validation can improve robustness.
Surface-based stylistic detectors remain brittle under adversarial transformation. Advanced authorship robustness will require integration of semantic, discourse, and psycholinguistic features for both affective computing and generative model governance.
References
- Speaker-agnostic emotion vector for cross-speaker emotion intensity control (Murata et al., 4 Jul 2025)
- Is AI catching up to human expression? Exploring emotion, personality, authorship, and linguistic style in English and Arabic with six LLMs (Alsadhan, 24 Mar 2026)