Emotion Controllable Speech Synthesis

Updated 29 September 2025
  • Emotion controllable speech synthesis is a method that integrates explicit emotion control into TTS systems using annotated datasets and deep learning architectures.
  • It employs techniques like latent embedding spaces, diffusion models, and hierarchical control to achieve fine-grained modulation of emotional expression.
  • Ongoing research addresses challenges such as data scarcity, disentanglement of features, and balancing naturalness with effective emotion control.

Emotion controllable speech synthesis refers to the design of Text-to-Speech (TTS) systems capable of producing synthetic speech with attributes of emotional expressiveness that are parametrically, explicitly, or implicitly controllable. These systems aim to generate not only intelligible and natural-sounding speech but also speech that can be modulated to express a specific target emotion, nuanced emotion style, or variable emotional intensity. The evolution of this research area has been characterized by the emergence of new datasets, annotation and representation methodologies, integration of emotion control into deep learning architectures, and the development of robust training/fine-tuning protocols that balance expressiveness and speech quality.

1. Foundations: Datasets and Emotional Annotation

An essential prerequisite for emotion controllable speech synthesis is the availability of speech corpora annotated with emotional categories or dimensions. Efforts such as EmoV-DB introduced carefully collected datasets featuring recordings from both male and female actors, spanning languages such as English and French, and covering multiple emotion classes including amusement, anger, disgust, sleepiness, and neutral (Tits, 2019). The dataset design includes sentence-aligned transcriptions, high signal-to-noise ratio, and precise categorical labels, often derived from canonical sources (e.g., CMU-Arctic, SIWIS).

Beyond such focused emotional corpora, the field also relies on general-purpose datasets (e.g., LJ-Speech, LibriSpeech, LibriTTS), which are less expressive but more abundant, and on hybrid approaches that use emotion-specific resources (e.g., RAVDESS, CREMA-D, Berlin EmoDB) for targeted tasks. The scarcity and imbalance of high-quality emotional speech datasets—especially for non-neutral affect—remain acute challenges, motivating data augmentation, transfer learning, and annotation strategies that recycle or augment existing resources.

Automatic annotation techniques have emerged to tackle the lack of reliable fine-grained emotion labels. Here, transfer learning from ASR models plays a central role: a deep network trained for ASR (e.g., based on 15-layer dilated convolutions, WaveNet-style skip connections, and MFCC inputs) is repurposed to extract intermediate embeddings that are then used for emotion recognition with greater accuracy than handcrafted features (e.g., eGeMAPS descriptors extracted with openSMILE) (Tits, 2019). Dimensionality reduction (e.g., UMAP) and latent space modeling allow researchers to interpret the relationship between latent embeddings and acoustic correlates such as mean fundamental frequency (F0_mean), using linear surfaces in latent space as proxies for prosodic feature gradients.
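
A minimal sketch of this workflow in Python, assuming the librosa, umap-learn, and scikit-learn packages; the asr_encoder function below is a hypothetical stand-in for whichever pretrained ASR-style network supplies the intermediate embeddings, not the model used in (Tits, 2019):

    import numpy as np
    import librosa
    import umap                                   # umap-learn package
    from sklearn.linear_model import LinearRegression

    def utterance_features(path, sr=16000, n_mfcc=40):
        """Return MFCC features and mean F0 for one utterance."""
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
        f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)     # NaN where unvoiced
        return mfcc, np.nanmean(f0)

    def asr_encoder(mfcc):
        # Placeholder for a pretrained ASR-style network; here we simply pool
        # over time to obtain a fixed-size utterance embedding.
        return mfcc.mean(axis=1)                                 # (n_mfcc,)

    paths = ["clip_0001.wav", "clip_0002.wav"]    # replace with the corpus file list
    embs, f0_means = [], []
    for p in paths:
        mfcc, f0_mean = utterance_features(p)
        embs.append(asr_encoder(mfcc))
        f0_means.append(f0_mean)
    embs, f0_means = np.stack(embs), np.array(f0_means)

    # 2-D UMAP projection for visual inspection (needs more than a handful of clips)
    proj = umap.UMAP(n_components=2, random_state=0).fit_transform(embs)

    # Linear probe: how well does a plane in the embedding space predict F0_mean?
    probe = LinearRegression().fit(embs, f0_means)
    print("R^2 of linear F0_mean probe:", probe.score(embs, f0_means))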

2. Latent Representations and Emotion Embedding Spaces

Controllable emotional synthesis requires an explicit or implicit mapping from the desired emotion to the latent space of the generative model. Several embedding strategies have been adopted:

  • Categorical Embeddings: Directly encoding emotion categories with one-hot vectors or learnable lookup tables, applied as a global (utterance-level) condition (Lei et al., 2020, Lei et al., 2022).
  • Continuous Emotion Spaces: Using dimensions such as arousal, valence, and dominance (AVD/ADV) (Cho et al., 12 Jun 2024, Liu et al., 15 May 2025) or pleasure, arousal, dominance (PAD) (Zhou et al., 25 Sep 2024), often mapped to spherical or uniform-grid representations that disentangle style (angular coordinates) from intensity (radial coordinate); a minimal coordinate-and-interpolation sketch appears below.
  • Disentangled Latent Spaces: Semi-supervised or adversarial training (e.g., domain-adversarial training with gradient reversal (Kang et al., 2023)) ensures the separation of speaker and emotion features, facilitating zero-shot adaptation and linear control.
  • Ranking-Based and Relative Attribute Methods: Phoneme- or syllable-level ranking functions use margin-based optimization to learn fine-grained descriptors of emotion strength, satisfying inter-class (emotion category) and intra-class (intensity) ranking constraints (Lei et al., 2020, Wang et al., 2023).
  • Mixture and Interpolation Schemes: Interpolating between neutral and emotional clusters (linear or spread-aware) (Um et al., 2019, Oh et al., 2022), or using Mixup augmentations and bucketing schemes to derive continuous or stepwise control over emotion intensity.

These approaches allow synthesis models to modulate global emotion, mirror utterance-level prosody, and effect fine-grained, segmental adjustments to emotional expression.
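
As an illustration of the continuous-space and interpolation strategies above, a minimal NumPy sketch with illustrative names and dimensions rather than any specific paper's parameterization:

    import numpy as np

    def spherical_to_av(style_angle_deg, intensity):
        """Angular coordinate selects the emotion style, radius its intensity.

        Returns a point in a 2-D valence/arousal plane; intensity in [0, 1].
        """
        theta = np.deg2rad(style_angle_deg)
        return intensity * np.array([np.cos(theta), np.sin(theta)])

    def interpolate_embedding(e_neutral, e_emotional, alpha):
        """Linearly interpolate between neutral and emotional cluster centroids.

        alpha = 0 yields purely neutral speech, alpha = 1 the full target emotion.
        """
        return (1.0 - alpha) * e_neutral + alpha * e_emotional

    # A "happy"-style direction at 45 degrees, rendered at 70% intensity
    print(spherical_to_av(45.0, 0.7))

    # Halfway between a neutral centroid and an "angry" centroid (toy 4-D vectors)
    print(interpolate_embedding(np.zeros(4), np.array([0.9, -0.2, 0.4, 0.1]), 0.5))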

3. Deep Learning Architectures and Conditioning Mechanisms

Emotion control is realized in a variety of neural architectures, with representative designs including:

  • Sequence-to-Sequence Acoustic Models: DCTTS, Tacotron, Tacotron2, and their variants serve as the backbone for integrating emotion embeddings into the decoding process (Tits, 2019, Li et al., 2020, Lei et al., 2022). Conditioned generation is often achieved by concatenating or adding emotion-related embeddings to the encoder outputs or variance-adaptation paths; a minimal conditioning sketch follows this list.
  • Diffusion and Flow-Matching Models: Recent works leverage conditional diffusion processes to better model the distribution of natural speech (EmoMix (Tang et al., 2023), ZET-Speech (Kang et al., 2023), EmoCtrl-TTS (Wu et al., 17 Jul 2024)). Here, conditioning on emotion embeddings or continuous arousal/valence trajectories enables both zero-shot and time-varying control.
  • Hierarchical and Multi-Scale Control: Modular frameworks like MsEmoTTS introduce dedicated modules for global (emotion class), utterance-level (prosody), and local (syllable/phoneme) control (Lei et al., 2022). Hierarchical emotion distribution (ED) extractors model emotional intensity at the phoneme, word, and utterance levels and enable direct manipulation during inference (Inoue et al., 17 Dec 2024).
  • Prompt- and LLM-based Frameworks: Instruction-following architectures, such as Emotion-aware LLM-TTS in Emo-DPO (Gao et al., 16 Sep 2024), and prompt-based systems incorporating learned or selected emotional prompts (EmoPro) (Wang et al., 27 Sep 2024), further extend controllability to natural language input or human-like dialogue interfaces.
  • Neural Codec LLMs and Conditional Vocoders: Integration of neural codecs with emotional conditioning—using autoregressive LMs, conditional flow-matching, and emotion mixture encoders—enables both discrete and dimensional (ADV/PAD) control across diverse datasets with semi-supervised annotation strategies (Liu et al., 15 May 2025).
  • Articulatory and Prosody-Informed Modules: The GTR-Voice framework introduces control via glottalization, tenseness, and resonance dimensions, directly manipulating voice quality at the phonetic and production level (Li et al., 15 Jun 2024). Explicit prosody regulators align rhythm and intonation with emotional cues (Qi et al., 27 May 2025).
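
As a concrete illustration of the concatenation-style conditioning mentioned in the first bullet, a minimal PyTorch sketch; the module name and dimensions are hypothetical and not tied to any particular system:

    import torch
    import torch.nn as nn

    class EmotionConditioner(nn.Module):
        """Injects a categorical emotion embedding into text-encoder outputs.

        The emotion vector is broadcast over time, concatenated with each
        encoder frame, and projected back to the decoder's expected width.
        """
        def __init__(self, num_emotions=5, emo_dim=64, enc_dim=256):
            super().__init__()
            self.emo_table = nn.Embedding(num_emotions, emo_dim)   # learnable lookup
            self.proj = nn.Linear(enc_dim + emo_dim, enc_dim)

        def forward(self, encoder_out, emotion_id):
            # encoder_out: (batch, time, enc_dim); emotion_id: (batch,)
            emo = self.emo_table(emotion_id)                        # (batch, emo_dim)
            emo = emo.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
            return self.proj(torch.cat([encoder_out, emo], dim=-1))

    # Toy usage: batch of 2 utterances, 100 encoder frames each
    cond = EmotionConditioner()
    enc = torch.randn(2, 100, 256)
    out = cond(enc, torch.tensor([1, 3]))        # e.g., "anger", "sleepiness"
    print(out.shape)                             # torch.Size([2, 100, 256])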

4. Training Strategies, Fine-Tuning, and Evaluation

Robust emotion-controlled synthesis leverages:

  • Transfer Learning and Fine-Tuning: Neutral TTS models are pre-trained on large-scale, neutral or weakly labeled datasets and then fine-tuned on smaller, emotion-rich corpora. A two-step fine-tuning regime (neutral subset, then specific emotion classes) preserves intelligibility while augmenting expressive power (Tits, 2019).
  • Domain Adaptation and Semi-Supervised Learning: Cross-domain models employ MMD-based regularization to align feature spaces across SER and TTS datasets, increasing the reliability of soft emotion labels and enabling joint optimization with auxiliary emotion classification losses (Cai et al., 2020, Oh et al., 2022, Liu et al., 15 May 2025).
  • Direct Preference Optimization (DPO): Pairwise training on preferred vs. non-preferred emotional outputs using sigmoid-transformed likelihood ratios, with regularization via Jensen-Shannon divergence, improves the model's discriminative capacity for subtle emotional variations (Gao et al., 16 Sep 2024); a generic loss sketch follows this list.
  • Prosody and Acoustic Supervision: Multi-objective losses incorporate mel-spectrogram reconstruction, pitch and duration control, speaker similarity (e.g., via F₀ constraints), naturalness (MOS, DNSMOS), and emotion classification/recognition feedback.
  • Prompt Selection and Model-in-the-Loop Evaluation: Two-stage prompt selection strategies—involving static acoustic/textual filtering and dynamic semantic relevance—systematically select high-quality emotional prompts that maximize emotional expressiveness, text-emotion alignment, and technical fidelity in synthesized speech (Wang et al., 27 Sep 2024).
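
For reference, a sketch of the generic DPO objective over preferred versus non-preferred outputs; Emo-DPO (Gao et al., 16 Sep 2024) adds its own regularization on top, so this should be read as the standard form only, with illustrative tensor names:

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
        """Generic DPO loss over sequence log-likelihoods.

        logp_*     : log-probabilities of preferred / rejected outputs under the
                     model being trained (summed over tokens), shape (batch,)
        ref_logp_* : the same quantities under a frozen reference model
        beta       : temperature controlling deviation from the reference model
        """
        pref_ratio = logp_pref - ref_logp_pref
        rej_ratio = logp_rej - ref_logp_rej
        return -F.logsigmoid(beta * (pref_ratio - rej_ratio)).mean()

    # Toy usage with fake log-likelihoods for a batch of 4 preference pairs
    lp_w = torch.tensor([-10.0, -12.0, -9.0, -11.0])
    lp_l = torch.tensor([-13.0, -12.5, -14.0, -11.5])
    rp_w = torch.tensor([-11.0, -12.0, -10.0, -11.0])
    rp_l = torch.tensor([-12.0, -12.0, -13.0, -11.0])
    print(dpo_loss(lp_w, lp_l, rp_w, rp_l))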

Objective and subjective evaluation metrics include intelligibility (WER), naturalness (MOS/nMOS), emotional expressiveness (emotion classifier accuracy, SMOS), prosodic correlation (pitch/energy RMSD, F0 contour analysis), and testable controllability (listener ranking, transition detection, correlation with target ADV/PAD values).
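
Two of the objective metrics above can be sketched as follows, assuming the jiwer and librosa packages; MOS-style scores require listener ratings or pretrained predictors and are omitted here:

    import numpy as np
    import librosa
    import jiwer

    def word_error_rate(reference_text, asr_transcript):
        """Intelligibility proxy: WER between the reference text and ASR output."""
        return jiwer.wer(reference_text, asr_transcript)

    def f0_rmsd(ref_wav, syn_wav, sr=16000):
        """Prosodic proxy: RMS deviation between voiced F0 contours."""
        f0_ref, _, _ = librosa.pyin(ref_wav, fmin=60, fmax=400, sr=sr)
        f0_syn, _, _ = librosa.pyin(syn_wav, fmin=60, fmax=400, sr=sr)
        n = min(len(f0_ref), len(f0_syn))
        mask = ~np.isnan(f0_ref[:n]) & ~np.isnan(f0_syn[:n])   # voiced frames only
        return np.sqrt(np.mean((f0_ref[:n][mask] - f0_syn[:n][mask]) ** 2))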

5. Fine-Grained, Multi-Scale, and Time-Varying Emotional Control

Advancements go beyond global emotion assignment to enable segmental and trajectory-based emotion specification:

  • Phoneme/Syllable-Level Control: Phoneme-aligned ranking functions or normalized descriptors provide granular modulation of emotion strength at the sub-utterance level and can be learned through relative ranking or directly predicted from text (Lei et al., 2020, Lei et al., 2022, Wang et al., 2023).
  • Utterance-Level Prosody Variation: Dedicated modules or predictors capture the distribution of emotional prosody (intonation, F0 movement) across sentences, aligning intra-utterance dynamics with affective goals (Lei et al., 2022).
  • Continuous, Disentangled Emotional Intensity: Uniform grid geometries or spherical coordinates in latent embedding spaces permit smooth interpolation and accurate intensity control, resolving entanglement with speaker and linguistic factors (Oh et al., 2022, Cho et al., 12 Jun 2024).
  • Dynamic and Nonverbal Vocalization Embeddings: Frame-wise control of emotional attributes (arousal, valence) and nonverbal cues (e.g., laughter, crying) enables synthesis of speech with rapidly shifting or expressive, multimodal affect (Wu et al., 17 Jul 2024); a trajectory sketch follows this list.
  • Mixed Emotion Synthesis and Manual Intensity Scaling: Diffusion-based methodologies, such as in EmoMix, combine condition signals from multiple emotions at runtime for nuanced, blended affective output (Tang et al., 2023). Scalar intensity scaling and interpolation in the emotion embedding vector yield perceptible gradations in emotional strength (Li et al., 2020, Lei et al., 2022).
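
A minimal sketch of frame-level (time-varying) conditioning, as referenced in the dynamic-control bullet above: a sparse arousal/valence trajectory is interpolated to the frame rate before being passed to a model's conditioning pathway; the function name and shapes are illustrative:

    import numpy as np

    def frame_trajectory(anchors, n_frames):
        """Interpolate sparse (time_fraction, arousal, valence) anchors to frames.

        anchors : list of (t, a, v) tuples with t in [0, 1]
        returns : (n_frames, 2) array of per-frame arousal/valence conditioning
        """
        anchors = sorted(anchors)
        t = np.array([x[0] for x in anchors])
        a = np.array([x[1] for x in anchors])
        v = np.array([x[2] for x in anchors])
        grid = np.linspace(0.0, 1.0, n_frames)
        return np.stack([np.interp(grid, t, a), np.interp(grid, t, v)], axis=1)

    # Example: start calm/neutral, ramp to high-arousal negative affect mid-way,
    # then relax toward the end of the utterance (e.g., 200 mel frames).
    traj = frame_trajectory([(0.0, 0.2, 0.0), (0.5, 0.9, -0.6), (1.0, 0.4, -0.1)], 200)
    print(traj.shape)          # (200, 2)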

6. Limitations, Challenges, and Future Prospects

Principal ongoing challenges include:

  • Data Scarcity and Imbalance: High-quality, balanced emotional speech corpora for all affect categories are rare, leading to overfitting and poor coverage of atypical or blended emotions (Tits, 2019, Liu et al., 15 May 2025).
  • Disentanglement and Annotation Reliability: Emotional, speaker, and linguistic attributes tend to entangle in latent spaces, especially in small or weakly labeled datasets. Automatic and semi-supervised annotations (from SER classifiers, pseudo-labeling) introduce noise and uncertainty, making reliable emotion labeling challenging (Cai et al., 2020, Oh et al., 2022, Wu et al., 17 Jul 2024).
  • Naturalness–Controllability Trade-off: Enhanced emotion control may degrade core speech quality (e.g., intelligibility, naturalness), and parameter choices (e.g., strength of classifier guidance in diffusion models, balance parameters in DPO) require careful tuning (Kang et al., 2023, Wu et al., 17 Jul 2024).
  • Granularity and Real-time Control: Most systems operate at the sentence or word level; ongoing work explores real-time, streaming, and phoneme-level manipulation for conversational and interactive use (Inoue et al., 17 Dec 2024, Qi et al., 27 May 2025).

Research directions outlined in the literature include integration with VAE-based systems for separation of style, speaker, and emotion (Tits, 2019), multi-speaker and cross-lingual generalization (Um et al., 2019, Kang et al., 2023), fully end-to-end joint training for emotion prediction (Lei et al., 2022), and extension to other expressive modalities such as emotional voice conversion (Qi et al., 27 May 2025).

7. Implications and Outlook

Emotion controllable speech synthesis is now characterized by a diverse methodological landscape: from modular embedding architectures and hierarchical emotion distribution modeling to diffusion-based, zero-shot, and prompt-driven frameworks. These systems enable real-world applications ranging from personalized voice assistants, emotionally engaging media production, and educational tools to assistive and therapeutic technologies.

The field is progressing toward emotion-adaptive interfaces that not only generate emotionally expressive speech but also support user- or context-driven adjustment of affective nuance, granularity, and trajectory. The application of interpretable vector spaces (e.g., spherical AVD, ADV) and semi-supervised training regimes promises scalable and principled control, while leveraging foundation models (LLM-TTS, large-scale SER, self-supervised pretraining) is broadening access to nuanced, context-aware emotional speech generation. These developments lay the groundwork for expressive, human-like, and communicatively effective artificial speech systems.
