EmoSpeech: Emotion Recognition & Synthesis

Updated 7 March 2026

EmoSpeech refers to the research area of emotional speech processing, covering models, annotated datasets, and fusion techniques for recognition, synthesis, and editing.
Its systems include both multimodal and speech-only architectures, with reported emotion recognition scores of up to 94% F1.
Emotion-controllable TTS models leverage fine-grained conditioning, spherical embeddings, and adversarial strategies to enhance expressiveness and naturalness.

EmoSpeech encompasses a range of models, techniques, and resources designed for the recognition, analysis, synthesis, and editing of emotional speech. This domain spans speech emotion recognition (SER), emotion-controllable text-to-speech (TTS), emotional speech data resources, and cross-lingual emotion transfer pipelines, with technical developments targeting nuanced expressiveness, controllability, and robustness across diverse environments, languages, and speakers.

1. Emotional Speech Data Resources and Annotation

High-quality annotated datasets are foundational for both SER and emotional TTS. Early emotional speech databases often employed discrete labels (e.g., "happy," "sad"), which constrained the expressive variability and granularity achievable in synthesis and recognition. Recent efforts have shifted towards datasets with continuous or detailed natural language emotion annotations, aiming to provide richer training signals for modeling subtle emotional states (Bian et al., 2024).

Notable corpora and their labeling protocols include:

Indian EmoSpeech Command Dataset: ~8,000 utterances from 250 Indian-accent speakers collected "in the wild" with 4 emotion classes (calm, fearful, angry, happy) and diverse real-world noise, supporting both verbal commands and non-verbal acoustic backgrounds (Banga et al., 2019).
EMOVIE: Mandarin emotional speech with 9,724 utterances rated on a five-level emotion polarity scale ([–1.0, ... +1.0]) by multiple annotators, capturing nuanced affect (Cui et al., 2021).
EmoVoice-DB: ~120 hours (English) + 80 hours (Mandarin) with both categorical and free-form text emotion descriptions attached to each utterance, covering 8 emotion classes and designed for open-ended natural language emotion prompting (Yang et al., 17 Apr 2025).
EmoSphere-TTS/EmoSphere++: Use continuous arousal, valence, dominance (AVD or VAD) pseudo-labels derived from pretrained speech emotion recognizers to enable fine-grained, continuous control over emotion style and intensity without human annotation (Cho et al., 2024, Cho et al., 2024).

The movement towards natural-language and continuous or spherical emotion annotation broadens the representational space for emotion-controllable models while reducing manual annotation costs. Large-scale, multi-speaker, multi-language coverage in turn supports more robust and generalizable systems (Yang et al., 17 Apr 2025, Bian et al., 2024).

2. Emotion Recognition in Speech: Architectures and Performance

SER models process raw or feature-extracted audio to determine the underlying emotional state. Architectures range across convolutional, recurrent, and transformer-based frameworks and increasingly leverage multimodal fusion (audio + text).

Representative systems:

EmoTech employs parallel BiLSTM and CNN pathways for both MFCC-based audio features and token-embedded text, concatenated for final classification. On IEMOCAP (5 emotions), this approach achieves 83.5% overall accuracy, exceeding prior multimodal baselines by 8 points (Avro et al., 22 Jan 2025).
Qieemo demonstrates that a speech-only system built on a pretrained ASR (Conformer) backbone, with specialized multimodal fusion (MMF) and cross-modal attention (CMA) modules, can outperform both unimodal and text+audio models for emotion recognition in conversation (ERC) on IEMOCAP (weighted accuracy WA=76.4%, unweighted accuracy UA=77.7%). This illustrates the effectiveness of ASR-derived, frame-aligned representations for emotion extraction (Chen et al., 5 Mar 2025).
BSC-UPC at EmoSPeech-IberLEF2024 applies attention pooling over concatenated SSL speech (XLSR-wav2vec 2.0) and text (XLM-RoBERTa) embeddings, achieving 86.7% Macro F1 on Spanish MEACorpus (6 emotions), with ablations indicating the benefit of this modality fusion and of multilingual pretraining (Casals-Salvador et al., 2024).
EmoAra utilizes a CNN-based SER module as the first step in a cross-lingual pipeline, attaining 94% F1 across 8 emotions on RAVDESS (Hassan et al., 1 Feb 2026).

Empirical trends indicate that effective use of SSL representations and advanced fusion/attention mechanisms are critical for high accuracy and cross-domain generalization. Multimodal approaches remain strong, but speech-only pipelines that leverage strong pretraining are now competitive with, or exceed, multimodal setups (Chen et al., 5 Mar 2025, Casals-Salvador et al., 2024).
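The attention-pooling fusion mentioned above can be illustrated with a minimal sketch. It assumes that speech-frame and text-token embeddings have already been projected to a common dimension and are concatenated along the sequence axis before pooling; the function names and the single query vector are hypothetical and are not taken from any of the cited systems.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(sequence: Sequence[Vector], query: Vector) -> Vector:
    """Weight each vector by its scaled dot product with `query`, then sum."""
    scale = math.sqrt(len(query))
    scores = [sum(x * q for x, q in zip(vec, query)) / scale for vec in sequence]
    weights = softmax(scores)
    dim = len(sequence[0])
    return [sum(w * vec[d] for w, vec in zip(weights, sequence)) for d in range(dim)]


def pool_fused(speech_frames: Sequence[Vector],
               text_tokens: Sequence[Vector],
               query: Vector) -> Vector:
    """Concatenate both modalities along the sequence axis, then pool.

    Assumption: both inputs share the same embedding dimension. In a real
    model `query` would be a learned parameter and the pooled vector would
    feed a classification head.
    """
    fused = list(speech_frames) + list(text_tokens)
    return attention_pool(fused, query)
```

With an all-zero query the weights are uniform and the result reduces to mean pooling; a trained query instead emphasizes the frames or tokens most informative for emotion.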

3. Emotion-Controllable Text-to-Speech: Modeling Strategies

Emotional TTS models produce speech whose prosody, timbre, and rhythm are modulated to reflect specific emotions—ranging from categorical labels to continuous affective states or even free-form natural language prompts.

Architectural and conditioning strategies include:

FastSpeech2-based EmoSpeech: Augments FS2 with eGeMAPS predictors for low-level prosodic cues, Conditional Layer Normalization (CLN), Conditional Cross-Attention (CCA) for phoneme-level intensity modulation, and adversarial fine-tuning with joint conditional–unconditional (JCU) discriminators, yielding substantial gains in MOS and emotion recognition accuracy (MOS=4.37, emotion recognition=0.85 on ESD) relative to plain FS2 (Diatlova et al., 2023).
EmoSphere-TTS and EmoSphere++: Introduce spherical (polar) emotion embeddings, where arousal, valence, and dominance (AVD/VAD) pseudo-labels from a pretrained SER backbone are centered and mapped into spherical coordinates. This enables explicit disentanglement of emotion intensity (radius) and style (angles), with continuous and interpretable control for both seen and unseen speakers. The multi-level style encoder and orthogonality loss in EmoSphere++ further regularize zero-shot generalization. Reported results are MOS 3.9–3.98 and emotion classification accuracy of 93.5–94.6% on the ESD test set (Cho et al., 2024, Cho et al., 2024).
LLM-based approaches (EmoVoice, Emo-DPO): Accept natural-language emotion prompts, decomposed into dense embeddings by LLMs using chain-of-thought (CoT) reasoning, and employ two-stage (phoneme → audio) or parallel (phoneme-boost) generation regimes. Preference optimization objectives (DPO) in LLM-TTS architectures (Emo-DPO) push the model to capture nuanced emotional distinctions, with reported MOS and preference ratings above prior strong baselines (e.g., Emotion SIM: 98.87%, MOS: +0.3–0.5 over EmoSpeech) (Yang et al., 17 Apr 2025, Gao et al., 2024).
EME-TTS: Explicitly models the interaction between emotion and local emphasis, introducing emphasis perception enhancement (EPE) blocks to maintain target emphasis under different emotions. Listener studies confirm improved expressiveness and stable emphasis perception, especially when emphasis positions are guided by LLM-predicted cues (Li et al., 16 Jul 2025).
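The spherical embedding idea described for EmoSphere-TTS can be sketched as a plain coordinate transform. The sketch below is illustrative only: the axis ordering (arousal, valence, dominance mapped to x, y, z), the choice of neutral center, and all function names are assumptions, and the radius normalization or angle quantization a real system might apply are omitted.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def to_spherical(avd: Vec3, neutral_center: Vec3) -> Tuple[float, float, float]:
    """Center an (arousal, valence, dominance) point on the neutral emotion
    and return (r, theta, phi): r is intensity, the angles encode style."""
    x, y, z = (v - c for v, c in zip(avd, neutral_center))
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # exactly neutral: angles are undefined, use 0
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle in [0, pi]
    phi = math.atan2(y, x)                         # azimuth in (-pi, pi]
    return r, theta, phi


def from_spherical(r: float, theta: float, phi: float, neutral_center: Vec3) -> Vec3:
    """Inverse transform back to the original AVD coordinate system."""
    cx, cy, cz = neutral_center
    return (cx + r * math.sin(theta) * math.cos(phi),
            cy + r * math.sin(theta) * math.sin(phi),
            cz + r * math.cos(theta))


def scale_intensity(avd: Vec3, neutral_center: Vec3, factor: float) -> Vec3:
    """Change emotion intensity (radius) while keeping style (angles) fixed."""
    r, theta, phi = to_spherical(avd, neutral_center)
    return from_spherical(r * factor, theta, phi, neutral_center)
```

Because intensity and style live in separate coordinates, scaling the radius moves a point along the ray from the neutral center without changing its direction, which is the kind of disentangled control the spherical representation is meant to provide.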

A key finding is that models incorporating fine-grained, interpretable conditioning (e.g., token-level intensity, spherical vectors, natural language reasoning) and leveraging explicit adversarial or preference-based optimization generally achieve higher expressiveness, recognizability, and controllability (Diatlova et al., 2023, Cho et al., 2024, Yang et al., 17 Apr 2025, Gao et al., 2024).
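To make the preference-based optimization concrete, the following sketch computes the standard DPO loss for a single preference pair. It is not necessarily the exact objective used in Emo-DPO, which may include additional terms; the argument names and the default value of beta are illustrative assumptions. Each argument is the summed log-probability of an output sequence (for example, speech tokens rendered with the target emotion versus a contrasting emotion) under the trainable policy or the frozen reference model.

```python
import math


def dpo_loss(policy_logp_preferred: float,
             policy_logp_rejected: float,
             ref_logp_preferred: float,
             ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy favors the preferred output than the reference model does.
    """
    margin = ((policy_logp_preferred - ref_logp_preferred)
              - (policy_logp_rejected - ref_logp_rejected))
    z = -beta * margin
    # -log(sigmoid(-z)) equals softplus(z); this form avoids overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss equals log 2 when the policy and reference agree, and decreases as the policy assigns relatively more probability to the preferred emotional rendering.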

4. Text-Based Emotional Speech Editing and Cross-Lingual Emotion Transfer

Emotion-selectable speech editing enables post hoc specification or alteration of emotion in existing speech recordings, with implications for personalized dialogue, audio post-production, and accessibility.

Emo-CampNet extends the CampNet speech editing architecture with explicit emotion conditioning and a neutral content generator trained adversarially to strip source emotion, supporting one-shot editing of unseen speakers. Global and local emotion can be controlled, and data augmentations (e.g., F₀ perturbation) enhance generalization. Evaluation shows that both objective (e.g., MCD) and subjective metrics (MOS, ABX) improve significantly with the full framework (SER accuracy up to 76% for edited regions) (Wang et al., 2022).
EmoAra operationalizes end-to-end emotion-preserving cross-lingual communication: English audio is processed via SER (CNN-based), transcribed, translated to Arabic, and synthesized with emotion-conditioned TTS so that the source emotion carries over to the target speech. Reported results are SER F1=94% and BLEU 56 for MT, with MOS evaluation left to future investigation (Hassan et al., 1 Feb 2026).

Such systems rely on high-performing SER modules, robust separation and injection of emotion tokens or embeddings, and effective augmentation/regulation strategies to ensure naturalness and emotional fidelity in both monolingual and cross-lingual settings.
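The cascaded structure of such a cross-lingual pipeline can be sketched as a thin orchestration layer. The stage callables, class names, and result fields below are hypothetical placeholders; each stage would be backed by an actual SER, ASR, MT, or TTS model, none of which is implemented here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    emotion: str       # label predicted from the source audio
    transcript: str    # source-language text
    translation: str   # target-language text
    audio: bytes       # synthesized target-language speech


@dataclass
class EmotionPreservingPipeline:
    """Cascade SER -> ASR -> MT -> emotion-conditioned TTS (all injected)."""
    recognize_emotion: Callable[[bytes], str]
    transcribe: Callable[[bytes], str]
    translate: Callable[[str], str]
    synthesize: Callable[[str, str], bytes]  # (text, emotion) -> audio

    def run(self, source_audio: bytes) -> PipelineResult:
        # Emotion is read from the audio itself, since text alone may lose it.
        emotion = self.recognize_emotion(source_audio)
        transcript = self.transcribe(source_audio)
        translation = self.translate(transcript)
        # The recognized label is re-injected as the TTS conditioning signal.
        audio = self.synthesize(translation, emotion)
        return PipelineResult(emotion, transcript, translation, audio)
```

Injecting the stages as callables keeps the emotion label as an explicit value passed between components, which makes it straightforward to swap in a different recognizer or synthesizer and to test the orchestration with stubs.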

5. Evaluation Metrics, Controllability, and Model Analysis

Quantitative and subjective metrics are critical for benchmarking and guiding advances in EmoSpeech systems. Widely used criteria include:

Objective: Emotion classification accuracy (ECA), word error rate (WER), Mel-Cepstral Distortion (MCD), embedding cosine similarity (for speaker and emotion), SVAS and EECS for spherical vector similarity, pitch RMSE, voicing F1, and equal error rate (EER) in speaker verification (Cho et al., 2024, Cho et al., 2024, Diatlova et al., 2023).
Subjective: Mean opinion scores (MOS) for naturalness and emotional expressiveness, listener judgements on emotion recognizability, ABX preference tests, and fine-grained ranking for expressiveness and alignment with human perception (Yang et al., 17 Apr 2025, Gao et al., 2024, Li et al., 16 Jul 2025).
Automated Multimodal LLM Assessment: GPT-4o-audio and Gemini are used for automated evaluation of emotional speech, with strong correlation to human ratings (Pearson r≥0.85 for both MOS and emotional fidelity) and efficiency gains in large-scale assessment (Yang et al., 17 Apr 2025).
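Three of the quantities named above have simple closed forms, sketched here with the standard library only. The sketch assumes that mel-cepstral frames are already time-aligned (for example by dynamic time warping, which is not shown) and follows the common convention of excluding the 0th (energy) coefficient from MCD; function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (speaker or emotion)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mel_cepstral_distortion(reference: Sequence[Sequence[float]],
                            synthesized: Sequence[Sequence[float]]) -> float:
    """Mean MCD in dB over aligned frames, skipping coefficient 0."""
    if not reference or len(reference) != len(synthesized):
        raise ValueError("inputs must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        squared = sum((r - s) ** 2 for r, s in zip(ref_frame[1:], syn_frame[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between automated and human ratings."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Lower MCD indicates closer spectral match to the reference, while higher cosine similarity and Pearson r indicate closer agreement of embeddings and of rating series, respectively.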

State-of-the-art systems provide continuous or categorical control over emotion dimensions by adjusting spherical coordinates ([r, θ, φ]), editing embedding vectors, or specifying natural language prompts. In the prompt-based case, LLMs perform inference-time semantic parsing or CoT reasoning to translate prompts into prosodic targets (Cho et al., 2024, Cho et al., 2024, Yang et al., 17 Apr 2025). Adversarial and orthogonality-based regularization further support stable, fine-grained controllability and robust generalization to unseen speakers or emotions.

6. Limitations and Open Research Directions

Key limitations cited in the literature include:

Restricted coverage to a limited set of discrete emotions or insufficient control granularity; ongoing work aims at continuous or multidimensional affect spaces and phoneme/word-level modulation (Cho et al., 2024, Cho et al., 2024, Gao et al., 2024).
Manual annotation cost and scalability: Pseudo-labeling with pretrained SER models reduces cost, but edge cases and annotation noise, particularly in high-dimensional affect spaces, remain (Bian et al., 2024, Cho et al., 2024).
Cross-lingual and unseen speaker generalization: Zero-shot TTS with strong speaker and emotion disentanglement is an active area, with dedicated regularization and multi-level encoding strategies improving performance (Cho et al., 2024).
Quality and expressivity under strong prosodic deformation: Maintaining perceived naturalness and stable local emphasis during aggressive emotion or emphasis transformation is tackled with innovations such as EPE blocks and controlled variance features (Li et al., 16 Jul 2025).
Automating trustworthy evaluation at scale: Multimodal LLMs show strong promise, but require further validation across more diverse datasets, languages, and affect categories (Yang et al., 17 Apr 2025).
Fusion of emotion with other speech properties (e.g., emphasis, speaker identity) requires careful modeling to avoid leakage or loss of control (Li et al., 16 Jul 2025, Cho et al., 2024).

Future directions highlighted in the literature include extending controllability mechanisms to more dimensions and temporal scales, integrating emotion and emphasis modeling at all representation levels, more comprehensive cross-lingual evaluation, advanced automated assessment, and further synergies with instruction-following speech-LLMs (Wang et al., 2024, Li et al., 16 Jul 2025, Cho et al., 2024, Yang et al., 17 Apr 2025).