- The paper introduces EmoNet-Voice, a pair of datasets: EmoNet-Voice Big, with over 4,500 hours of synthetic emotional speech, and EmoNet-Voice Bench, an expert-annotated evaluation set.
- The methodology leverages state-of-the-art TTS synthesis and refined Whisper encoders with regression heads to capture detailed emotion intensity.
- Empirical evaluations show superior recognition of high-arousal emotions while revealing challenges in accurately detecting lower-arousal, cognitive states.
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
Introduction
This paper introduces EmoNet-Voice, a multi-faceted resource for Speech Emotion Recognition (SER) that addresses core limitations of existing datasets: insufficient emotional granularity, limited representativeness, and restricted scalability. Two complementary datasets are provided: EmoNet-Voice Big, a synthetic corpus for pretraining models on a broad spectrum of 40 emotion categories across multiple languages, and EmoNet-Voice Bench, a tightly curated benchmark annotated by psychology experts for both emotional presence and intensity.
EmoNet-Voice Big provides over 4,500 hours of emotion-labeled synthetic speech across diverse voices and four languages (English, German, Spanish, French), generated with state-of-the-art text-to-speech models. This synthetic approach enables privacy-preserving coverage of sensitive emotional states that are typically absent from SER datasets. EmoNet-Voice Bench comprises 12,600 audio samples annotated by psychology experts for perceived emotion and intensity, ensuring fidelity to the 40-category taxonomy.
Building on these datasets, the EmpathicInsight-Voice models achieve strong agreement with human expert annotations and outperform existing methods, with particular proficiency on high-arousal emotions such as anger. The evaluations also expose a clear performance gap on lower-arousal states such as concentration, offering critical insight into the limits of current SER paradigms.
Dataset Construction
Emotion Taxonomy
EmoNet-Voice leverages a 40-category taxonomy capturing a diverse emotional palette, enabling nuanced distinction beyond basic emotions. This taxonomy includes positive, negative, cognitive, and socially-mediated emotions, aligning with constructionist theories in affective science.
EmoNet-Voice Big
EmoNet-Voice Big is synthesized with the GPT-4 OmniAudio model, producing a large collection of emotionally expressive voice samples that simulate authentic prosody and timbre. Audio snippets range from 3 to 30 seconds and span diverse languages, speaker voices, and genders. Massively scalable and multilingual, the dataset serves as an open resource for SER and TTS research.
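The loop below is an illustrative sketch of this generation pipeline, not the authors' exact procedure: the synthesize() helper, the voice identifiers, and the small emotion and script lists are hypothetical placeholders; only the languages and the 3 to 30 second clip length come from the description above.

```python
# Illustrative generation loop for EmoNet-Voice Big as described above.
# synthesize() is a hypothetical stand-in for the TTS backend; voices and
# scripts are placeholders, and only the languages and clip-length target
# are taken from the paper summary.
import itertools

EMOTIONS = ["anger", "concentration"]        # two of the 40 taxonomy categories
LANGUAGES = ["en", "de", "es", "fr"]         # English, German, Spanish, French
VOICES = ["voice_a", "voice_b"]              # placeholder speaker identities

def synthesize(text: str, emotion: str, language: str, voice: str) -> bytes:
    """Hypothetical TTS call; a real backend would return a 3-30 s emotional clip."""
    return b""  # placeholder audio bytes

def generate_corpus(scripts: dict[str, list[str]]) -> list[dict]:
    """Cross every emotion, language, and voice with the scripts for that language."""
    samples = []
    for emotion, language, voice in itertools.product(EMOTIONS, LANGUAGES, VOICES):
        for text in scripts.get(language, []):
            samples.append({
                "audio": synthesize(text, emotion, language, voice),
                "emotion": emotion, "language": language,
                "voice": voice, "text": text,
            })
    return samples
```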
EmoNet-Voice Bench
EmoNet-Voice Bench is a benchmark annotated by human experts, providing nuanced multi-label emotion ratings. This annotation framework is structured with a rigorous consensus protocol to establish reliable intensity markers across its 40-category taxonomy. The detailed agreement distribution is visually depicted (Figure 1).
Figure 1: Annotator agreement for human ratings on perceived emotions in audio samples, highlighting the distribution of agreement types.
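The paper's exact consensus protocol is not reproduced here, but a minimal sketch under stated assumptions (an emotion counts as present when at least two annotators perceive it, and its consensus intensity is the mean of their ratings) shows how per-annotator presence and intensity judgments can be aggregated into benchmark labels.

```python
# Minimal consensus-aggregation sketch. The two-of-three agreement threshold
# and mean-intensity rule are illustrative assumptions, not the paper's
# documented protocol.
from statistics import mean

def consensus(ratings: list[int | None], min_agree: int = 2) -> float | None:
    """ratings: one intensity value per annotator for a single emotion,
    with None meaning the annotator did not perceive that emotion."""
    perceived = [r for r in ratings if r is not None]
    if len(perceived) < min_agree:
        return None                 # no consensus that the emotion is present
    return mean(perceived)          # consensus intensity over agreeing annotators

print(consensus([2, 3, None]))      # 2.5 -> emotion present with moderate intensity
print(consensus([1, None, None]))   # None -> no consensus on presence
```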
EmpathicInsight-Voice: State-of-the-art SER Models
EmpathicInsight-Voice models represent a substantial advance in SER capability, addressing limitations of existing architectures. They pair Whisper encoders, adapted through continued training, with specialized MLP regression heads, and are trained on EmoNet-Voice Big in two stages: the encoder first learns generalized emotional representations, and focused regression heads are then trained to predict emotion intensity from the resulting multi-dimensional audio embeddings.
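A minimal sketch of this encoder-plus-regression-head design is shown below, assuming a mean-pooled Whisper encoder output and a small MLP head; the checkpoint name, pooling strategy, and head dimensions are assumptions rather than the authors' exact configuration.

```python
# Sketch of the EmpathicInsight-Voice recipe: a Whisper encoder (further
# trained on EmoNet-Voice Big in stage one) whose pooled embeddings feed an
# MLP regression head predicting an intensity for each of the 40 emotions.
# Checkpoint, pooling, and head sizes are assumptions.
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

NUM_EMOTIONS = 40  # size of the EmoNet-Voice taxonomy

class EmotionIntensityModel(nn.Module):
    def __init__(self, whisper_name: str = "openai/whisper-base"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(whisper_name).encoder
        hidden = self.encoder.config.d_model
        # Stage two: a lightweight regression head on top of pooled embeddings.
        self.head = nn.Sequential(
            nn.Linear(hidden, 512),
            nn.GELU(),
            nn.Linear(512, NUM_EMOTIONS),
        )

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: batched log-mel spectrograms from WhisperFeatureExtractor.
        states = self.encoder(input_features).last_hidden_state  # (B, T, H)
        pooled = states.mean(dim=1)                              # mean-pool over time
        return self.head(pooled)                                 # (B, 40) intensities

if __name__ == "__main__":
    model = EmotionIntensityModel()
    extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
    clip = [0.0] * 16000  # one second of silence as a stand-in for a real sample
    feats = extractor(clip, sampling_rate=16000, return_tensors="pt").input_features
    print(model(feats).shape)  # torch.Size([1, 40])
```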
Our evaluations on EmoNet-Voice Bench indicate that EmpathicInsight-Voice outperforms existing SER models (Gemini, GPT-4o, Hume Voice), with lower refusal rates and higher correlation with expert ratings (Table 1). Notably, performance on high-arousal emotions is consistently stronger than on low-arousal or cognitive emotional states, which remain challenging.
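The snippet below sketches the kind of comparison summarized in Table 1, assuming predicted and expert intensity matrices of shape (samples, emotions), Spearman rank correlation as the agreement metric, and refusals encoded as NaN rows; all three choices are assumptions for illustration.

```python
# Hedged evaluation sketch: per-emotion rank correlation between model
# predictions and expert consensus intensities, plus a refusal rate.
# Spearman correlation and NaN-encoded refusals are assumptions.
import numpy as np
from scipy.stats import spearmanr

def per_emotion_correlation(pred: np.ndarray, expert: np.ndarray) -> dict[int, float]:
    """pred, expert: arrays of shape (num_samples, num_emotions)."""
    return {
        e: spearmanr(pred[:, e], expert[:, e], nan_policy="omit").correlation
        for e in range(expert.shape[1])
    }

def refusal_rate(pred: np.ndarray) -> float:
    """Fraction of samples where the system declined to answer (all-NaN row)."""
    return float(np.isnan(pred).all(axis=1).mean())

rng = np.random.default_rng(0)
pred = rng.random((100, 40))        # stand-in model scores
expert = rng.random((100, 40))      # stand-in expert consensus intensities
print(np.mean(list(per_emotion_correlation(pred, expert).values())))
print(refusal_rate(pred))           # 0.0 for this synthetic example
```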
Figure 2: Instructions given to the human annotator for the expert annotation of EmoNet-Voice Bench.
Discussion
The relationship between annotator agreement and model performance underscores the inherent difficulty of subjective emotion recognition. High-arousal emotions tend to produce distinctive acoustic cues that current models capture well. In contrast, cognitive emotions depend on nuanced context rather than immediate prosodic features and call for advances in context-aware modeling. Recognizing these biases is crucial for real-world applications.
Conclusion
EmoNet-Voice provides a rigorous foundation for SER advancement through its expansive synthetic dataset and expert-validated benchmark. It holds potential for evolving next-generation SER systems and facilitating expressive virtual interactions. Future research should aim to integrate contextual attributes and explore robust multimodal processing, ultimately bridging cognitive recognition gaps in AI-assisted human communication.
References
- Schuhmann et al., "EmoNet-Face," 2025.
- Radford et al., "Whisper," 2023.
- Barrett, L. F., "Theory of Constructed Emotion," 2017.