EmoNet-Voice Bench: SER Evaluation
- EmoNet-Voice Bench is a synthetic benchmark for speech emotion recognition that integrates over 35 hours of expert-labeled, multi-language audio across 40 emotion categories.
- It employs a rigorous annotation protocol in which psychology experts rate clip intensities on a three-point scale, yielding fine-grained emotion intensity labels.
- Evaluation using the EmpathicInsight-Voice model shows state-of-the-art performance, with moderate-to-strong correlation with human perception; high-arousal states remain easier to recognize than low-arousal ones.
EmoNet-Voice Bench is a comprehensive evaluation resource for speech emotion recognition (SER) and speech emotion detection, designed to address the lack of granularity, privacy, and real-world representativeness in existing SER benchmarks. It combines a large, synthetic, and privacy-preserving dataset—EmoNet-Voice Big (4,500+ hours, 40 emotions, 4 languages)—with a rigorously curated benchmark split, EmoNet-Voice Bench, composed of human expert-verified intensity labels on fine-grained synthetic audio (Schuhmann et al., 11 Jun 2025).
1. Dataset Structure and Taxonomy
EmoNet-Voice Bench comprises 12,600 unique synthetic audio clips totaling approximately 35.8 hours, generated using high-fidelity text-to-speech engines simulating actors enacting short scenes engineered to evoke one of 40 distinct emotion categories. These categories are derived from a taxonomy originally designed for EmoNet-Face, and span a wide spectrum, including high-arousal affective states (e.g., anger, teasing, embarrassment), low-arousal states (e.g., concentration, emotional numbness), and nuanced cognitive-emotional conditions (e.g., contemplation). The language distribution skews toward English (49%), with the remainder split among German (15%), Spanish (17%), and French (19%); 11 synthetic voices (six female, five male) provide speaker diversity.
All clips are synthesized explicitly to model acting, enabling inclusion of emotionally-sensitive states and avoiding privacy concerns common to real-world datasets. This approach also ensures a controlled mapping between prompt intent and audio content, supporting precise benchmarking of model recognition capabilities.
2. Annotation Protocol and Intensity Measurement
Annotation is performed exclusively by psychology experts (each with a Bachelor’s or higher in Psychology) on a per-clip, per-emotion basis. For each synthetic audio snippet, experts rate the perceived intensity of the target emotion using a three-point scale:
| Intensity Score | Semantic Meaning |
|---|---|
| 0 | Not perceived |
| 1 | Mildly present |
| 2 | Strongly present |
Ratings are performed independently by two experts per clip, with a third (and optionally fourth) expert annotator providing tie-breaker or confirmation inputs. Annotators are blinded to peer labels. Quality control is maintained through balanced gender assignments and hierarchical review of ambiguous cases. The resulting annotation matrix spans over 33,600 single-emotion labels.
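The two-annotator-plus-tie-breaker protocol above can be sketched as a small aggregation function. The exact aggregation rule is not spelled out in the source, so the logic below (agreement wins, otherwise defer to the tie-breaker, otherwise average) is an illustrative assumption:

```python
def aggregate_ratings(primary, secondary, tiebreak=None):
    """Combine independent expert intensity ratings (0/1/2) for one
    clip-emotion pair.

    If the two primary annotators agree, their shared score is final;
    otherwise a third tie-breaking rating is used when available, else
    the two ratings are averaged. This rule is a hypothetical sketch,
    not the published protocol.
    """
    if primary == secondary:
        return float(primary)
    if tiebreak is not None:
        return float(tiebreak)
    return (primary + secondary) / 2.0
```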
For benchmarking, labels are mapped to a continuous scale (0–10) for regression and ranking tasks, facilitating quantitative assessment of model predictions against human perception. Inter-annotator reliability is reported using Cronbach’s α, with an overall value of 0.14—a low figure interpreted as reflecting the intrinsic difficulty of fine-grained emotion judgments rather than annotation noise alone.
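Two computations from this paragraph can be made concrete: the rescaling of discrete 0–2 ratings onto the 0–10 benchmark scale (the paper's exact mapping is not specified here, so a linear rescaling is assumed) and the standard Cronbach's α over an items-by-raters score matrix:

```python
import numpy as np

def to_continuous(score, lo=0, hi=2, scale=10):
    """Linearly map a discrete 0-2 intensity rating onto a 0-10 scale.
    The linear mapping is an assumption; the source only states that
    labels are mapped to a continuous 0-10 scale."""
    return scale * (score - lo) / (hi - lo)

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_items, n_raters) matrix of scores."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                       # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)   # variance per rater column
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)
```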
3. Synthetic Audio Generation Principles
The dataset is constructed using GPT-4 OmniAudio, with a prompting strategy that casts the system as an "actor auditioning for a film," explicitly instructed to portray one of the 40 emotion categories with strong initial expression and natural speech rhythm. Prompts are crafted to elicit prosodic, dynamic, and intonational markers, and segment durations range from 3 to 30 seconds in WAV format (24 kHz). This controlled synthesis enables expressivity across the full taxonomy, including emotions considered ethically challenging (e.g., shame, deep sadness), circumventing privacy concerns and actor recruitment constraints.
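The "actor audition" framing described above can be sketched as a prompt template. The authors' exact wording is not reproduced in the source, so this template is a hypothetical reconstruction of the strategy, not the actual prompt:

```python
def build_acting_prompt(emotion, language="English"):
    """Hypothetical reconstruction of the 'actor auditioning for a film'
    prompting strategy: instruct the TTS system to enact a short scene
    portraying one target emotion with strong initial expression and
    natural speech rhythm."""
    return (
        f"You are an actor auditioning for a film. Perform a short scene "
        f"in {language} that portrays the emotion '{emotion}'. Begin with "
        f"a strong initial expression and keep a natural speech rhythm."
    )
```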
4. Benchmark Evaluation Metrics and Model Assessment
EmoNet-Voice Bench enables regression and ranking-based evaluation of SER models. Statistical metrics include:
- Mean Absolute Error: $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$
- Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$
- Pearson’s $r$ and Spearman’s $\rho$ for correlation with expert scores
- Cronbach’s $\alpha$ for inter-annotator reliability: $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k}\sigma_j^2}{\sigma_{\text{total}}^2}\right)$

where $y_i$ is the expert-derived intensity, $\hat{y}_i$ the model prediction, and $k$ the number of raters.
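The regression and correlation metrics above can be implemented in a few lines. The sketch below uses only the standard library; the rank computation in Spearman's ρ assumes no ties, which suffices for illustration:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between expert scores and model predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation (no tie handling, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))
```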
Evaluation using the EmpathicInsight-Voice Large model yields state-of-the-art performance on these metrics, indicating moderate-to-strong predictive alignment with human perception across most categories. Model consistency and robustness are tested across diverse emotions, highlighting systematic performance differences between high-arousal categories (easier to detect) and low-arousal states (markedly harder for models).
Comparison against commercial models (e.g., GPT‑4o, Hume Voice) reveals high refusal rates for sensitive categories, underscoring the utility of the privacy-preserving, synthetic approach in EmoNet-Voice Bench.
5. Technical Framework: EmpathicInsight-Voice Model
EmpathicInsight-Voice models serve as a strong baseline for emotion recognition in speech:
- Whisper-based encoder pre-trained on the full EmoNet-Voice Big corpus for robust acoustic feature extraction.
- Ensemble MLP “expert” predictors—one for each emotion category—trained to regress perceived intensity for each dimension independently.
- Parameter counts: Small (74M), Large (148M), reflecting varying MLP sizes; encoder weights frozen during expert head training.
This architecture supports fine-grained, multi-emotion regression and exhibits high throughput and scalability, with minimal safety-related refusals and the strongest error and correlation scores reported on EmoNet-Voice Bench.
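The "frozen encoder plus per-emotion MLP expert heads" design can be sketched as follows. Layer sizes, feature dimension, and the random-feature stand-in for the Whisper-based encoder are illustrative assumptions, not the published configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

class EmotionExpertHead:
    """One MLP 'expert' regressing perceived intensity for a single
    emotion from frozen encoder features. Sizes are assumptions."""
    def __init__(self, feat_dim=1280, hidden=256):
        self.w1 = rng.normal(0.0, 0.02, (feat_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, feats):
        h = np.maximum(feats @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return (h @ self.w2 + self.b2).squeeze(-1)      # scalar intensity

# One independent head per emotion category (three of the 40 shown);
# the frozen encoder is stubbed with random features here.
EMOTIONS = ["anger", "teasing", "contemplation"]
heads = {e: EmotionExpertHead() for e in EMOTIONS}
features = rng.normal(size=(4, 1280))   # batch of 4 encoder outputs
scores = {e: head(features) for e, head in heads.items()}
```

Freezing the encoder and training each head independently keeps the per-emotion predictors cheap to train and lets new emotion categories be added without retraining the shared backbone.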
6. Impact, Applications, and Research Significance
EmoNet-Voice Bench establishes a new standard for SER by:
- Enabling systematic evaluation across 40 emotion categories at varying intensities, with multi-language, multi-voice coverage.
- Providing rigorous, expert-verified annotation and privacy-preserving data, enabling inclusion of stigmatized or sensitive emotions absent from natural recordings.
- Facilitating the development of context-aware, multimodal, and fine-grained emotion recognition systems, with empirical benchmarking tailored to both high- and low-arousal/cognitively-rich states.
- Supporting robust, transparent assessment of model strengths and gaps, stimulating improvement in real-world, emotionally intelligent AI applications in HCI, TTS, mental health, affective computing, and social robotics.
Key findings demonstrate the persistent challenge of recognizing subtle affective states and the utility of controlled synthesis for benchmarking. EmoNet-Voice Bench, together with the EmpathicInsight-Voice models, provides a scalable, transparent, and ethically sound foundation for next-generation SER research and deployment.