- The paper’s main contribution is a multilingual benchmark assessing SER model generalization in both in-domain and out-of-domain scenarios.
- It uses a simple per-frame MLP classification head: the MLP scores each feature frame independently and the per-frame predictions are averaged, which preserves temporal detail and improves prediction consistency.
- Benchmark results reveal that larger Whisper ASR models outperform dedicated self-supervised (SSL) speech models, challenging the assumption that ASR-oriented models are poorly suited to cross-lingual SER.
This paper, "SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition" (2408.07851), addresses the critical challenge of generalization in Speech Emotion Recognition (SER) models, particularly their performance across diverse languages and data distributions not seen during training. Existing SER benchmarks often focus on a limited number of datasets and primarily evaluate in-domain performance, which doesn't reflect the variability encountered in real-world applications.
To tackle this, the authors propose a large-scale, multilingual benchmark designed to rigorously evaluate SER models in both in-domain and out-of-domain (zero-shot cross-corpus) settings. The benchmark utilizes a curated collection of multilingual datasets (listed in Table 1), emphasizing less commonly used corpora to promote the development of more robust and adaptable models. Datasets are preprocessed, including setting a maximum audio length of 30 seconds, and emotion labels are mapped to a unified 8-class space to facilitate cross-dataset comparisons.
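The preprocessing step can be pictured with a short sketch. The snippet below is illustrative only: the unified class names, the per-dataset label mapping, and the `preprocess` helper are assumptions for exposition, not the paper's exact taxonomy or code.

```python
# Illustrative preprocessing sketch: truncate audio to 30 s and map
# dataset-specific labels into a shared 8-class space.
import numpy as np

SAMPLE_RATE = 16_000
MAX_SECONDS = 30

# Hypothetical unified label space (8 classes) and an example raw-to-unified mapping.
UNIFIED_CLASSES = ["anger", "disgust", "fear", "happiness",
                   "neutral", "sadness", "surprise", "other"]
LABEL_MAP = {"ang": "anger", "hap": "happiness", "neu": "neutral", "sad": "sadness"}

def preprocess(waveform: np.ndarray, raw_label: str):
    """Truncate to at most 30 s and convert a raw label to a unified class index."""
    waveform = waveform[: SAMPLE_RATE * MAX_SECONDS]
    label_idx = UNIFIED_CLASSES.index(LABEL_MAP[raw_label])
    return waveform, label_idx
```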
The evaluation framework employs several state-of-the-art speech representation models as backbones (detailed in Table 2), including self-supervised learning (SSL) models like wav2vec 2.0, HuBERT, WavLM, and MERT, as well as contrastive models like CLAP, and notably, Automatic Speech Recognition (ASR) models from the Whisper family.
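As a rough illustration of how frame-level features might be pulled from one such backbone, the sketch below uses the Hugging Face `transformers` Whisper implementation; the paper's own extraction details (model size, layer choice, pooling) may differ.

```python
# Minimal sketch: extract frame-level encoder features from a Whisper backbone.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
backbone = WhisperModel.from_pretrained("openai/whisper-base").eval()

def extract_features(waveform, sampling_rate=16_000):
    """Return encoder hidden states of shape (num_frames, hidden_dim)."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        encoder_out = backbone.encoder(inputs.input_features)
    return encoder_out.last_hidden_state.squeeze(0)
```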
A simple Multilayer Perceptron (MLP) with around 500K parameters is trained on top of the backbone features for emotion classification. A key implementation detail is that instead of averaging the features from an utterance before classification, the MLP is applied to each feature frame, and the predictions are then averaged. This method is reported to preserve more temporal information and yield stronger, more consistent results. The MLP is trained for 100 epochs with label smoothing (α=0.1).
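A minimal sketch of this per-frame head is shown below, assuming PyTorch; the hidden sizes, the choice of averaging logits (rather than probabilities), and the `FrameWiseMLP` name are illustrative rather than taken from the paper.

```python
# Sketch of a per-frame MLP head: score every backbone frame, then average
# the frame-level logits into a single utterance-level prediction.
import torch
import torch.nn as nn

class FrameWiseMLP(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int = 8, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) -> per-frame logits -> average over time
        per_frame_logits = self.mlp(frames)      # (batch, time, num_classes)
        return per_frame_logits.mean(dim=1)      # (batch, num_classes)

# Training uses cross-entropy with label smoothing, as in the paper (α = 0.1).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```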
To enable fair comparisons across datasets with varying class distributions, especially in out-of-domain scenarios, the benchmark incorporates logit adjustment during evaluation.
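A minimal sketch of logit adjustment at evaluation time, assuming the standard formulation in which the log of the training-set class priors is subtracted from the logits; the benchmark's exact variant and any temperature parameter are assumptions here.

```python
# Sketch: remove the training-set class prior from the model's logits.
import torch

def adjust_logits(logits: torch.Tensor, train_class_counts: torch.Tensor,
                  tau: float = 1.0) -> torch.Tensor:
    """Subtract tau * log(prior) so skewed training distributions do not dominate."""
    priors = train_class_counts.float() / train_class_counts.sum()
    return logits - tau * torch.log(priors + 1e-12)
```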
The performance of the models is assessed using macro-averaged F1 score over a matrix whose rows are training datasets and whose columns are evaluation datasets. The reported metrics include the following (a small computational sketch follows the list):
- In-domain separability: Average performance when training and testing on the same dataset.
- Out-of-domain performance given training dataset: Average performance on eligible unseen datasets when trained on a specific dataset.
- Average performance on unseen datasets: Average performance on a specific dataset when the model was trained on any other eligible dataset.
- A weighted average metric combining aspects of in-domain and out-of-domain performance and variability.
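Under the simplifying assumption that every other dataset counts as an eligible unseen dataset, these summary statistics can be derived from the score matrix roughly as follows (the `summarize` helper is hypothetical):

```python
# Sketch: derive in-domain and out-of-domain summaries from a macro-F1 matrix,
# where f1_matrix[i, j] = score when training on dataset i and testing on dataset j.
import numpy as np

def summarize(f1_matrix: np.ndarray):
    n = f1_matrix.shape[0]
    in_domain = np.diag(f1_matrix)                        # train == test
    off_diag = f1_matrix[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    ood_given_train = off_diag.mean(axis=1)               # row-wise: OOD given training dataset
    ood_on_dataset = (f1_matrix.sum(axis=0) - in_domain) / (n - 1)  # column-wise: performance when unseen
    return in_domain, ood_given_train, ood_on_dataset
```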
A surprising key finding is that the Whisper ASR model, particularly the larger variants (Whisper-Large-v2, Whisper-Large, Whisper-Large-v3), consistently outperforms dedicated SSL models in cross-lingual SER tasks, demonstrating superior generalization capabilities. This challenges the common assumption that ASR models, focused on linguistic content, are suboptimal for paralinguistic tasks like SER.
The results also highlight significant variability in out-of-domain performance depending on the specific combination of training and testing datasets, suggesting that model robustness across diverse real-world conditions remains a challenge. Some datasets are inherently more difficult for models to generalize to when not trained on them.
For practitioners, the paper provides a valuable benchmark dataset collection and an open-source code base (serval on GitHub) implementing the evaluation protocol. This allows others to reproduce the experiments, evaluate new models, and compare performance rigorously. The finding about Whisper's effectiveness suggests that leveraging pre-trained ASR models can be a strong baseline or starting point for cross-lingual SER applications, potentially reducing the need for extensive emotion-specific labeled data or complex model architectures.
The authors point to future work including advanced domain adaptation techniques, few-shot/meta-learning, and further investigation into dataset characteristics that impact generalization, which are all crucial areas for improving SER systems for real-world deployment.
The weighted average metric used for overall model ranking is calculated as:
Weighted Average = (Average OOD + Average ID) / 2 − λ_factor × (Std. Dev. OOD + Std. Dev. ID) / 2
where λ_factor is set to 1.0. This metric penalizes high variability in performance across different train/test scenarios, favoring models that are more consistently robust.
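A small sketch of that ranking metric, assuming the reconstruction above and λ_factor = 1.0 (the `weighted_average` helper is hypothetical):

```python
# Sketch: combine mean and standard deviation of in-domain (ID) and
# out-of-domain (OOD) scores into a single ranking value.
import numpy as np

def weighted_average(id_scores, ood_scores, lambda_factor: float = 1.0) -> float:
    id_scores, ood_scores = np.asarray(id_scores), np.asarray(ood_scores)
    mean_term = (ood_scores.mean() + id_scores.mean()) / 2
    std_term = (ood_scores.std() + id_scores.std()) / 2
    return mean_term - lambda_factor * std_term
```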