MERaLiON-SER: Robust Multilingual SER
- MERaLiON-SER is a family of speech emotion recognition models designed for multilingual and fine-grained affect modeling using both categorical and dimensional representations.
- It integrates a frozen Whisper-Medium encoder with LoRA adapters, multiscale attention pooling, and modified ECAPA-TDNN blocks to capture short-term prosody and long-term context.
- Its parameter-efficient fine-tuning and superior empirical performance across diverse datasets make MERaLiON-SER ideal for integration into multimodal, emotion-aware systems.
MERaLiON-SER is a family of robust speech emotion recognition (SER) models whose architecture, training, and evaluation have been specifically designed for both English and Southeast Asian (SEA) languages. MERaLiON-SER systems distinguish themselves through hybrid loss design—jointly optimizing categorical and dimensional affect representations—extensive multilingual benchmarking, and modular architectures that exploit parameter-efficient fine-tuning for speech-only tasks. The models demonstrate marked performance improvements over open-source SER systems and large-scale generic Audio-LLMs, especially in multilingual and fine-grained scenarios.
1. Model Architecture
The primary MERaLiON-SER architecture integrates a frozen Whisper-Medium encoder as its backbone for acoustic feature extraction. This backbone is augmented by Low-Rank Adaptation (LoRA) adapters inserted into each multi-head attention layer (key, query, value) for parameter-efficient fine-tuning. The downstream stack incorporates multiscale/hierarchical attention pooling layers to aggregate both short-term prosodic cues and long-term context. Modified ECAPA-TDNN blocks, which replace BatchNorm with GroupNorm for minibatch stability, further propagate, aggregate, and emphasize channel attention features.
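LoRA keeps a pretrained projection W frozen and learns only a low-rank update, so the adapted layer computes W + (α/r)·BA. As a minimal numpy sketch (the hidden size, rank, and scaling below are illustrative, not the paper's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 1024, 8                                   # hidden size and LoRA rank (illustrative)

W_frozen = rng.normal(scale=0.02, size=(D, D))   # pretrained projection, never updated
A = rng.normal(scale=0.01, size=(R, D))          # trainable down-projection
B = np.zeros((D, R))                             # trainable up-projection, zero-initialized
alpha = 16.0                                     # LoRA scaling factor (illustrative)

def lora_linear(x):
    """Frozen projection plus scaled low-rank update: x @ (W + (alpha/R) * B @ A).T"""
    return x @ W_frozen.T + (alpha / R) * (x @ A.T) @ B.T

x = rng.normal(size=(4, D))
out = lora_linear(x)
```

With B zero-initialized, the adapted layer starts out identical to the frozen one, and training touches only 2·D·R parameters per projection instead of D², which is what makes the fine-tuning parameter-efficient.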
The output layer employs two parallel “emotion heads”:
- a categorical head (linear + softmax) for discrete emotion classes,
- and a dimensional head (linear + sigmoid) for continuous arousal, valence, and dominance scores.
Architecture Flow
```
 Whisper-Medium Encoder
           ↓
      LoRA Adapters
           ↓
Multiscale Attention Pooling
           ↓
 ECAPA-TDNN with GroupNorm
          ↙  ↘
 Softmax Head    Sigmoid Head
 (Categorical)   (Dimensional)
```
The design emphasizes the capture and integration of both discrete and continuous representations, supporting comprehensive affect modeling. Multi-head outputs facilitate simultaneous optimization for classification and regression objectives.
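The dual-head output stage can be sketched in a few lines of numpy (layer sizes and random weights here are placeholders, not the model's actual parameters; the real encoder and ECAPA-TDNN stack are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: pooled embedding -> 7 emotion classes, 3 affect dimensions.
EMB_DIM, N_CLASSES, N_DIMS = 192, 7, 3
W_cat = rng.normal(scale=0.02, size=(EMB_DIM, N_CLASSES))
W_dim = rng.normal(scale=0.02, size=(EMB_DIM, N_DIMS))

def emotion_heads(pooled_embedding):
    """Parallel heads: categorical (linear + softmax), dimensional (linear + sigmoid)."""
    cat_probs = softmax(pooled_embedding @ W_cat)    # distribution over discrete emotions
    avd_scores = sigmoid(pooled_embedding @ W_dim)   # arousal/valence/dominance in (0, 1)
    return cat_probs, avd_scores

pooled = rng.normal(size=(2, EMB_DIM))  # stand-in for attention-pooled encoder output
cat, avd = emotion_heads(pooled)
```

Because the two heads share the pooled embedding, gradients from the classification and regression losses both shape the same upstream representation.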
2. Training Objectives
MERaLiON-SER employs a hybrid objective function to jointly perform discrete emotion classification and dimensional emotion regression, ensuring robust modeling along multiple axes of affect.
2.1 Weighted Categorical Cross-Entropy
To address class imbalance, the cross-entropy loss is class-weighted with inverse-frequency weights:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} w_c \, y_c \log \hat{y}_c, \qquad w_c \propto \frac{1}{N_c},$$

where $y_c$ and $\hat{y}_c$ are the ground-truth and predicted probabilities for class $c$, and $N_c$ is the training count of class $c$. Label smoothing is applied to regularize the targets.
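A minimal numpy sketch of this loss (the smoothing value `eps=0.1` is illustrative; the paper's smoothing coefficient is not reproduced in this summary):

```python
import numpy as np

def weighted_smoothed_ce(probs, labels, class_counts, eps=0.1):
    """Class-weighted cross-entropy with label smoothing.

    probs: (N, C) predicted probabilities; labels: (N,) integer class ids;
    class_counts: (C,) training-set counts used for inverse-frequency weights.
    eps is an illustrative smoothing value, not taken from the paper.
    """
    n, c = probs.shape
    weights = 1.0 / np.asarray(class_counts, dtype=float)
    weights = weights * c / weights.sum()        # normalize so weights average to 1
    targets = np.full((n, c), eps / (c - 1))     # smoothed off-target mass
    targets[np.arange(n), labels] = 1.0 - eps    # most mass stays on the true class
    per_sample = -(targets * np.log(probs + 1e-12)).sum(axis=1)
    return float((weights[labels] * per_sample).mean())

probs = np.array([[0.7, 0.2, 0.1]])
counts = [100, 10, 10]                 # one frequent class, two rare ones
loss_freq = weighted_smoothed_ce(probs, np.array([0]), counts)
loss_rare = weighted_smoothed_ce(probs, np.array([1]), counts)
```

Under inverse-frequency weighting, a misclassified rare-class sample contributes more to the loss than a frequent-class one with the same predicted distribution.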
2.2 Concordance Correlation Coefficient (CCC) Loss
For each continuous dimension (arousal, valence, dominance), the concordance correlation coefficient between targets $y$ and predictions $\hat{y}$ is

$$\mathrm{CCC} = \frac{2\rho\,\sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2},$$

where $\rho$ is the Pearson correlation and $\mu$, $\sigma$ denote per-dimension means and standard deviations. The corresponding loss is

$$\mathcal{L}_{\mathrm{CCC}} = 1 - \mathrm{CCC},$$

averaged over the three dimensions.
2.3 Combined Training Objective
$$\mathcal{L} = \lambda_{\mathrm{CE}}\,\mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{CCC}}\,\mathcal{L}_{\mathrm{CCC}},$$

where $\lambda_{\mathrm{CE}}$ and $\lambda_{\mathrm{CCC}}$ balance the classification and regression terms.
This joint objective ensures preservation of discrete emotion boundaries while capturing nuanced affective variation.
3. Datasets and Preprocessing
3.1 Training Corpora
- SG-ECMT (pseudo-labeled): 66,209 utterances across English (27,458), Chinese (14,212), Malay (14,370), and Tamil (10,169); labels via emotion2vec two-pass agreement, 10–30 s segments.
- SGTV (human-annotated): ∼117,000 utterances from Singapore TV/movies, containing English/Mandarin/code-mixed content; seven discrete labels.
- Public Datasets: CREMA-D, M3ED, ESD, MELD.
3.2 Evaluation Sets
- Manual SG-ECMT evaluation: 466–479 per language with majority-vote human labels.
- Benchmarks: MSP-Podcast, IEMOCAP (5-fold), MELD (English); M3ED (Chinese); IndoWaveSentiment (Indonesian).
3.3 Feature Extraction & Augmentation
- Audio: resampled to 16 kHz, 80-dim mel-spectrogram (25 ms window, 10 ms hop).
- Mean-variance normalization per utterance.
- On-the-fly augmentation: MixUp, MUSAN additive noise, speed perturbation (factors 0.9 and 1.1).
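MixUp forms a convex combination of two training waveforms and their label vectors, with the mixing coefficient drawn from a Beta distribution. A minimal sketch (the Beta parameter `alpha=0.2` is illustrative; the paper's MixUp parameters are not reproduced in this summary):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """MixUp: convex combination of two waveforms and their label vectors.

    alpha parametrizes Beta(alpha, alpha); the value here is illustrative.
    """
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

wave_a, wave_b = rng.normal(size=16000), rng.normal(size=16000)  # 1 s at 16 kHz
label_a = np.array([1.0, 0, 0, 0, 0, 0, 0])  # one-hot over 7 classes
label_b = np.array([0, 0, 1.0, 0, 0, 0, 0])
x_mix, y_mix = mixup(wave_a, label_a, wave_b, label_b)
```

The mixed labels become soft targets, which pairs naturally with the label-smoothed cross-entropy used during training.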
Data Table — SG-ECMT
| Language | Samples | Labeling Method |
|---|---|---|
| English | 27,458 | Pseudo-labeled |
| Chinese | 14,212 | Pseudo-labeled |
| Malay | 14,370 | Pseudo-labeled |
| Tamil | 10,169 | Pseudo-labeled |
4. Training Protocol and Hyperparameter Selection
The training pipeline keeps the Whisper encoder weights frozen, optimizing only the LoRA adapters and downstream modules. AdamW is used with two parameter groups, the LoRA adapters and the downstream modules, each assigned its own learning rate and weight decay.
A cosine annealing schedule is combined with linear warm-up over the first 8% of steps. Batch size is 32 with 15 total epochs and early stopping on the dev-set categorical loss. Training used a single node with 8× NVIDIA H100 GPUs.
This careful division of learning rates and decoupling of parameter updates is calibrated to facilitate efficient adaptation while mitigating overfitting risks, particularly given multilingual and class-imbalanced data.
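The warm-up-then-cosine schedule described above can be sketched as a pure function of the step index (the base learning rate below is a placeholder, since the per-group rates are not reproduced in this summary):

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_frac=0.08):
    """Linear warm-up for the first warmup_frac of steps, then cosine annealing to 0.

    base_lr is a placeholder; the paper's per-group learning rates are not
    reproduced in this summary.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps           # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay
```

The ramp avoids large adapter updates while statistics are still noisy, and the cosine tail anneals smoothly toward zero by the final step.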
5. Evaluation Metrics and Empirical Results
5.1 Metrics
- Categorical: Unweighted Average Recall (UAR), equivalent to balanced accuracy; overall accuracy and F1 are reported in supplementary materials.
- Dimensional: Concordance Correlation Coefficient (CCC) for arousal, valence, dominance.
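UAR averages per-class recall, so majority-class guessing is not rewarded on imbalanced emotion sets. A minimal numpy sketch:

```python
import numpy as np

def uar(y_true, y_pred, n_classes):
    """Unweighted average recall: mean of per-class recalls (balanced accuracy)."""
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():                       # skip classes absent from y_true
            recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))

y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])           # always predicting the majority class
score = uar(y_true, y_pred, n_classes=2)
```

Here plain accuracy would be 0.8, but UAR is 0.5 (recall 1.0 on class 0, 0.0 on class 1), which is why it is the headline metric for class-imbalanced SER benchmarks.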
5.2 Multilingual Performance
SG-ECMT (Singapore languages, UAR %)
| Setting | 7-class 2 s | 7-class merged | 4-class 2 s | 4-class merged |
|---|---|---|---|---|
| MERaLiON-SER-v1 | 53.9 | 60.2 | 65.1 | 70.0 |
| rel. to emotion2vec-seed | +4.9 | +2.3 | +7.1 | +4.3 |
Public Benchmarks (UAR %)
- English (MSP, IEMOCAP, MELD): 64–70 %
- Chinese (M3ED), Indonesian (IndoWaveSentiment): 57–60 %
- Outperforms emotion2vec-seed by 4–6 UAR points.
- Audio-LLMs (MERaLiON-10B, SeaLLMs-Audio-7B) trail by 8–12 points.
- Proprietary multimodal LLMs (GPT-4o-Audio, Gemini-2.5-Flash): competitive for merged segments, underperform for fine-grained segmentation.
Dimensional CCC (mean across datasets):
- Arousal: 0.68
- Valence: 0.65
- Dominance: 0.62
MERaLiON-SER demonstrates clear empirical advantages in multilingual and fine-grained segmentation settings.
6. Analysis, Implications, and Applications
The empirical superiority of speech-only architectures over multimodal or ASR-oriented LLMs for SER is substantiated by several design features. Unlike Audio-LLMs or Whisper-style encoders tuned for ASR that deprioritize emotional cues, MERaLiON-SER's paralinguistic specialization, multiscale attention pooling, and GroupNorm use in ECAPA-TDNN modules collectively support robust affective signal extraction. LoRA-based fine-tuning ensures that emotionally salient features are maintained and adapted without overfitting full model parameters.
Consistent state-of-the-art results across English, Mandarin, Malay, Tamil, and zero-shot transfer to Indonesian illustrate robust cross-lingual generalization. The architectural paradigm—using a multilingual acoustic backbone with emotion-optimized downstream heads—successfully bridges prosodic variation arising from cultural or linguistic differences.
A salient application pathway is the integration of MERaLiON-SER as an “emotion perception module” within agentic and multimodal audio systems. In this role, multimodal agents can condition dialogue policies or generation on arousal/valence/dominance predictions, enabling contextually adaptive and empathetic response behaviors. A plausible implication is that such integration could allow conversational agents to “reason through” emotional state jointly with semantic content, enhancing user engagement and understanding.
Future work is suggested to focus on the tight coupling of MERaLiON-SER embeddings within multimodal LLM reasoning loops, further advancing the breadth and robustness of affective reasoning in artificial agents.
Summary
MERaLiON-SER exemplifies a speech emotion recognition architecture explicitly tuned for cross-lingual robustness and fine-grained affect modeling. Its backbone—composed of a frozen Whisper encoder, LoRA adapters, and multiscale pooling—paired with hybrid categorical/dimensional training, delivers state-of-the-art performance on both categorical and dimensional tasks, outperforming larger and less specialized open-source and proprietary systems. The modeling, training, and evaluation framework underscores the centrality of specialized, parameter-efficient architectures for paralinguistic understanding in multilingual and culturally diverse environments (Sailor et al., 7 Nov 2025).