MERaLiON-SER: Robust Multilingual SER
- MERaLiON-SER is a family of speech emotion recognition models designed for multilingual and fine-grained affect modeling using both categorical and dimensional representations.
- It integrates a frozen Whisper-Medium encoder with LoRA adapters, multiscale attention pooling, and modified ECAPA-TDNN blocks to capture short-term prosody and long-term context.
- Its parameter-efficient fine-tuning and superior empirical performance across diverse datasets make MERaLiON-SER ideal for integration into multimodal, emotion-aware systems.
MERaLiON-SER is a family of robust speech emotion recognition (SER) models whose architecture, training, and evaluation have been specifically designed for both English and Southeast Asian (SEA) languages. MERaLiON-SER systems distinguish themselves through hybrid loss design—jointly optimizing categorical and dimensional affect representations—extensive multilingual benchmarking, and modular architectures that exploit parameter-efficient fine-tuning for speech-only tasks. The models demonstrate marked performance improvements over open-source SER systems and large-scale generic Audio-LLMs, especially in multilingual and fine-grained scenarios.
1. Model Architecture
The primary MERaLiON-SER architecture integrates a frozen Whisper-Medium encoder as its backbone for acoustic feature extraction. This backbone is augmented by Low-Rank Adaptation (LoRA) adapters inserted into each multi-head attention layer (key, query, value) for parameter-efficient fine-tuning. The downstream stack incorporates multiscale/hierarchical attention pooling layers to aggregate both short-term prosodic cues and long-term context. Modified ECAPA-TDNN blocks, which replace BatchNorm with GroupNorm for minibatch stability, further propagate, aggregate, and emphasize channel attention features.
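LoRA keeps a pretrained projection W frozen and learns only a low-rank update, so the adapted layer computes W + (α/r)·BA. As a minimal numpy sketch (the hidden size, rank, and scaling below are illustrative, not the paper's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 1024, 8                                   # hidden size and LoRA rank (illustrative)

W_frozen = rng.normal(scale=0.02, size=(D, D))   # pretrained projection, never updated
A = rng.normal(scale=0.01, size=(R, D))          # trainable down-projection
B = np.zeros((D, R))                             # trainable up-projection, zero-initialized
alpha = 16.0                                     # LoRA scaling factor (illustrative)

def lora_linear(x):
    """Frozen projection plus scaled low-rank update: x @ (W + (alpha/R) * B @ A).T"""
    return x @ W_frozen.T + (alpha / R) * (x @ A.T) @ B.T

x = rng.normal(size=(4, D))
out = lora_linear(x)
```

With B zero-initialized, the adapted layer starts out identical to the frozen one, and training touches only 2·D·R parameters per projection instead of D², which is what makes the fine-tuning parameter-efficient.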
The output layer employs two parallel “emotion heads”:
- a categorical head (linear + softmax) for discrete emotion classes,
- and a dimensional head (linear + sigmoid) for continuous arousal, valence, and dominance scores.
Architecture Flow
```
 Whisper-Medium Encoder
           ↓
      LoRA Adapters
           ↓
Multiscale Attention Pooling
           ↓
 ECAPA-TDNN with GroupNorm
          ↙  ↘
 Softmax Head    Sigmoid Head
 (Categorical)   (Dimensional)
```
The design emphasizes the capture and integration of both discrete and continuous representations, supporting comprehensive affect modeling. Multi-head outputs facilitate simultaneous optimization for classification and regression objectives.
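The dual-head output stage can be sketched in a few lines of numpy (layer sizes and random weights here are placeholders, not the model's actual parameters; the real encoder and ECAPA-TDNN stack are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: pooled embedding -> 7 emotion classes, 3 affect dimensions.
EMB_DIM, N_CLASSES, N_DIMS = 192, 7, 3
W_cat = rng.normal(scale=0.02, size=(EMB_DIM, N_CLASSES))
W_dim = rng.normal(scale=0.02, size=(EMB_DIM, N_DIMS))

def emotion_heads(pooled_embedding):
    """Parallel heads: categorical (linear + softmax), dimensional (linear + sigmoid)."""
    cat_probs = softmax(pooled_embedding @ W_cat)    # distribution over discrete emotions
    avd_scores = sigmoid(pooled_embedding @ W_dim)   # arousal/valence/dominance in (0, 1)
    return cat_probs, avd_scores

pooled = rng.normal(size=(2, EMB_DIM))  # stand-in for attention-pooled encoder output
cat, avd = emotion_heads(pooled)
```

Because the two heads share the pooled embedding, gradients from the classification and regression losses both shape the same upstream representation.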
2. Training Objectives
MERaLiON-SER employs a hybrid objective function to jointly perform discrete emotion classification and dimensional emotion regression, ensuring robust modeling along multiple axes of affect.
2.1 Weighted Categorical Cross-Entropy
To address class imbalance, the cross-entropy loss is class-weighted with inverse-frequency weights:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} w_c \, y_c \log \hat{y}_c, \qquad w_c \propto \frac{1}{N_c},$$

where $y_c$ and $\hat{y}_c$ are the ground-truth and predicted probabilities for class $c$, and $N_c$ is the training count of class $c$. Label smoothing is applied to regularize the targets.
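A minimal numpy sketch of this loss (the smoothing value `eps=0.1` is illustrative; the paper's smoothing coefficient is not reproduced in this summary):

```python
import numpy as np

def weighted_smoothed_ce(probs, labels, class_counts, eps=0.1):
    """Class-weighted cross-entropy with label smoothing.

    probs: (N, C) predicted probabilities; labels: (N,) integer class ids;
    class_counts: (C,) training-set counts used for inverse-frequency weights.
    eps is an illustrative smoothing value, not taken from the paper.
    """
    n, c = probs.shape
    weights = 1.0 / np.asarray(class_counts, dtype=float)
    weights = weights * c / weights.sum()        # normalize so weights average to 1
    targets = np.full((n, c), eps / (c - 1))     # smoothed off-target mass
    targets[np.arange(n), labels] = 1.0 - eps    # most mass stays on the true class
    per_sample = -(targets * np.log(probs + 1e-12)).sum(axis=1)
    return float((weights[labels] * per_sample).mean())

probs = np.array([[0.7, 0.2, 0.1]])
counts = [100, 10, 10]                 # one frequent class, two rare ones
loss_freq = weighted_smoothed_ce(probs, np.array([0]), counts)
loss_rare = weighted_smoothed_ce(probs, np.array([1]), counts)
```

Under inverse-frequency weighting, a misclassified rare-class sample contributes more to the loss than a frequent-class one with the same predicted distribution.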
2.2 Concordance Correlation Coefficient (CCC) Loss
For each continuous dimension (arousal, valence, dominance), the concordance correlation coefficient between targets $y$ and predictions $\hat{y}$ is

$$\mathrm{CCC} = \frac{2\rho\,\sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2},$$

where $\rho$ is the Pearson correlation and $\mu$, $\sigma$ denote per-dimension means and standard deviations. The corresponding loss is

$$\mathcal{L}_{\mathrm{CCC}} = 1 - \mathrm{CCC},$$

averaged over the three dimensions.
2.3 Combined Training Objective
$$\mathcal{L} = \lambda_{\mathrm{CE}}\,\mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{CCC}}\,\mathcal{L}_{\mathrm{CCC}},$$

where $\lambda_{\mathrm{CE}}$ and $\lambda_{\mathrm{CCC}}$ balance the classification and regression terms.
This joint objective ensures preservation of discrete emotion boundaries while capturing nuanced affective variation.
3. Datasets and Preprocessing
3.1 Training Corpora
- SG-ECMT (pseudo-labeled): 66,209 utterances across English (27,458), Chinese (14,212), Malay (14,370), and Tamil (10,169); labels via emotion2vec two-pass agreement, 10–30 s segments.
- SGTV (human-annotated): ∼117,000 utterances from Singapore TV/movies, containing English/Mandarin/code-mixed content; seven discrete labels.
- Public Datasets: CREMA-D, M3ED, ESD, MELD.
3.2 Evaluation Sets
- Manual SG-ECMT evaluation: 466–479 per language with majority-vote human labels.
- Benchmarks: MSP-Podcast, IEMOCAP (5-fold), MELD (English); M3ED (Chinese); IndoWaveSentiment (Indonesian).
3.3 Feature Extraction & Augmentation
- Audio: resampled to 16 kHz, 80-dim mel-spectrogram (25 ms window, 10 ms hop).
- Mean-variance normalization per utterance.
- On-the-fly augmentation: MixUp, MUSAN additive noise, speed perturbation (factors 0.9 and 1.1).
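MixUp forms a convex combination of two training waveforms and their label vectors, with the mixing coefficient drawn from a Beta distribution. A minimal sketch (the Beta parameter `alpha=0.2` is illustrative; the paper's MixUp parameters are not reproduced in this summary):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """MixUp: convex combination of two waveforms and their label vectors.

    alpha parametrizes Beta(alpha, alpha); the value here is illustrative.
    """
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

wave_a, wave_b = rng.normal(size=16000), rng.normal(size=16000)  # 1 s at 16 kHz
label_a = np.array([1.0, 0, 0, 0, 0, 0, 0])  # one-hot over 7 classes
label_b = np.array([0, 0, 1.0, 0, 0, 0, 0])
x_mix, y_mix = mixup(wave_a, label_a, wave_b, label_b)
```

The mixed labels become soft targets, which pairs naturally with the label-smoothed cross-entropy used during training.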
Data Table — SG-ECMT
| Language | Samples | Labeling Method |
|---|---|---|
| English | 27,458 | Pseudo-labeled |
| Chinese | 14,212 | Pseudo-labeled |
| Malay | 14,370 | Pseudo-labeled |
| Tamil | 10,169 | Pseudo-labeled |
4. Training Protocol and Hyperparameter Selection
The training pipeline keeps the Whisper encoder weights frozen, optimizing only the LoRA adapters and downstream modules. AdamW is used with two parameter groups, the LoRA adapters and the downstream modules, each assigned its own learning rate and weight decay.
A cosine annealing schedule is combined with linear warm-up over the first 8% of steps. Batch size is 32 with 15 total epochs and early stopping on the dev-set categorical loss. Training used a single node with 8× NVIDIA H100 GPUs.
This careful division of learning rates and decoupling of parameter updates is calibrated to facilitate efficient adaptation while mitigating overfitting risks, particularly given multilingual and class-imbalanced data.
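The warm-up-then-cosine schedule described above can be sketched as a pure function of the step index (the base learning rate below is a placeholder, since the per-group rates are not reproduced in this summary):

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_frac=0.08):
    """Linear warm-up for the first warmup_frac of steps, then cosine annealing to 0.

    base_lr is a placeholder; the paper's per-group learning rates are not
    reproduced in this summary.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps           # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay
```

The ramp avoids large adapter updates while statistics are still noisy, and the cosine tail anneals smoothly toward zero by the final step.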
5. Evaluation Metrics and Empirical Results
5.1 Metrics
- Categorical: Unweighted Average Recall (UAR), equivalent to balanced accuracy; overall accuracy and F1 are reported in supplementary materials.
- Dimensional: Concordance Correlation Coefficient (CCC) for arousal, valence, dominance.
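UAR averages per-class recall, so majority-class guessing is not rewarded on imbalanced emotion sets. A minimal numpy sketch:

```python
import numpy as np

def uar(y_true, y_pred, n_classes):
    """Unweighted average recall: mean of per-class recalls (balanced accuracy)."""
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():                       # skip classes absent from y_true
            recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))

y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])           # always predicting the majority class
score = uar(y_true, y_pred, n_classes=2)
```

Here plain accuracy would be 0.8, but UAR is 0.5 (recall 1.0 on class 0, 0.0 on class 1), which is why it is the headline metric for class-imbalanced SER benchmarks.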
5.2 Multilingual Performance
SG-ECMT (Singapore languages, UAR %)
| Setting | 7-class 2 s | 7-class merged | 4-class 2 s | 4-class merged |
|---|---|---|---|---|
| MERaLiON-SER-v1 | 53.9 | 60.2 | 65.1 | 70.0 |
| rel. to emotion2vec-seed | +4.9 | +2.3 | +7.1 | +4.3 |
Public Benchmarks (UAR %)
- English (MSP, IEMOCAP, MELD): 64–70 %
- Chinese (M3ED), Indonesian (IndoWaveSentiment): 57–60 %
- Outperforms emotion2vec-seed by 4–6 UAR points.
- Audio-LLMs (MERaLiON-10B, SeaLLMs-Audio-7B) trail by 8–12 points.
- Proprietary multimodal LLMs (GPT-4o-Audio, Gemini-2.5-Flash): competitive for merged segments, underperform for fine-grained segmentation.
Dimensional CCC (mean across datasets):
- Arousal: 0.68
- Valence: 0.65
- Dominance: 0.62
MERaLiON-SER demonstrates clear empirical advantages in multilingual and fine-grained segmentation settings.
6. Analysis, Implications, and Applications
The empirical superiority of speech-only architectures over multimodal or ASR-oriented LLMs for SER is substantiated by several design features. Unlike Audio-LLMs or Whisper-style encoders tuned for ASR that deprioritize emotional cues, MERaLiON-SER's paralinguistic specialization, multiscale attention pooling, and GroupNorm use in ECAPA-TDNN modules collectively support robust affective signal extraction. LoRA-based fine-tuning ensures that emotionally salient features are maintained and adapted without overfitting full model parameters.
Consistent state-of-the-art results across English, Mandarin, Malay, Tamil, and zero-shot transfer to Indonesian illustrate robust cross-lingual generalization. The architectural paradigm—using a multilingual acoustic backbone with emotion-optimized downstream heads—successfully bridges prosodic variation arising from cultural or linguistic differences.
A salient application pathway is the integration of MERaLiON-SER as an “emotion perception module” within agentic and multimodal audio systems. In this role, multimodal agents can condition dialogue policies or generation on arousal/valence/dominance predictions, enabling contextually adaptive and empathetic response behaviors. A plausible implication is that such integration could allow conversational agents to “reason through” emotional state jointly with semantic content, enhancing user engagement and understanding.
Future work is suggested to focus on the tight coupling of MERaLiON-SER embeddings within multimodal LLM reasoning loops, further advancing the breadth and robustness of affective reasoning in artificial agents.
Summary
MERaLiON-SER exemplifies a speech emotion recognition architecture explicitly tuned for cross-lingual robustness and fine-grained affect modeling. Its backbone—composed of a frozen Whisper encoder, LoRA adapters, and multiscale pooling—paired with hybrid categorical/dimensional training, delivers state-of-the-art performance on both categorical and dimensional tasks, outperforming larger and less specialized open-source and proprietary systems. The modeling, training, and evaluation framework underscores the centrality of specialized, parameter-efficient architectures for paralinguistic understanding in multilingual and culturally diverse environments (Sailor et al., 7 Nov 2025).