UTokyo-SaruLab MOS Prediction System

Updated 17 November 2025
  • The paper introduces UTMOS, a stacked ensemble that integrates SSL models and classical regressors to achieve top performance in predicting subjective mean opinion scores (MOS).
  • UTMOS employs a two-branch architecture, using strong learners from wav2vec 2.0 and diverse weak learners to capture detailed frame-level features and robust utterance representations.
  • Key innovations include listener-dependent embeddings, contrastive loss, and multi-stage stacking, which collectively enhance cross-domain MOS prediction and ranking accuracy.

The UTokyo-SaruLab Mean Opinion Score System (UTMOS) is an automatic mean opinion score prediction ensemble developed for the VoiceMOS Challenge 2022 by the UTokyo-SaruLab group. UTMOS combines fine-tuned self-supervised learning (SSL) speech models with classical machine learning regressors in a stacked ensemble to predict subjective speech quality scores (MOS) across both in-domain and out-of-domain (OOD) test conditions. By integrating strong frame-level models with diverse SSL feature-based regressors through multi-stage stacking, and by incorporating listener and phoneme representations, UTMOS achieved the highest marks on key metrics in both English and Chinese tracks of the competition.

1. System Architecture and Workflow

UTMOS employs a two-branch ensemble architecture constructed from “strong learners” and “weak learners.” The design supports both utterance-level and system-level MOS prediction. The processing workflow is as follows:

  • Input Preprocessing: Audio is resampled to 16 kHz and volume normalized.
  • Strong Learners: These are end-to-end models using a pretrained SSL backbone (wav2vec 2.0 base) that operates directly on raw audio, outputs frame-level features, and predicts a frame-wise MOS. Frame predictions are averaged to generate the utterance-level MOS estimate.
  • Weak Learners: SSL models (wav2vec2, HuBERT, WavLM) generate frame embeddings, which are mean-pooled to produce utterance vectors. These vectors are scored via lightweight regressors—ridge regression, SVR, kernel SVR, random forest, LightGBM, and Gaussian process regression.
  • Stacked Generalization: A three-stage stacking procedure aggregates predictions from all strong and weak models by training meta-learners (using the same regressor suite) over out-of-fold base predictions. Optionally, additional stacking can further smooth predictions.

This robust ensembling exploits both the detailed modeling capacity of deep neural strong learners and the diversity/robustness of weak learners based on statistical regression.
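
As a minimal illustration of the two branches, the following Python sketch uses random arrays in place of actual SSL frame features and a placeholder linear head instead of the fine-tuned wav2vec 2.0 backbone and BLSTM; it only shows how frame-wise scores are averaged into an utterance-level MOS and how frames are mean-pooled into the utterance vectors consumed by the weak learners.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frame-level SSL features of one utterance: (T frames, D dims).
# In UTMOS these come from a fine-tuned wav2vec 2.0 backbone (strong learners)
# or frozen SSL models (weak learners); random values here are purely illustrative.
frames = rng.normal(size=(240, 768))

def frame_scoring_head(x):
    # Placeholder linear head standing in for the 2-layer BLSTM + linear output.
    w = rng.normal(size=x.shape[1]) / np.sqrt(x.shape[1])
    return x @ w  # (T,) frame-wise MOS predictions

# Strong-learner branch: average frame-wise predictions into the utterance MOS.
utterance_mos_strong = frame_scoring_head(frames).mean()

# Weak-learner branch: mean-pool frames into one utterance vector, which is
# then scored by a classical regressor (ridge, SVR, LightGBM, ...).
utterance_embedding = frames.mean(axis=0)  # (768,)

print(utterance_mos_strong, utterance_embedding.shape)
```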

2. Strong Learner Details

Strong learners are built using the wav2vec 2.0 base model pretrained on LibriSpeech, fine-tuned to regress MOS at the frame level:

  • Frame-Scoring Head: Consists of a 2-layer bidirectional LSTM (256 hidden units) and a final linear output to produce scalar frame scores.
  • Loss Functions (see the PyTorch sketch after this list):
    • Clipped MSE: $L^{reg}(y,\hat{y}) = \mathbb{1}_{|y-\hat{y}|>\tau}\,(y-\hat{y})^2$, where $\tau = 0.25$.
    • Pairwise Contrastive Loss: $L^{con}_{x_i,x_j} = \max\bigl(0,\, |(y_i - y_j) - (\hat{y}_i - \hat{y}_j)| - \alpha\bigr)$, with margin $\alpha = 0.5$.
    • Total Loss: $L = \beta L^{reg} + \gamma \sum_{i \neq j} L^{con}_{x_i,x_j}$, with $\beta = 1$ and $\gamma = 0.5$.
  • Listener- and Domain-aware Embeddings: Each training example is conditioned on a 128-dimensional listener embedding and a 128-dimensional domain embedding, which are concatenated to the BLSTM input. A “mean listener” embedding is substituted for unknown listeners at inference.
  • Phoneme Encoder: An auxiliary BLSTM encodes both the ASR-derived and “reference” phoneme sequences (the reference is estimated via DBSCAN clustering on Levenshtein distances; see the sketch at the end of this section). The start and end hidden states (2×256) are concatenated to the SSL features at every frame.
  • Data Augmentation: WavAugment is used for random speed perturbation within $[0.9, 1.1]$ and pitch shifting by $\pm 300$ cents.
  • External Label Incorporation: For the OOD track, 540 Chinese utterances were MOS-rated by 32 human listeners and appended to training.
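
A hedged PyTorch sketch of the two losses follows the formulas above; the toy batch and the reduction choices (mean for the clipped MSE, plain sum over pairs for the contrastive term) are assumptions made for illustration, since UTMOS applies the clipped term at the frame level.

```python
import torch

def clipped_mse(y_true, y_pred, tau=0.25):
    # Squared error counts only where |y - y_hat| exceeds the threshold tau.
    diff = y_true - y_pred
    mask = (diff.abs() > tau).float()
    return (mask * diff.pow(2)).mean()

def pairwise_contrastive(y_true, y_pred, alpha=0.5):
    # Penalize pairs whose predicted score difference deviates from the true
    # difference by more than the margin alpha.
    d_true = y_true.unsqueeze(0) - y_true.unsqueeze(1)  # (B, B) true differences
    d_pred = y_pred.unsqueeze(0) - y_pred.unsqueeze(1)  # (B, B) predicted differences
    loss = torch.clamp((d_true - d_pred).abs() - alpha, min=0.0)
    off_diag = ~torch.eye(len(y_true), dtype=torch.bool)  # exclude i == j pairs
    return loss[off_diag].sum()

def total_loss(y_true, y_pred, beta=1.0, gamma=0.5):
    return beta * clipped_mse(y_true, y_pred) + gamma * pairwise_contrastive(y_true, y_pred)

# Toy batch of MOS targets and predictions.
y = torch.tensor([3.2, 4.1, 2.7, 4.8])
y_hat = torch.tensor([3.0, 4.4, 3.1, 4.5])
print(total_loss(y, y_hat))
```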

Main architectural novelties compared to prior SSL-based MOS models include listener-dependent adaptation, explicit phoneme information, and the use of a contrastive loss oriented towards better ranking (SRCC).
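
The “reference” phoneme estimation can be sketched as follows: ASR phoneme sequences for utterances of the same text are clustered with DBSCAN over a pairwise Levenshtein-distance matrix, and a representative of the largest cluster is used as the reference. The eps and min_samples values and the medoid-based representative below are illustrative assumptions, not settings reported for UTMOS.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a, b):
    # Standard dynamic-programming edit distance between two phoneme sequences.
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1, dp[i - 1, j - 1] + cost)
    return dp[len(a), len(b)]

def estimate_reference_phonemes(phoneme_seqs, eps=2.0, min_samples=2):
    # Cluster ASR phoneme transcriptions of utterances sharing the same text and
    # return a representative (medoid) of the largest cluster as the reference.
    n = len(phoneme_seqs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(phoneme_seqs[i], phoneme_seqs[j])
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(dist)
    valid = labels[labels >= 0]
    if len(valid) == 0:  # all sequences treated as noise
        return phoneme_seqs[0]
    largest = np.bincount(valid).argmax()
    idx = np.where(labels == largest)[0]
    medoid = idx[np.argmin(dist[np.ix_(idx, idx)].sum(axis=1))]
    return phoneme_seqs[medoid]

seqs = [["HH", "AH", "L", "OW"], ["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"], ["B", "AY"]]
print(estimate_reference_phonemes(seqs))
```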

3. Weak Learners and Meta-Ensembling

Weak learners exploit the diversity of SSL feature backbones and classical regressors:

  • SSL Feature Extraction: Frame embeddings are extracted from wav2vec 2.0, HuBERT, and WavLM checkpoints, eight distinct SSL models in total.
  • Pooling and Regression: Mean pooling over frames yields the utterance representation, which is then input to six regressors—ridge, linear SVR, kernel SVR, random forest, LightGBM, and Gaussian process regression. All models are trained on the utterance-level MOS.
  • Cross-Domain Structure: For OOD evaluation, weak learners are trained separately for each data domain (main, OOD, external), resulting in up to 144 weak-learner models.

The three-stage stacking process consolidates all predictions. Stage 2 meta-learners, built with the same regressor suite and trained on out-of-fold base outputs, produce the final MOS estimate. In the weighted-ensemble view, $\hat{y}_{\text{final}} = \sum_{k=1}^{K} w_k \hat{y}_k$ with $\sum_{k} w_k = 1$, where the weights are learned by the Stage 2 model.
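
A simplified two-stage version of this stacking can be sketched with scikit-learn under stated assumptions: random features stand in for mean-pooled SSL embeddings, only three of the six regressors are shown, and the hyperparameters are illustrative rather than the Optuna-tuned values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Stand-ins for mean-pooled SSL utterance embeddings and their MOS labels.
X = rng.normal(size=(500, 256))
y = rng.uniform(1.0, 5.0, size=500)

# Base (weak) learners: a subset of the regressor suite named above.
base_models = {
    "ridge": Ridge(alpha=1.0),
    "svr": SVR(kernel="rbf"),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Stage 1: out-of-fold predictions become meta-features, so the meta-learner
# never sees a base prediction made on that base model's own training fold.
oof = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models.values()]
)

# Stage 2: a meta-learner (ridge here, as an illustrative choice) combines the
# base predictions; its coefficients play the role of the ensemble weights w_k.
meta = Ridge(alpha=1.0).fit(oof, y)

# Refit the base models on all data, then score a new utterance embedding.
for model in base_models.values():
    model.fit(X, y)
x_new = rng.normal(size=(1, 256))
stage1 = np.column_stack([m.predict(x_new) for m in base_models.values()])
print(meta.predict(stage1))
```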

4. Training Data, Hyperparameters, and Implementation

  • VoiceMOS Challenge Data:
    • In-domain (Main, English): 4,974 utterances, 175 systems, 39,792 ratings for training. Development: 1,066, Test: 1,066 utterances.
    • OOD (Chinese): 136 labeled utterances (plus 540 unlabeled), 1,848 training ratings. The 540 unlabeled utterances were additionally rated by 32 native listeners (~2 ratings per utterance) and incorporated into training.
  • Preprocessing: MOS targets are normalized to $[-1, 1]$ for strong learner training.
  • Optimization: Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.99$), batch size 12, gradient accumulation of 2, 15,000 steps, 4,000-step linear warmup followed by linear decay.
  • Frameworks: Strong learners in PyTorch/fairseq, weak/meta learners in scikit-learn, LightGBM, GPyTorch; hyperparameter tuning via Optuna.
  • Hardware: Training was performed on GPU(s), roughly 4–6 hours per strong learner with additional time for cross-validation and stacking.
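
The schedule can be set up with standard PyTorch components; the learning rate and the dummy parameter below are assumptions for the sake of a runnable sketch, while the betas, step count, and warmup length follow the values listed above.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Dummy parameter standing in for the SSL backbone + BLSTM head parameters.
params = [torch.nn.Parameter(torch.zeros(10))]

optimizer = Adam(params, lr=2e-5, betas=(0.9, 0.99))  # lr is an assumed value

total_steps, warmup_steps = 15_000, 4_000

def lr_lambda(step):
    # Linear warmup over the first 4,000 steps, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # ... compute the loss on a batch of 12 with gradient accumulation of 2,
    # then call loss.backward() before stepping ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```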

5. Evaluation Metrics and Challenge Results

Performance was assessed at both utterance and system level using:

  • Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2$
  • Pearson Linear Correlation (LCC)
  • Spearman Rank Correlation (SRCC)
  • Kendall’s Tau (KTAU)
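
A minimal helper for computing these four metrics with NumPy and SciPy; the toy values are only for illustration, and system-level scores would first average labels and predictions per system.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def mos_metrics(y_true, y_pred):
    # Utterance-level metrics; for system level, average per system first.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return {
        "MSE": float(np.mean((y_true - y_pred) ** 2)),
        "LCC": float(pearsonr(y_true, y_pred)[0]),
        "SRCC": float(spearmanr(y_true, y_pred)[0]),
        "KTAU": float(kendalltau(y_true, y_pred)[0]),
    }

print(mos_metrics([3.1, 4.2, 2.5, 4.8], [3.0, 4.0, 2.9, 4.6]))
```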

Summary of challenge results (Team T17):

Track     | Utt-MSE | Utt-SRCC | Sys-MSE | Sys-SRCC
----------|---------|----------|---------|---------
Main (EN) | 0.165   | 0.897    | 0.090   | 0.936
OOD (ZH)  | 0.162   | 0.893    | 0.030   | 0.988

UTMOS obtained the best result on all but one metric, placing first or very close to first on the key utterance- and system-level metrics of both the Main and OOD tracks.

6. Ablation Studies and Insights

Systematic ablations on UTMOS strong learners demonstrated the contribution of individual components:

  • Listener-dependent modeling: Removing listener embeddings produced the steepest performance drop on the OOD track, confirming their importance for domain adaptation.
  • Contrastive loss: Enhanced ranking correlation; contrastive-only settings still maintained high SRCC.
  • Phoneme encoder: Provided modest but consistent gains, especially in OOD.
  • Data augmentation and external labels: Substantially improved results in low-resource OOD settings.
  • Model stacking: Simple ensembling of strong and weak learners with meta-learners consistently reduced MSE while maintaining high SRCC; even weak learners alone reached SRCC > 0.88.

This suggests that feature aggregation, SSL backbone diversity, and ensembling outweigh reliance on any single model architecture, especially for cross-domain MOS prediction.

7. Extension Opportunities and Recommendations

The modularity of UTMOS enables extension to new languages and evaluation domains via:

  • Addition of new language/domain embeddings for listener and domain adaptation.
  • Incorporation of additional SSL models (e.g., larger variants of WavLM, data2vec).
  • Targeted collection of MOS ratings in the target domain.
  • Advanced meta-learners or Bayesian ensembling for final MOS synthesis.

A plausible implication is that aggregating listening test data from diverse languages and conditions with domain- and listener-aware design could yield general-purpose automatic MOS prediction systems. Future work may pursue expanded pooling of listening data, more expressive embeddings, and further ranking-aware training objectives.
