UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
Abstract: We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there is less labeled data from different listening tests. Our system is based on ensemble learning of strong and weak learners. Strong learners incorporate several improvements to the previous fine-tuning models of self-supervised learning (SSL) models, while weak learners use basic machine-learning methods to predict scores from SSL features. In the Challenge, our system had the highest score on several metrics for both the main and OOD tracks. In addition, we conducted ablation studies to investigate the effectiveness of our proposed methods.
Practical Applications
Overview
Based on the UTMOS system for predicting mean opinion scores (MOS) of synthetic speech, the paper's findings and methods (ensemble learning with strong and weak learners, contrastive loss, listener-dependent modeling, phoneme encoding, data augmentation, and domain-aware stacking) enable concrete applications in product quality assurance, research acceleration, multilingual deployment, and governance of AI-generated audio.
Below are actionable use cases grouped into Immediate Applications (deployable now) and Long-Term Applications (requiring further research, scaling, or development). Each item includes sector links, potential tools/workflows, and assumptions or dependencies that affect feasibility.
Immediate Applications
- Automated TTS quality assurance in CI/CD
- Sectors: software, media, customer service
- Tool/workflow: Integrate a UTMOS-style MOS predictor as an API or CI plugin to gate releases; use ensemble stacking for robust scores; track SRCC/MSE across builds and systems (see the gate sketch after this item)
- Assumptions/dependencies: Domain-specific calibration data; availability of pretrained SSL models (e.g., wav2vec2.0, HuBERT, WavLM); compute resources for inference; acceptable correlation to human MOS in the target domain
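A minimal sketch of such a CI gate, assuming a hypothetical `utmos_predict` wrapper around a UTMOS-style checkpoint; the `MOS_THRESHOLD` value is illustrative and would be tuned per domain:

```python
# Minimal CI quality-gate sketch. `utmos_predict` is a hypothetical
# wrapper around a UTMOS-style ensemble; the threshold is illustrative.
import sys
from pathlib import Path

MOS_THRESHOLD = 3.8  # hypothetical release gate, tuned per domain

def utmos_predict(wav_path: Path) -> float:
    """Placeholder for an ensemble MOS predictor (strong + weak learners)."""
    raise NotImplementedError("plug in your UTMOS-style model here")

def main(audio_dir: str) -> int:
    scores = [utmos_predict(p) for p in sorted(Path(audio_dir).glob("*.wav"))]
    mean_mos = sum(scores) / len(scores)
    print(f"predicted system-level MOS: {mean_mos:.3f}")
    # Fail the build if predicted quality regresses below the gate.
    return 0 if mean_mos >= MOS_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```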
- A/B benchmarking for speech conversion/vocoder models
- Sectors: academia, research labs, speech tech vendors
- Tool/workflow: Batch-compute MOS for model variants; use contrastive-loss-inspired ranking to detect regressions, since pairwise comparisons are sensitive to rank metrics such as SRCC and Kendall's tau (see the metric sketch after this item)
- Assumptions/dependencies: Comparable test sets; consistent preprocessing (16 kHz, volume normalization); stacking benefits rely on model diversity
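A sketch of the metric side of this workflow using `scipy.stats`; the score arrays are random stand-ins for per-utterance predictions from two model variants:

```python
# Compare two model variants with rank-focused metrics (SRCC, Kendall)
# plus MSE. The arrays are illustrative stand-ins for real predictions.
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)
human = rng.uniform(1, 5, size=50)           # reference ratings (if available)
mos_a = human + rng.normal(0, 0.3, size=50)  # variant A predictions
mos_b = human + rng.normal(0, 0.6, size=50)  # variant B predictions

for name, pred in [("A", mos_a), ("B", mos_b)]:
    srcc, _ = spearmanr(human, pred)
    ktau, _ = kendalltau(human, pred)
    print(f"variant {name}: SRCC={srcc:.3f}  Kendall={ktau:.3f}  "
          f"MSE={np.mean((human - pred) ** 2):.3f}")
```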
- VoIP/call-center synthetic agent QoE monitoring
- Sectors: telecom, finance, customer support
- Tool/workflow: Real-time or near-real-time MOS estimation at utterance and system levels; dashboards with SRCC/MSE trends; alerts on quality dips (a rolling-alert sketch follows this item)
- Assumptions/dependencies: Latency constraints and model compression for streaming; domain mismatch handling; listener-dependent training replaced by mean-listener inference
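A rolling-alert sketch, assuming utterance-level MOS predictions arrive as a stream; the window size and alert threshold are illustrative:

```python
# Near-real-time QoE monitoring: rolling mean over recent utterance-level
# MOS predictions with a simple dip alert. Window/threshold are assumptions.
from collections import deque

class MosMonitor:
    def __init__(self, window: int = 50, alert_below: float = 3.5):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below

    def observe(self, utterance_mos: float) -> None:
        self.scores.append(utterance_mos)
        rolling = sum(self.scores) / len(self.scores)
        if len(self.scores) == self.scores.maxlen and rolling < self.alert_below:
            print(f"ALERT: rolling MOS {rolling:.2f} below {self.alert_below}")

monitor = MosMonitor(window=3)
for mos in [4.1, 4.0, 3.2, 3.1]:  # stand-in predictions
    monitor.observe(mos)
```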
- Dataset curation and triage for TTS training
- Sectors: machine learning ops, data engineering
- Tool/workflow: Predict MOS for large audio corpora; filter low-quality utterances; prioritize annotation; balance systems and speakers by predicted quality (a triage sketch follows this item)
- Assumptions/dependencies: MOS predictor correlates with perceived quality for synthetic speech; robust cross-system generalization
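A triage sketch using illustrative percentile cut-offs; `predicted` stands in for corpus-level MOS predictions:

```python
# Percentile-based triage of a corpus by predicted MOS. The 10%/30%
# cut-offs (discard / review-for-annotation) are illustrative choices.
import numpy as np

predicted = np.random.default_rng(1).uniform(1, 5, size=1000)  # stand-in MOS
lo, mid = np.percentile(predicted, [10, 30])

discard = predicted < lo                        # likely unusable
review = (predicted >= lo) & (predicted < mid)  # prioritize for annotation
keep = predicted >= mid
print(f"keep={keep.sum()}  review={review.sum()}  discard={discard.sum()}")
```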
- Intelligibility QA via phoneme/reference mismatch detection
- Sectors: education (e-learning), audiobooks, content production
- Tool/workflow: Use the phoneme encoder with ASR-derived sequences and DBSCAN-based reference clustering to flag mismatches; report intelligibility risk per clip (a clustering sketch follows this item)
- Assumptions/dependencies: ASR accuracy (xlsr-53 or similar) for target language; repeated prompts or clustered content; language coverage and phoneme set alignment
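A clustering sketch assuming phoneme sequences have already been produced by ASR; the normalized Levenshtein distance and the `eps`/`min_samples` values are illustrative choices, not the paper's exact configuration:

```python
# DBSCAN over pairwise phoneme-sequence distances: clips that fall
# outside any cluster of repeated prompts are flagged as mismatch risks.
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

phonemes = ["h ə l oʊ", "h ə l oʊ", "h ɛ l oʊ", "g ʊ d b aɪ"]  # stand-ins
n = len(phonemes)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = levenshtein(phonemes[i], phonemes[j]) / max(len(phonemes[i]), len(phonemes[j]))
        dist[i, j] = dist[j, i] = d

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(dist)
flagged = [p for p, l in zip(phonemes, labels) if l == -1]  # outliers
print("flagged:", flagged)
```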
- Crowdsourcing-efficient MOS labeling to expand unlabeled data
- Sectors: academia, industry R&D
- Tool/workflow: Conduct small listening tests (e.g., ~2 ratings/utterance); calibrate externally collected scores to internal scales (see the sketch after this item); blend into training (semi-supervised)
- Assumptions/dependencies: Strong correlation between small-scale ratings and ground truth; culturally/language-appropriate listener pools; ethical crowdsourcing practices
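A calibration sketch assuming a small overlap set rated on both scales; a least-squares linear map is one simple choice:

```python
# Linear calibration of externally collected ratings onto the internal
# MOS scale via least squares. The overlap set below is a stand-in.
import numpy as np

external = np.array([2.0, 2.5, 3.0, 3.5, 4.0])  # external-scale scores
internal = np.array([2.3, 2.9, 3.3, 3.9, 4.4])  # same clips, internal scale

slope, intercept = np.polyfit(external, internal, deg=1)
calibrated = slope * external + intercept  # blend these into training
print(f"calibration: internal ≈ {slope:.2f} * external + {intercept:.2f}")
```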
- Multilingual voice quality evaluation with domain-aware stacking
- Sectors: localization, global product operations
- Tool/workflow: Train per-domain weak learners (main/OOD/external), stack predictions; include domain IDs and mean-listener embeddings to reduce bias across languages/tests (a stacking sketch follows this item)
- Assumptions/dependencies: Sufficient per-domain SSL features; phoneme encoders for target languages; labeled or partially labeled domain data
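A stacking sketch with a ridge meta-learner over weak-learner predictions plus a one-hot domain ID; the data and feature layout are illustrative, and the paper's final stacking stage may differ:

```python
# Domain-aware stacking: meta-learner over weak-learner predictions
# concatenated with a one-hot domain ID. All data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n = 200
weak_preds = rng.uniform(1, 5, size=(n, 3))   # 3 weak learners
domain_id = rng.integers(0, 2, size=n)        # 0=main, 1=OOD
domain_onehot = np.eye(2)[domain_id]
X = np.hstack([weak_preds, domain_onehot])
y = weak_preds.mean(axis=1) + rng.normal(0, 0.2, size=n)  # stand-in targets

stacker = Ridge(alpha=1.0).fit(X, y)
print("stacked predictions:", stacker.predict(X[:3]))
```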
- Robustness improvements via safe audio augmentation
- Sectors: research, model training
- Tool/workflow: Apply speaking-rate and pitch-shift augmentations (WavAugment) within tuned rate and pitch-cent ranges to stabilize training in low-data regimes (see the sketch after this item)
- Assumptions/dependencies: Augmentation ranges that preserve perceived MOS; pipeline supports augmentation; no unintended artifacts
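An augmentation sketch using torchaudio's sox bindings as a stand-in for WavAugment; the speed factor, cent value, and file path are illustrative, since it is the paper's tuned ranges that preserve perceived MOS:

```python
# Speaking-rate and pitch-shift augmentation via sox effects.
# Values and "sample.wav" are illustrative assumptions.
import torchaudio

waveform, sr = torchaudio.load("sample.wav")  # hypothetical 16 kHz clip

speed = 1.05  # illustrative speaking-rate factor
cents = 50    # illustrative pitch shift in cents
effects = [
    ["speed", f"{speed}"],  # change speaking rate
    ["rate", f"{sr}"],      # resample back to the original rate
    ["pitch", f"{cents}"],  # shift pitch by `cents`
]
augmented, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, sr, effects)
```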
- Regression test harness using rank-focused evaluation
- Sectors: software QA for audio systems
- Tool/workflow: Create pairwise comparison suites; leverage contrastive loss margins to detect sign inversions (wrong rank order) between versions (a loss sketch follows this item)
- Assumptions/dependencies: Stable test sets; pair selection strategy; margin hyperparameters calibrated to domain
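A sketch of a pairwise margin loss in the spirit of the paper's contrastive objective: pairs whose predicted score difference deviates from the ground-truth difference by more than a margin incur loss. The margin value is an illustrative assumption:

```python
# Pairwise contrastive margin loss: penalize pairs whose predicted
# score gap deviates from the true gap by more than a margin.
import torch

def contrastive_margin_loss(pred: torch.Tensor, target: torch.Tensor,
                            margin: float = 0.1) -> torch.Tensor:
    pred_diff = pred.unsqueeze(0) - pred.unsqueeze(1)      # all pairs
    target_diff = target.unsqueeze(0) - target.unsqueeze(1)
    return torch.clamp((pred_diff - target_diff).abs() - margin, min=0).mean()

pred = torch.tensor([3.2, 4.1, 2.5])
target = torch.tensor([3.0, 4.3, 2.6])
print(contrastive_margin_loss(pred, target))  # near zero => ranks preserved
```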
- Vendor certification and procurement due diligence for synthetic voice solutions
- Sectors: policy/compliance, procurement, media platforms
- Tool/workflow: Establish quality thresholds using system-level SRCC/MSE; require third-party MOS prediction reports in RFPs; audit OOD performance
- Assumptions/dependencies: Acceptance of automated MOS proxies; agreed-upon standards and reference datasets; periodic recalibration to context
- QC for speech restoration pipelines (noise suppression, dereverberation)
- Sectors: conferencing/meeting software, telecom
- Tool/workflow: MOS-based gates post-enhancement; continuous monitoring of pipeline health across versions and environments
- Assumptions/dependencies: MOS predictor trained or adapted to enhanced speech distributions; potential domain drift in real-world conditions
- Content platform quality gates for AI voice ads and announcements
- Sectors: advertising tech, transportation/retail PA systems
- Tool/workflow: Automated MOS check for uploaded synthetic audio; flag low-intelligibility pieces before publication
- Assumptions/dependencies: Platform-specific intelligibility criteria; multilingual ASR/phoneme support
Long-Term Applications
- General-purpose MOS prediction model across diverse domains and languages
- Sectors: software, standards bodies
- Tool/workflow: Large-scale training with domain IDs, listener-dependent modeling, phoneme encoders; public benchmarks beyond Blizzard/VC datasets
- Assumptions/dependencies: Expanded, diverse labeled datasets; robust cross-lingual ASR; ongoing bias audits
- Personalized MOS prediction aligned to target user cohorts
- Sectors: UX/localization, consumer products
- Tool/workflow: Listener-dependent embeddings tailored to demographic or preference profiles; adaptive calibration during inference
- Assumptions/dependencies: Privacy-preserving collection of listener attributes; models generalize beyond mean-listener; fairness considerations
- Real-time, on-device MOS inference for edge devices
- Sectors: robotics, IoT, wearables
- Tool/workflow: Compress strong learners (SSL+BLSTM) with quantization/distillation; enable local quality monitoring and adaptive voice adjustments
- Assumptions/dependencies: Hardware constraints; latency targets; accuracy retention under compression
- Regulatory standards for AI-generated voice quality and intelligibility
- Sectors: policy/regulation, media governance
- Tool/workflow: Standardized test suites and metrics (SRCC/Kendall/MSE) across domains; interoperability guidelines for MOS predictors
- Assumptions/dependencies: Broad stakeholder agreement; impartial validation; inclusion of accent/language fairness
- Closed-loop adaptive TTS that optimizes prosody to maximize predicted MOS
- Sectors: software, accessibility
- Tool/workflow: Use predicted MOS as feedback to adjust speaking rate/pitch/prosody within safe ranges; deploy auto-tuning for different content types (a tuning sketch follows this item)
- Assumptions/dependencies: Differentiable or iterative control over TTS parameters; robust generalization; user preference modeling
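A sketch of a simple, non-differentiable tuner that grid-searches rate and pitch offsets to maximize predicted MOS; `synthesize` and `predict_mos` are hypothetical hooks into a TTS engine and a MOS predictor:

```python
# Closed-loop prosody tuning by grid search over predicted MOS.
# Both hooks and the search ranges are illustrative assumptions.
import itertools

def synthesize(text: str, rate: float, pitch_cents: int):
    raise NotImplementedError("hook into your TTS engine")

def predict_mos(waveform) -> float:
    raise NotImplementedError("hook into a UTMOS-style predictor")

def tune(text: str):
    rates = [0.95, 1.0, 1.05]  # illustrative safe ranges
    pitches = [-50, 0, 50]     # cents
    best = max(itertools.product(rates, pitches),
               key=lambda rp: predict_mos(synthesize(text, *rp)))
    return best  # (rate, pitch) with the highest predicted MOS
```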
- Healthcare applications in speech assessment (e.g., dysarthria intelligibility proxies)
- Sectors: healthcare, assistive tech
- Tool/workflow: Adapt MOS predictors to natural impaired speech using phoneme encoders and domain-aware training; monitor therapy progress
- Assumptions/dependencies: Clinical validation; careful domain shift from synthetic to clinical speech; ethical and privacy safeguards
- Automated pronunciation grading in language learning
- Sectors: education, EdTech
- Tool/workflow: Phoneme-based encoding and mismatch analysis for learner speech; combine MOS-like scores with ASR alignment for feedback
- Assumptions/dependencies: Pedagogically valid scoring; robust multilingual phoneme models; fairness across accents
- Quality-based marketplaces and pricing for synthetic voice assets
- Sectors: finance, media platforms
- Tool/workflow: MOS-driven tiers for voice models/clips; buyers informed by standardized quality metrics
- Assumptions/dependencies: Trust in scoring and transparency; anti-gaming measures; cross-vendor comparability
- Cross-lingual intelligibility benchmarking services
- Sectors: localization, international education
- Tool/workflow: Multilingual phoneme encoders and ASR coverage; domain- and listener-aware evaluation across languages
- Assumptions/dependencies: Extensive language support; varying script/phoneme systems; labeled data for calibration
- Data-efficient MOS labeling frameworks (semi-supervised + contrastive learning)
- Sectors: ML research, data ops
- Tool/workflow: Blend small human ratings with unlabeled audio using ranking-based objectives; scale with stacking and weak learners
- Assumptions/dependencies: Reliable small-scale ratings; robust semi-supervised procedures; domain drift management