AudioMOS Challenge 2025
- The AudioMOS Challenge 2025 is an international competition benchmarking automatic prediction of human subjective quality ratings across diverse types of synthetic audio.
- It evaluates three tracks—text-to-music, universal aesthetics, and synthetic speech with multiple sampling rates—using metrics like SRCC and MSE.
- Innovative methods such as Gaussian-softened ordinal classification and sampling-rate-robust architectures significantly advanced perceptual audio evaluation.
The AudioMOS Challenge 2025 is the inaugural international competition targeting automatic subjective quality prediction for synthetic audio, encompassing text-to-music, universal audio aesthetics, and synthetic speech at multiple sampling rates. Designed for both academic and industrial research communities, the challenge sought to establish rigorous, listener-aligned benchmarks for machine learning models assessing audio generation systems. Its methodology, datasets, metrics, and outcomes mark a substantial advance beyond traditional objective evaluation techniques.
1. Background and Motivation
The development of generative audio models—including text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA) systems—has accelerated the need for reliable automatic evaluation methods that reflect human perception. Prior metrics such as Fréchet Audio Distance and objective distortion measures exhibit insufficient correlation with human mean opinion scores (MOS), often being sensitive to implementation details and lacking cross-modal validity.
The AudioMOS Challenge 2025 directly advances the field by focusing exclusively on predicting human subjective judgment across diverse synthetic audio types. This initiative extends the legacy of VoiceMOS (Huang et al., 2022), which previously highlighted challenges in generalization, domain adaptation, and fine-grained listener modeling, into broader generative audio domains and higher perceptual fidelity.
2. Challenge Structure and Datasets
The challenge comprised three distinct tracks, each representing a realistic and technically demanding evaluation scenario:
| Track | Domain & Evaluation Target | Dataset Design |
|---|---|---|
| 1 | Text-to-music: musical quality & textual alignment | MusicEval: 2,748 mono audio clips (16.62 h), 13,740 ratings, 384 prompts, 21 TTM/TTA systems |
| 2 | Universal aesthetics (PQ, PC, CE, CU axes) | AES-Natural: ~4,000 samples rated on 4 axes by 10 experts; test sets cover TTS (LibriTTS-P), TTA, and TTM outputs |
| 3 | Synthetic speech, multi-rate MOS prediction | 400 audio samples from TTS/vocoder/super-resolution systems at 16/24/48 kHz; 4 listening tests, 20 listeners |
- Track 1 measured both overall musical impression and alignment between machine-generated music and text prompts, rated by professional musicians on a 5-point Likert scale.
- Track 2 adopted the Meta Audiobox Aesthetics paradigm, using four axes: Production Quality (PQ), Production Complexity (PC), Content Enjoyment (CE), and Content Usefulness (CU). The test set encompassed speech, music, and mixed audio samples from various synthesis pipelines.
- Track 3 required predicting MOS across variable sampling rates (16, 24, 48 kHz), with ground-truth ratings acquired from both condition-constant and mixed-frequency human listening tests.
The evaluation focused on both utterance-level and aggregate system/condition-level performance. Primary metrics included Mean Squared Error (MSE), Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall's Tau (KTAU), with system-level SRCC as the principal ranking criterion (Huang et al., 1 Sep 2025).
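For concreteness, the following is a minimal sketch of how such system-level metrics can be computed with NumPy and SciPy, assuming utterance-level predictions and ratings paired with system identifiers. The function name and the simple per-system averaging are illustrative, not the challenge's official scoring script.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def system_level_metrics(pred, true, system_ids):
    """Aggregate utterance-level scores per system, then compute
    MSE, LCC, SRCC, and KTAU between the aggregated score vectors."""
    pred, true, system_ids = map(np.asarray, (pred, true, system_ids))
    systems = np.unique(system_ids)
    pred_sys = np.array([pred[system_ids == s].mean() for s in systems])
    true_sys = np.array([true[system_ids == s].mean() for s in systems])
    return {
        "MSE": float(np.mean((pred_sys - true_sys) ** 2)),
        "LCC": float(pearsonr(pred_sys, true_sys)[0]),
        "SRCC": float(spearmanr(pred_sys, true_sys)[0]),
        "KTAU": float(kendalltau(pred_sys, true_sys)[0]),
    }
```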
3. Evaluation Methodologies and Baselines
A rigorous evaluation protocol was imposed:
- Split: Data partitioned into 70%/15%/15% training/dev/test splits (Track 1).
- Metrics: System-level SRCC (crucial for deployment and benchmarking) was the primary selection metric, reflecting models’ abilities to order entire systems/runs by perceptual quality.
- Baselines: Each track was seeded with competitive baseline models.
- Track 1: Cross-modal encoders (e.g., HTSAT for audio, RoBERTa for text) with dual MLP regression heads using L1 loss.
- Track 2: Pretrained WavLM model with learnable layer aggregation, four independent MLPs for multi-axis regression, trained with both MAE and MSE losses (a minimal sketch of this head design follows the list).
- Track 3: SSL-MOS model (16 kHz input only) fine-tuned on MOS data; audio at other sampling rates was downsampled to 16 kHz for this baseline.
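As a rough illustration of the Track 2 baseline design described above, the PyTorch sketch below combines frozen SSL layer outputs through learnable softmax weights and regresses the four aesthetic axes with independent MLP heads. The module name, hidden sizes, and the assumption that WavLM hidden states are precomputed are choices made for the sketch, not the organizers' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAxisMOSHead(nn.Module):
    """Learnable weighted sum over SSL layers plus one MLP head per
    aesthetic axis (PQ, PC, CE, CU)."""

    def __init__(self, num_layers: int, hidden_dim: int, num_axes: int = 4):
        super().__init__()
        # One scalar weight per SSL layer, normalized with softmax.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))
            for _ in range(num_axes)
        ])

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, num_layers, time, hidden_dim) from a frozen WavLM.
        w = torch.softmax(self.layer_weights, dim=0).view(1, -1, 1, 1)
        pooled = (w * hidden_states).sum(dim=1).mean(dim=1)   # (batch, hidden_dim)
        return torch.cat([head(pooled) for head in self.heads], dim=-1)  # (batch, num_axes)

def combined_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # The baseline reportedly trains with both MAE and MSE terms.
    return F.l1_loss(pred, target) + F.mse_loss(pred, target)
```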
Innovative submissions frequently improved upon these by exploiting modern model architectures, multi-representation fusion, and rank-sensitive learning losses.
4. Innovations and Representative Systems
Key advances observed among top-ranked systems included:
- Ordinal Modeling Strategies: Several systems recast the regression problem as multi-class classification, discretizing the MOS range into bins and employing Gaussian label smoothing. For example, ASTAR-NTU's winning Track 1 system used a dual-branch architecture with pre-trained MuQ (audio) and RoBERTa (text) encoders, cross-attention for feature fusion, and a Gaussian-softened cross-entropy loss whose soft targets are $p_k \propto \exp\!\left(-\frac{(y - c_k)^2}{2\sigma^2}\right)$, with $y$ the true score and $c_k$ the bin centers (a sketch of this formulation follows this list). This resulted in system-level SRCC = 0.991 for music impression and 0.952 for text alignment, a 21–31% improvement over the baseline (Ritter-Gutierrez et al., 14 Jul 2025).
- Sampling Rate Robustness: Track 3 solutions addressed the limitations of fixed-rate SSL encoders. MambaRate (Kakoulidis et al., 16 Jul 2025) exploited pre-computed WavLM embeddings at various layer depths with a lightweight selective state space model and a Gaussian RBF output layer, yielding low-bias MOS estimates across 16/24/48 kHz. Another submission introduced a sampling-frequency-independent convolutional layer that generates consistent filters for any rate by deriving digital weights from a learnable analog filter function modeled with neural analog filters and random Fourier features (Nishikawa et al., 19 Jul 2025); a second sketch after this list illustrates this idea. Knowledge distillation from fixed-rate SSL models and listener-ID conditioning further boosted performance.
- Multi-Scale and Ensemble Modeling: Many teams fused multiple SSL representations (e.g., CLAP, WavLM, wav2vec 2.0) or ensembled distinct architectures (e.g., Transformers, GR-KAN, multi-scale CNNs), improving both overall and axis-specific metrics, especially for less objective axes such as content enjoyment and usefulness.
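To make the ordinal-classification formulation above concrete, here is a minimal PyTorch sketch that builds Gaussian-softened targets over MOS bins, trains with a soft-label cross-entropy, and decodes a continuous score as the expectation over bin centers. The bin count (20) and smoothing width ($\sigma = 0.25$) are illustrative assumptions; the winning system's exact hyperparameters may differ.

```python
import torch
import torch.nn.functional as F

# Hypothetical binning of the 1-5 MOS scale; actual bin count and
# smoothing width in the submitted system may differ.
BIN_CENTERS = torch.linspace(1.0, 5.0, steps=20)
SIGMA = 0.25

def gaussian_soft_targets(scores: torch.Tensor) -> torch.Tensor:
    """Replace a hard one-hot bin label with a Gaussian centered at the true score."""
    centers = BIN_CENTERS.to(scores.device)
    d = scores.unsqueeze(-1) - centers               # (batch, num_bins)
    logits = -0.5 * (d / SIGMA) ** 2
    return torch.softmax(logits, dim=-1)             # normalized soft targets p_k

def gaussian_softened_ce(pred_logits: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the predicted bin distribution and the soft targets."""
    targets = gaussian_soft_targets(scores)
    return -(targets * F.log_softmax(pred_logits, dim=-1)).sum(dim=-1).mean()

def decode_score(pred_logits: torch.Tensor) -> torch.Tensor:
    """Continuous MOS estimate as the expected value over bin centers."""
    probs = torch.softmax(pred_logits, dim=-1)
    return (probs * BIN_CENTERS.to(pred_logits.device)).sum(dim=-1)
```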
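The sampling-frequency-independent convolution can be sketched in a similar spirit: a continuous-time filter is parameterized with random Fourier features and a small MLP, and digital taps are obtained by sampling it at the input's own rate. Everything below (tap count, feature dimension, bandwidth, class name) is an assumed minimal construction, not the submitted system's architecture.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFIConv1d(nn.Module):
    """Minimal sampling-frequency-independent 1-D convolution: a learnable
    analog filter h(t), encoded with random Fourier features, is sampled at
    t = n / fs to produce digital taps for whatever rate the input uses."""

    def __init__(self, num_taps: int = 65, num_features: int = 64, max_freq_hz: float = 8000.0):
        super().__init__()
        self.num_taps = num_taps
        # Fixed random angular frequencies for the Fourier-feature encoding of time.
        self.register_buffer("omegas", 2 * math.pi * max_freq_hz * torch.rand(num_features))
        self.mlp = nn.Sequential(nn.Linear(2 * num_features, 64), nn.ReLU(), nn.Linear(64, 1))

    def analog_filter(self, t: torch.Tensor) -> torch.Tensor:
        # t: (num_taps,) in seconds -> filter amplitude at each time point.
        phase = t.unsqueeze(-1) * self.omegas            # (num_taps, num_features)
        feats = torch.cat([torch.sin(phase), torch.cos(phase)], dim=-1)
        return self.mlp(feats).squeeze(-1)               # (num_taps,)

    def forward(self, x: torch.Tensor, fs: int) -> torch.Tensor:
        # x: (batch, 1, samples) at sampling rate fs.
        n = torch.arange(self.num_taps, device=x.device, dtype=torch.float32) - self.num_taps // 2
        taps = self.analog_filter(n / fs)                # same analog filter, any fs
        kernel = taps.view(1, 1, -1)
        return F.conv1d(x, kernel, padding=self.num_taps // 2)
```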
These strategies led top Track 1 systems to exceed baseline performance by 20–30% in system-level SRCC for both musical quality and alignment, and enabled several Track 2 teams to outperform baseline models trained on comparably large data (Huang et al., 1 Sep 2025).
5. Analysis of Results and Error Patterns
Detailed results highlighted persistent challenges:
- Track 1: System-level ordering was accurately captured by several models, but utterance-level prediction remained more challenging, with lower absolute correlation for individual samples. Models incorporating Gaussian label smoothing and cross-attention fusion showed increased robustness to out-of-distribution prompts or genres.
- Track 2: The four-axis aesthetic framework exposed greater subjectivity in CE and CU scores, with higher inter-annotator variance and more difficulty for models to rank systems consistently compared to PQ and PC.
- Track 3: Conditions at 16 kHz, especially those involving super-resolution and neural vocoding, presented the highest absolute ranking errors. Downsampling for baseline models highlighted the importance of preserving high-frequency information. Sampling-frequency-independent layers substantially reduced error for mixed-rate prediction.
The table below summarizes primary error patterns and the most challenging evaluation conditions:
| Track | Difficult Scenarios | Noted Error Patterns |
|---|---|---|
| 1 | Out-of-domain genres/prompts, low-rated systems | Lower utterance-level SRCC; misalignment for rare prompt–music pairs |
| 2 | CE/CU axes, subjective judgments | High variance across annotators; inconsistent system-level ranks |
| 3 | 16 kHz, super-resolved, and neural-vocoder outputs | Higher ranking errors; performance drops on downsampled or regenerated audio |
6. Impact, Research Implications, and Future Directions
The AudioMOS Challenge 2025 validated the feasibility of human-aligned, automatic perceptual audio evaluation across multiple modalities and technical conditions. Its outcomes have several significant implications:
- Research Benchmarks: By curating and releasing standardized datasets (e.g., MusicEval, AES-Natural with axis-specific scores, multi-rate synthetic speech), the challenge establishes authoritative benchmarks for future model development (Huang et al., 1 Sep 2025).
- Methodological Advances: Gaussian-softened ordinal classification and sampling-frequency-robust encoding architectures demonstrated measurable improvements and are likely to become standard practice in perceptual model design.
- Deployment: The emphasis on system-level SRCC and rigorous, listener-derived benchmarks aligns model rankings with usage in quality monitoring, model selection, and real-world generative audio evaluation scenarios.
- Open Evaluation: Integration with public evaluation platforms such as CodaBench fosters reproducibility and broad adoption.
Persistent challenges—such as utterance-level prediction, robust out-of-domain generalization, and subjective axis modeling—remain active areas for research. The challenge signals a transition towards more nuanced, human-centric evaluation metrics and architectures, with anticipated follow-up studies on cross-modal fusion, listener-aware modeling, and adaptation to novel generative paradigms.
7. Relationship to Broader Audio and Speech Assessment Research
AudioMOS 2025 extends prior challenges and tasks in the audio evaluation and representation learning ecosystem:
- From VoiceMOS to AudioMOS: VoiceMOS (Huang et al., 2022) established approaches for self-supervised fine-tuning and domain adaptation in MOS prediction for synthetic speech, particularly emphasizing the need for robust modeling under unseen system and listener conditions. AudioMOS expands this paradigm to text-to-music and multi-rate audio, integrating more rigorous subjective axes.
- Alignment with Encoder Benchmarks: The focus on cross-modal and robust representation learning ties to trends in encoder capability benchmarking (Zhang et al., 25 Jan 2025), especially regarding generalizability in real-world and multi-task settings.
- Synergies with Multimodal and Reasoning Challenges: Methodological techniques and error analysis in AudioMOS inform wider challenges in audio-language reasoning (Yang et al., 12 May 2025) and multimodal diarization/recognition (Gao et al., 20 May 2025), particularly regarding evaluation methodologies and data curation.
The challenge serves as a catalyst for convergence between generative modeling, perceptual assessment, and universal audio understanding, accelerating methodological unification in the field of human-aligned audio AI.