AudioMOS Challenge 2025 Dataset
- AudioMOS Challenge 2025 Dataset is a comprehensive benchmark for objective assessment of audio quality and text alignment in music and speech.
- The dataset features controlled experiments with diverse audio types and detailed MOS annotations for both generative audio and speech evaluations.
- Technical innovations include multimodal fusion, ordinal-aware labeling with Gaussian smoothing, and sampling-frequency-independent feature extraction.
The AudioMOS Challenge 2025 Dataset is a comprehensive benchmark designed to support rigorous evaluation and development of automatic audio quality and alignment assessment systems. It serves as the core data resource for the AudioMOS Challenge 2025, which aims to facilitate reproducible, data-driven comparison among state-of-the-art models in tasks such as perceptual quality (Mean Opinion Score, MOS) prediction and multimodal alignment evaluation. The dataset covers diverse audio types—including music, speech, and multi-domain sounds—making it applicable to research in audio signal processing, machine learning, and human-computer interaction.
1. Dataset Scope and Core Tasks
The AudioMOS Challenge 2025 Dataset was constructed to address two primary challenges: objective prediction of music impression and text alignment (Track 1), and robust MOS prediction for speech across multiple sampling frequencies (Track 3) (Ritter-Gutierrez et al., 14 Jul 2025; Nishikawa et al., 19 Jul 2025). It includes curated and annotated audio clips generated under controlled experimental protocols to ensure statistical reliability and broad applicability. Tasks supported by the dataset are:
- Music Impression (MI) and Text Alignment (TA) Prediction: Evaluating the intrinsic quality of music and the alignment between a natural language prompt and the generated composition, targeted at text-to-music generation systems.
- Speech MOS Prediction: Estimating subjective naturalness of speech at 16, 24, and 48 kHz, including the challenge of model generalization to unseen sampling rates.
These tasks are specifically tailored to benchmark systems' capacity to generalize, capture perceptual phenomena, and integrate multimodal signals.
2. Dataset Construction and Annotation Protocol
The dataset comprises audio clips (music and speech) and text prompts, with detailed annotations collected by expert and non-expert listeners. The annotation and curation pipeline features:
- Audio Generation: For MI/TA, clips generated by leading text-to-music models, ensuring diversity in genre, instrumentation, timbre, and compositional complexity; for MOS, both synthetic and natural speech spanning a wide range of recording and synthesis conditions.
- Prompt Design: Human-created prompts used for alignment with generated audio to support text-to-music evaluation.
- Subjective Evaluation: All clips scored on a 1–5 MOS scale (with decimal precision), using standardized rating procedures and listener balancing for robust statistical estimation.
- Sampling Frequency Variation: Speech samples prepared at 16, 24, and 48 kHz, enabling evaluation of models' frequency-robustness.
This systematic approach enables accurate ground truth estimation and mitigates data biases, critical for downstream benchmarking.
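The sampling-frequency variation described above can be reproduced with standard audio tooling. The snippet below is an illustrative preprocessing sketch only (file names and the choice of librosa are placeholders, not the official challenge pipeline); it renders one recording at each challenge rate so a MOS predictor can later be probed for robustness to rates it was not trained on:

```python
import librosa
import soundfile as sf

# Illustrative only: render one recording at each challenge sampling rate so a
# MOS predictor can be evaluated on frequencies it may not have seen in training.
TARGET_RATES = [16_000, 24_000, 48_000]

wav, native_sr = librosa.load("speech_clip.wav", sr=None)  # keep the native rate
for target_sr in TARGET_RATES:
    resampled = librosa.resample(wav, orig_sr=native_sr, target_sr=target_sr)
    sf.write(f"speech_clip_{target_sr // 1000}k.wav", resampled, target_sr)
```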
3. Evaluation Metrics and Normalization
Challenge tracks leveraging the dataset employ multiple, rigorously defined metrics:
| Task | Primary Metric(s) | Scaling / Normalization |
|---|---|---|
| MI/TA (Music) | SRCC (system- and utterance-level), KTAU | Prediction over 20 MOS bins (1–5); Gaussian-softened label smoothing |
| Speech MOS | MSE, SRCC, LCC, KTAU (system- and utterance-level) | Target scores normalized to [−1, +1] |
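The agreement metrics in the table can be computed with standard scientific-Python tooling. The helper below is a minimal sketch (function and variable names are illustrative, not part of any challenge baseline) of utterance- and system-level scoring:

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def mos_metrics(y_true, y_pred, systems):
    """Utterance- and system-level agreement metrics commonly reported
    for MOS prediction: MSE, LCC, SRCC, and KTAU."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)

    def agree(t, p):
        return {
            "MSE": float(np.mean((t - p) ** 2)),
            "LCC": pearsonr(t, p)[0],
            "SRCC": spearmanr(t, p)[0],
            "KTAU": kendalltau(t, p)[0],
        }

    # Utterance level: compare every clip's prediction with its true score.
    utterance = agree(y_true, y_pred)

    # System level: average true and predicted scores per system, then compare.
    sys_ids = np.asarray(systems)
    uniq = np.unique(sys_ids)
    t_sys = np.array([y_true[sys_ids == s].mean() for s in uniq])
    p_sys = np.array([y_pred[sys_ids == s].mean() for s in uniq])
    return {"utterance": utterance, "system": agree(t_sys, p_sys)}
```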
Metrics are normalized to a common scale for cross-task aggregation, and final scores are computed as an aggregate over tasks weighted by test-set size, where $N_t$ denotes the test-set size for task $t$.
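The exact normalization and aggregation formulas are defined by the challenge organizers and are not reproduced here; as a hedged illustration consistent with the description above, the two pieces could take a form such as:

```latex
% Plausible sketch only, not the official challenge formulas.
% (i) Normalizing a 1--5 MOS target to the [-1, +1] range:
\tilde{y} = \frac{y - 3}{2}
% (ii) Aggregating per-task scores S_t weighted by test-set size N_t:
S = \frac{\sum_t N_t \, S_t}{\sum_t N_t}
```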
Binning and smoothing of MOS targets (1–5 mapped into 20 equal-width bins, softened with a Gaussian kernel) encode score ordinality, facilitating optimization for rank-based objectives (Ritter-Gutierrez et al., 14 Jul 2025).
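A minimal sketch of this binning-and-smoothing scheme is given below; the kernel width `sigma` is illustrative rather than a value reported by the challenge systems:

```python
import numpy as np

def gaussian_soft_labels(mos, n_bins=20, lo=1.0, hi=5.0, sigma=1.0):
    """Map a MOS value in [lo, hi] to a soft distribution over n_bins
    equal-width bins using a Gaussian kernel centred on the target bin."""
    edges = np.linspace(lo, hi, n_bins + 1)          # 20 equal-width bins over 1-5
    target = np.clip(np.digitize(mos, edges) - 1, 0, n_bins - 1)  # target bin index
    idx = np.arange(n_bins)
    weights = np.exp(-0.5 * ((idx - target) / sigma) ** 2)  # Gaussian over bin indices
    return weights / weights.sum()                   # normalise to a distribution

# Example: a MOS of 3.7 yields a soft label peaked at bin 13 of 20.
print(gaussian_soft_labels(3.7).round(3))
```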
4. Technical Contributions and Modeling Advances
The dataset’s construction has catalyzed several modeling innovations:
- Multimodal Fusion: The ASTAR-NTU system employs a dual-branch architecture (MuQ for audio, RoBERTa for text) with cross-attention to align modalities for TA prediction; transformer-based temporal modeling and attention pooling aggregate frame-level features for MI (Ritter-Gutierrez et al., 14 Jul 2025). A generic sketch of this fusion pattern follows the list.
- Ordinal-Aware Labeling: Reframing regression as classification over MOS bins with smooth target distributions preserves ranking information and aligns learning with evaluation metrics.
- Sampling-Frequency-Independent (SFI) Feature Extraction: For speech MOS, integration of SFI convolutional layers—realized via parameterized digital filters constructed from neural analog filters (using random Fourier features)—enables robust feature extraction across varying sampling rates (Nishikawa et al., 19 Jul 2025).
- Listener Conditioning: Listener identity is embedded and concatenated with features to enable personalized MOS predictions.
- Knowledge Distillation: Distillation from pretrained SSL models on fixed sampling rates to SFI-SSL students provides effective initializations without prohibitive computational cost (Nishikawa et al., 19 Jul 2025).
- Multi-Domain Generalizability: Frame- and utterance-level embeddings required by challenge tasks encourage encoders to produce transferable universal representations (Zhang et al., 25 Jan 2025).
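As promised above, here is a generic illustration of the dual-branch, cross-attention fusion pattern; placeholder linear projections stand in for the MuQ and RoBERTa encoders, and the module is a sketch rather than the ASTAR-NTU implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of dual-branch fusion: audio frames attend to text tokens, a
    transformer models the fused sequence, and attention pooling yields a score."""
    def __init__(self, audio_dim=1024, text_dim=768, d_model=256, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # e.g. MuQ frame features
        self.text_proj = nn.Linear(text_dim, d_model)     # e.g. RoBERTa token features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        self.pool_attn = nn.Linear(d_model, 1)            # attention-pooling weights
        self.head = nn.Linear(d_model, 1)                 # MOS / TA score head

    def forward(self, audio_feats, text_feats):
        a = self.audio_proj(audio_feats)                  # (B, T_audio, d)
        t = self.text_proj(text_feats)                    # (B, T_text, d)
        fused, _ = self.cross_attn(query=a, key=t, value=t)  # audio queries text
        fused = self.temporal(fused)                      # temporal modeling over frames
        w = torch.softmax(self.pool_attn(fused), dim=1)   # per-frame pooling weights
        clip = (w * fused).sum(dim=1)                     # attention-pooled embedding
        return self.head(clip).squeeze(-1)                # predicted score

# Usage with random placeholder features: 200 audio frames, 32 text tokens.
model = CrossAttentionFusion()
score = model(torch.randn(2, 200, 1024), torch.randn(2, 32, 768))
```

Using the audio branch as the attention query lets every frame be weighted by its relevance to the prompt before temporal modeling and pooling, which is the core of the alignment-prediction idea.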
These technical strategies are directly enabled by the dataset's structure and annotation protocol.
5. Real-World Applications and Broader Impacts
The design of the AudioMOS Challenge 2025 Dataset directly reflects real-world use cases:
- Objective Assessment of Generative Audio: Facilitates rapid, scalable, and reproducible evaluation of TTM (Text-to-Music) and TTS (Text-to-Speech) systems, reducing reliance on expert listeners (Ritter-Gutierrez et al., 14 Jul 2025).
- Multimodal System Integration: Enables benchmarking of audio encoders for downstream integration into LLMs and multimodal models where continuous audio embeddings are needed (Zhang et al., 25 Jan 2025).
- Industrial-Scale Evaluation: Inclusion of concealed, industry-sourced datasets (e.g., car interior sounds, subway announcements) provides evaluation under noisy, variable conditions akin to production deployments (Zhang et al., 25 Jan 2025).
- Transfer Learning and Clustering: The multi-task, multi-domain output structure supports unsupervised clustering and transfer learning scenarios, as well as applications in security, accessibility, and analytics.
A plausible implication is that high-fidelity, frequency-robust speech scoring models and multimodal music evaluators arising from this dataset stand to benefit commercial and research audio systems beyond the scope of the original challenge.
6. Analytical Findings and Ablation Insights
Analysis of models benchmarked on the dataset reveals several insights:
- Transformer-based sequential modeling combined with attention pooling yields the strongest performance for music-related impression tasks, outperforming alternative temporal or pooling modules (Ritter-Gutierrez et al., 14 Jul 2025).
- The ordinal-aware Gaussian label smoothing is critical for achieving state-of-the-art rank correlation metrics, as cross-entropy losses on one-hot bins underperform (Ritter-Gutierrez et al., 14 Jul 2025).
- SFI convolutional layers in SSL models enhance system-level correlation for multi-frequency speech MOS, but may reduce utterance-level performance in some configurations—suggesting further refinement is needed for fine-grained assessment (Nishikawa et al., 19 Jul 2025).
- Ensembling diverse model variants (DORA-MOS, CORAL-based, decoupled) via stacking with linear meta-learners provides near-optimal results without overfitting, as evidenced by Ridge Regression outperforming LightGBM on held-out data (Ritter-Gutierrez et al., 14 Jul 2025).
These findings, grounded in ablation and comparative studies, define current best practices and ongoing research questions for this problem domain.
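To make the stacking recipe above concrete, the sketch below blends held-out predictions from several base variants with a Ridge meta-learner; all data and model names here are stand-ins, not the challenge submissions:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Columns = held-out predictions from diverse base variants (e.g. DORA-MOS,
# CORAL-based, decoupled heads); rows = validation utterances. Random stand-ins.
base_preds_val = np.random.rand(500, 3) * 4 + 1     # predictions in [1, 5]
mos_val = np.random.rand(500) * 4 + 1               # ground-truth MOS in [1, 5]

meta = Ridge(alpha=1.0)                             # linear meta-learner
meta.fit(base_preds_val, mos_val)                   # learn blending weights

base_preds_test = np.random.rand(100, 3) * 4 + 1
final_scores = np.clip(meta.predict(base_preds_test), 1.0, 5.0)
```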
7. Position Among Audio Evaluation Benchmarks
The AudioMOS Challenge 2025 Dataset complements and advances past benchmarks by:
- Providing multifaceted, multi-task evaluation of both generative and discriminative audio modeling.
- Integrating rigorous statistical annotation methods for both expert-based and crowd-sourced ground truth.
- Encompassing diverse modalities and real-world tasks, moving beyond static or unimodal datasets toward challenges rooted in practical, high-impact scenarios.
It thereby provides a robust resource for the next generation of audio evaluation research, with its methodological advances likely to inform dataset construction and benchmarking standards for future challenges (Zhang et al., 25 Jan 2025; Ritter-Gutierrez et al., 14 Jul 2025; Nishikawa et al., 19 Jul 2025).