
AudioMOS 2025: Benchmark for T2M Evaluation

Updated 16 July 2025
  • The AudioMOS 2025 Challenge is an international benchmark for automatic MOS prediction in text-to-music generation, evaluating both Music Impression (MI) and Text Alignment (TA).
  • The winning system employs dual-branch encoders to process audio and text independently and a cross-attention mechanism to fuse them, achieving significant SRCC improvements over the baseline.
  • The challenge promotes scalable, expert-independent assessment, improving the reliability of evaluations of generative audio quality and semantic consistency.

The AudioMOS 2025 Challenge is an international benchmark centered on the automatic prediction of Mean Opinion Scores (MOS) for music and audio generated from text prompts, with a strong emphasis on both the inherent quality of the musical output—termed Music Impression (MI)—and the semantic alignment between the textual prompt and the generated audio—termed Text Alignment (TA). The challenge is motivated by the need for scalable, expert-independent systems capable of evaluating the fidelity and relevance of text-to-music generation, an increasingly important class of generative audio models.

1. Objectives and Scope

AudioMOS 2025 aims to advance methodologies for automated MOS prediction, focusing specifically on text-to-music (T2M) evaluation. The challenge introduces two core tasks:

  • Music Impression (MI): Predicting the perceived musical quality of synthesized audio without reference to the input prompt.
  • Text Alignment (TA): Estimating how well the generated audio matches the semantics, content, and emotional cues of the provided text prompt.

By formalizing these sub-tasks, AudioMOS 2025 catalyzes research on systems that jointly consider audio-text relationships and the complex, subjective nature of aesthetic music evaluation. The challenge explicitly seeks models that are robust, generalizable, and aligned with human perception, providing high-throughput alternatives to expert-based subjective evaluation paradigms (Ritter-Gutierrez et al., 14 Jul 2025).

2. Task Formulation and Evaluation Metrics

AudioMOS 2025 Track 1 adopts a formulation where the model receives a text prompt and its corresponding generated music, and outputs predictions for both MI and TA scores. The principal evaluation metric is Spearman's Rank Correlation Coefficient (SRCC) between predicted and ground-truth MOS values, computed system-wide. This metric reflects the alignment between model-based and human rankings, prioritizing ordinal consistency over absolute distance:

$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$

where $d_i$ is the difference in rank between the model prediction and the ground truth for sample $i$.
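
As a concrete illustration of this metric, the minimal Python sketch below computes SRCC from predicted and ground-truth MOS values using `scipy.stats.spearmanr`; the sample values are hypothetical and only illustrate the computation.

```python
import numpy as np
from scipy.stats import spearmanr

def srcc(predicted, ground_truth):
    """Spearman's rank correlation between predicted and human MOS values.

    With no tied ranks, this equals 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
    """
    rho, _ = spearmanr(predicted, ground_truth)
    return rho

# Hypothetical system-level MOS values (illustrative only).
predicted = np.array([3.1, 4.2, 2.8, 3.9, 4.6])
ground_truth = np.array([3.0, 4.5, 2.5, 3.7, 4.8])
print(f"SRCC = {srcc(predicted, ground_truth):.3f}")
```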

To handle the ordinal nature of MOS and mitigate the limitations of regression-based approaches, leading solutions reframe MOS prediction as a classification over discretized, ordinal score bins, incorporating Gaussian label smoothing so adjacent bins reflect similar perceived quality (Ritter-Gutierrez et al., 14 Jul 2025).

3. Methodological Advances: The ASTAR-NTU System

The winning system, ASTAR-NTU, exemplifies the dual-branch, multimodal modeling paradigm now prevalent in AudioMOS research (Ritter-Gutierrez et al., 14 Jul 2025). Its core architecture features:

  • Dual-Branch Encoders: Frozen, pre-trained MuQ for music audio and RoBERTa for text, processing each modality independently.
  • Music Impression Prediction: The audio-only branch uses a transformer layer with multi-head self-attention, followed by attention pooling and a multilayer perceptron (MLP) to produce a soft probability distribution over score bins.
  • Text Alignment Prediction: A cross-attention mechanism fuses audio and text features. Text features act as queries, and audio features as keys and values, enabling fine-grained alignment modeling.
  • Ordinal-Aware Classification: Instead of direct regression, prediction is cast as classification over $K = 20$ equal-width bins spanning the $[1, 5]$ score range, with one-hot targets softened via a Gaussian kernel:

$$y_k \propto \exp\left( -\frac{(s - c_k)^2}{2 \sigma^2} \right)$$

where $s$ is the ground-truth score and $c_k$ is the center of the $k^{\text{th}}$ bin.
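
A minimal sketch of this target construction is shown below; the bin count and score range follow the description above, while the value of $\sigma$ and the exact bin-center convention are assumptions rather than published implementation details.

```python
import torch

def gaussian_smoothed_targets(scores, num_bins=20, lo=1.0, hi=5.0, sigma=0.2):
    """Convert scalar MOS values into soft distributions over ordinal score bins.

    Each target is a Gaussian centered at the true score, evaluated at the bin
    centers and normalized to sum to one (sigma = 0.2 is an assumed value).
    """
    width = (hi - lo) / num_bins
    centers = lo + width * (torch.arange(num_bins) + 0.5)   # bin centers c_k
    scores = torch.as_tensor(scores, dtype=torch.float32).unsqueeze(-1)
    logits = -((scores - centers) ** 2) / (2 * sigma ** 2)
    return torch.softmax(logits, dim=-1)                     # y_k ∝ exp(-(s - c_k)² / 2σ²)

# Example: a ground-truth score of 3.7 yields a soft label peaked at the bin centered on 3.7.
print(gaussian_smoothed_targets([3.7]).argmax(dim=-1))
```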

This model achieved SRCC values of 0.991 for MI and 0.952 for TA on the official test set, representing 21.21% (MI) and 31.47% (TA) relative improvement over the baseline, demonstrating the efficacy of the proposed approach (Ritter-Gutierrez et al., 14 Jul 2025).
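
The sketch below illustrates the flavor of the text-alignment branch described in this section: features from a frozen music encoder and a frozen text encoder are projected to a shared width, text features query audio features through cross-attention, and a pooled representation is classified over the ordinal score bins. Dimensions, pooling, and layer choices here are assumptions for illustration, not the published ASTAR-NTU implementation.

```python
import torch
import torch.nn as nn

class TextAlignmentHead(nn.Module):
    """Illustrative TA branch: text queries attend to audio keys/values."""

    def __init__(self, audio_dim=1024, text_dim=768, hidden=512, heads=8, num_bins=20):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_bins)
        )

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, T_audio, audio_dim) from a frozen music encoder such as MuQ
        # text_feats:  (batch, T_text, text_dim) from a frozen text encoder such as RoBERTa
        q = self.text_proj(text_feats)          # queries come from the text branch
        kv = self.audio_proj(audio_feats)       # keys and values come from the audio branch
        fused, _ = self.cross_attn(q, kv, kv)   # (batch, T_text, hidden)
        pooled = fused.mean(dim=1)              # simple mean pooling (an assumption)
        return self.classifier(pooled)          # logits over the K ordinal score bins

# Training would minimize cross-entropy between these logits and the
# Gaussian-smoothed targets described above.
```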

4. Data and Reference Systems

AudioMOS 2025 uses carefully curated benchmarks and prompts generated for automatic T2M assessment. While detailed dataset information is proprietary, top systems leveraged robust audio encoders pre-trained on music (e.g., MuQ) and text encoders with strong language understanding capabilities (e.g., RoBERTa).

Models are evaluated on data that reflect practical, real-world use cases: prompts may range from genre and mood requests to more complex, multi-faceted descriptions. Model outputs are scored against human MOS annotations, serving as ground truth for benchmarking.

The baseline system, as referenced in the challenge, reported MI SRCC of 0.818 and TA SRCC of 0.724, highlighting the significant performance gap bridged by advanced cross-modal architectures and ordinal-aware learning schemes (Ritter-Gutierrez et al., 14 Jul 2025).

5. Practical and Methodological Implications

AudioMOS 2025 crystallizes several best practices for automatic audio MOS prediction:

  • Fixed, strong pre-trained encoders reduce sample complexity and overfitting, especially in music and text representation learning.
  • Multimodal alignment using cross-attention enables effective modeling of rich text-to-audio relationships, critical for TA prediction tasks.
  • Ordinal classification with Gaussian label smoothing aligns the training objective with ranking-based evaluation, mitigating issues of regression misalignment and enhancing robustness to label noise.

This ordinal classification formulation outperforms direct regression in terms of SRCC, a property that matters for downstream evaluation and ranking systems.
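
One practical consequence of this formulation is that rank-based evaluation still needs a scalar score per sample. A common decoding choice, sketched below under the assumption that the expectation over bin centers is used (argmax decoding is another option; the source does not specify which), converts the predicted bin distribution back into a MOS estimate that can be fed into SRCC.

```python
import torch

def expected_score(logits, lo=1.0, hi=5.0):
    """Decode a scalar MOS estimate as the expectation of the predicted bin distribution."""
    num_bins = logits.shape[-1]
    width = (hi - lo) / num_bins
    centers = lo + width * (torch.arange(num_bins) + 0.5)  # same bin centers as in training
    probs = torch.softmax(logits, dim=-1)
    return (probs * centers).sum(dim=-1)                    # (batch,) scalar scores for SRCC

# Usage: scores = expected_score(model_logits); compare against human MOS with spearmanr.
```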

A plausible implication is that such design principles will generalize to other domains involving perception-based evaluation conditioned on flexible, multimodal input, including text-to-speech, audio captioning, and cross-modal retrieval.

6. Relationship to Contemporary Audio Assessment Challenges

AudioMOS 2025 is situated within a broader backdrop of multimodal audio assessment and representation challenges. Recent efforts such as Codec-SUPERB (Wu et al., 21 Sep 2024), ICME 2025 Audio Encoder Capability (Zhang et al., 25 Jan 2025), DCASE Audio Question Answering (Yang et al., 12 May 2025), and MISP 2025 (Gao et al., 20 May 2025) collectively emphasize:

  • Unified and reproducible experimental standards.
  • Multi-task and multi-domain evaluation (e.g., speech, environmental sound, music).
  • Broad adoption of cross-modal and self-supervised learning.
  • Rigorous objective and subjective metrics, often utilizing rank-based criteria.

These interrelated efforts inform methodological advances in AudioMOS 2025, particularly the focus on robust evaluation, multi-domain adaptability, and the use of scalable, training-free evaluation pipelines.

7. Limitations and Future Perspectives

Despite its advances, AudioMOS 2025 inherits several open challenges:

  • Subjectivity of MOS: While ordinal smoothing and rank-correlation metrics address some noise in human assessment, subjectivity in expert scores remains a confounding variable.
  • Prompt Diversity and Bias: The efficacy of TA prediction depends on the breadth and semantic diversity of prompts. Models may underperform on highly novel or underrepresented prompt types.
  • Scalability: Strong encoders and Transformers enable high performance but can introduce substantial computational cost, possibly limiting real-time or resource-constrained deployment.

Future research directions include integration of adaptive, lightweight encoders, expansion of prompt diversity, extension to more general generative audio tasks, and further refinement of evaluation protocols that better account for the creative and subjective nuances inherent to music perception.

A plausible implication is the continued convergence of model-based and human-centric assessment frameworks, enhancing both automated and hybrid evaluation pipelines for generative audio technologies.
