
MOS-RMBench: Reward Modeling Benchmark

Updated 5 October 2025
  • MOS-RMBench is a unified benchmark framework that converts traditional MOS ratings into preference pairs for reproducible speech quality reward modeling.
  • It supports scalar, semi-scalar, and generative models by integrating losses and reinforcement strategies to handle annotation inconsistencies and scale bias.
  • Adaptive reward design using normalized MOS differences enhances fine-grained quality discrimination and yields up to a 3% accuracy gain on challenging pairs.

MOS-RMBench is a unified benchmark framework for reward modeling in automatic speech quality assessment that reformulates conventional Mean Opinion Score (MOS) datasets into a preference-comparison paradigm. This approach aims to resolve systematic limitations of MOS-based evaluation—such as annotation inconsistency, dataset-specific scale bias, and lack of reproducibility—by establishing a rigorous, scalable methodology for evaluating and training reward models across diverse speech corpora and modeling paradigms.

1. Paradigm Shift in Speech Quality Assessment

Traditional MOS evaluation relies on absolute human ratings, resulting in subjective scores that are often not directly comparable across datasets due to rater bias, annotation noise, and scale drift. MOS-RMBench addresses these issues by converting MOS-labeled samples into “preference pairs”: for each group of related utterances (e.g., same content, system, or speaker), samples are split into “chosen” and “rejected” members, forming a pair where the chosen sample has a higher MOS.

This preference-based conversion enables standardized, cross-dataset evaluation and avoids dependence on fragile absolute scales. Every audio sample is resampled to a unified format (typically 16 kHz WAV), reliably paired, and annotated, which supports more meaningful comparison and rigorous assessment of reward model performance.
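
The conversion itself is straightforward to prototype. The sketch below uses a hypothetical record schema (`group_id`, `utterance_id`, and `mos` are illustrative field names, and the optional `min_gap` threshold is an assumption rather than a documented benchmark setting) to show how MOS-labeled utterances can be turned into chosen/rejected pairs:

```python
from itertools import combinations

def build_preference_pairs(records, min_gap=0.0):
    """Convert MOS-labeled utterances into (chosen, rejected) preference pairs.

    records: list of dicts with 'group_id', 'utterance_id', and 'mos' keys
             (hypothetical schema; group_id ties together related utterances,
             e.g. same content, system, or speaker).
    min_gap: optional minimum MOS difference required to form a pair.
    """
    # Group related utterances so comparisons stay within a group.
    groups = {}
    for rec in records:
        groups.setdefault(rec["group_id"], []).append(rec)

    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            gap = a["mos"] - b["mos"]
            if abs(gap) <= min_gap:
                continue  # skip ties / near-ties below the chosen threshold
            chosen, rejected = (a, b) if gap > 0 else (b, a)
            pairs.append({
                "chosen": chosen["utterance_id"],
                "rejected": rejected["utterance_id"],
                "mos_gap": abs(gap),
            })
    return pairs

# Toy example: one group with two utterances yields a single pair.
toy = [
    {"group_id": "sys1", "utterance_id": "u1", "mos": 4.2},
    {"group_id": "sys1", "utterance_id": "u2", "mos": 3.1},
]
print(build_preference_pairs(toy))  # u1 chosen over u2
```

In practice, the benchmark additionally applies reliability filtering before pairing, as described in the data standardization procedure below.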

2. Reward Modeling Approaches

MOS-RMBench supports three principal paradigms for speech reward modeling:

  • Scalar Reward Models: These models predict a single numerical score per sample and are trained with a combination of Bradley–Terry (BT) loss, which maximizes the score gap between chosen and rejected samples, and Mean Squared Error (MSE) loss against the MOS labels (a minimal loss sketch follows this list). The primary metric for evaluation is the accuracy reward:

$$\text{Accuracy reward} = \begin{cases} 1 & S_\text{chosen} > S_\text{rejected} \\ -1 & \text{otherwise} \end{cases}$$

where $S(\cdot)$ denotes the predicted scalar score.

  • Semi-Scalar Reward Models: These models integrate natural-language critiques alongside scalar ratings. With the assistance of Gemini-2.5-Pro, each sample is annotated along four dimensions: noise, distortion, continuity, and naturalness. The model is trained both to generate these detailed descriptions and to produce a scalar reward score. Both BT and MSE losses are applied as in the scalar models.
  • Generative Reward Models (GRMs): GRMs take pairs of samples as input and generate multi-dimensional quality assessments. Training begins with supervised fine-tuning (SFT) using Gemini-2.5-Pro-derived annotations. Models are then optimized via reinforcement learning methods such as Group Relative Policy Optimization (GRPO) or Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), with the baseline reward mirroring the scalar model’s accuracy regime.
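
As a concrete illustration of the scalar training objective referenced above, the following PyTorch sketch combines a Bradley–Terry term over the score gap with an MSE term against the MOS labels. The function name, the `lambda_mse` weighting, and the batch shapes are assumptions made for illustration, not the benchmark's published implementation:

```python
import torch
import torch.nn.functional as F

def scalar_reward_loss(s_chosen, s_rejected, mos_chosen, mos_rejected, lambda_mse=1.0):
    """Combined Bradley-Terry + MSE loss for a scalar reward model (illustrative).

    s_chosen, s_rejected:     predicted scores for chosen / rejected samples, shape (B,)
    mos_chosen, mos_rejected: ground-truth MOS labels, shape (B,)
    lambda_mse:               weight on the MOS regression term (assumed value).
    """
    # Bradley-Terry loss: push the chosen score above the rejected score.
    bt_loss = -F.logsigmoid(s_chosen - s_rejected).mean()

    # MSE loss: anchor predicted scores to the absolute MOS scale.
    mse_loss = F.mse_loss(s_chosen, mos_chosen) + F.mse_loss(s_rejected, mos_rejected)

    return bt_loss + lambda_mse * mse_loss

# Toy example with a batch of two pairs:
scores_c = torch.tensor([4.0, 3.5])
scores_r = torch.tensor([3.0, 3.6])
mos_c = torch.tensor([4.1, 3.8])
mos_r = torch.tensor([2.9, 3.4])
print(scalar_reward_loss(scores_c, scores_r, mos_c, mos_r))
```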

3. MOS-Aware Adaptive Reward Design

A key finding in MOS-RMBench is that discriminating fine-grained quality differences—especially for pairs with very small MOS gaps—remains a challenge for all modeling paradigms. To address this, a MOS-aware reward function is introduced that adaptively scales learning signals according to pairwise difficulty:

Let $\Delta_\text{MOS}$ denote the normalized MOS gap of a pair, clamped to $[0, 1]$. The MOS-difference-based reward is:

$$\text{MOS-difference reward} = \begin{cases} 0.5 \cdot [1.0 + \cos(\Delta_\text{MOS} \cdot \pi)] & S_\text{chosen} > S_\text{rejected} \\ 0.5 \cdot [-1.0 + \cos(\Delta_\text{MOS} \cdot \pi)] & \text{otherwise} \end{cases}$$

The final reward combines the accuracy and MOS-difference signals:

$$\text{Final MOS-aware reward} = \text{Accuracy reward} + \text{MOS-difference reward}$$

This adaptive formulation yields statistically significant improvements (up to 3% accuracy gain on challenging pairs) and narrows the gap between GRMs and scalar models in fine discrimination tasks.
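
A minimal Python sketch of this composite reward, assuming the MOS gap has already been normalized to $[0, 1]$ (the function and argument names are illustrative):

```python
import math

def mos_aware_reward(s_chosen, s_rejected, delta_mos):
    """Composite reward = accuracy reward + MOS-difference reward.

    s_chosen, s_rejected: predicted scores for the chosen / rejected sample.
    delta_mos:            normalized MOS gap of the pair, in [0, 1].
    """
    delta_mos = min(max(delta_mos, 0.0), 1.0)  # defensive clamp
    correct = s_chosen > s_rejected

    # Accuracy reward: +1 if the model ranks the pair correctly, else -1.
    accuracy_reward = 1.0 if correct else -1.0

    # MOS-difference reward: cosine-scaled by pair difficulty. Hard pairs
    # (small gaps) earn a larger bonus when ranked correctly and a smaller
    # penalty when ranked incorrectly; easy pairs behave the opposite way.
    cos_term = math.cos(delta_mos * math.pi)
    if correct:
        diff_reward = 0.5 * (1.0 + cos_term)
    else:
        diff_reward = 0.5 * (-1.0 + cos_term)

    return accuracy_reward + diff_reward

# Hard pair (tiny MOS gap) ranked correctly: reward close to 2.
print(mos_aware_reward(3.9, 3.8, delta_mos=0.05))
# Easy pair (large gap) ranked incorrectly: reward of -2.
print(mos_aware_reward(2.0, 4.5, delta_mos=1.0))
```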

4. Experimental Findings

The principal empirical results obtained using MOS-RMBench are:

  • Scalar models achieve the strongest performance, exceeding 74% pairwise accuracy and reaching ~80% in both in-domain and out-of-domain settings.
  • Synthetic speech remains problematic: All models perform markedly worse on synthetic speech datasets (e.g., SOMOS, VMC’23) compared to human recordings, indicating a pronounced domain gap.
  • Fine-grained quality discrimination is challenging: Even the best models exhibit error rates around 40% for pairs with very small MOS differences.
  • MOS-aware GRMs outperform baselines on difficult pairs when reinforcement strategies (GRPO/DAPO) are used, confirming the utility of adaptive reward scaling.

5. Data Standardization and Annotation Procedures

All samples are normalized to a common format and reliability-filtered before pairing. Semi-scalar and generative models are further enriched with quality descriptions (noise, distortion, continuity, naturalness) via advanced annotation models (Gemini-2.5-Pro). This annotation is used during supervised fine-tuning of GRMs and as auxiliary training targets for semi-scalar models.
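
A minimal sketch of the format-normalization step, assuming the 16 kHz mono WAV target mentioned earlier; librosa and soundfile are used here purely for illustration, since the benchmark's actual preprocessing tooling is not specified:

```python
import librosa
import soundfile as sf

def normalize_audio(in_path, out_path, target_sr=16000):
    """Resample an utterance to 16 kHz mono 16-bit WAV before pairing."""
    # librosa.load resamples and downmixes to mono in one call.
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)
    sf.write(out_path, audio, target_sr, subtype="PCM_16")

# Placeholder paths for illustration only.
normalize_audio("sample_48k.flac", "sample_16k.wav")
```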

6. Benchmark Implications and Future Directions

MOS-RMBench establishes a reproducible, scalable benchmark for speech quality reward modeling that mitigates inter-dataset bias and annotation inconsistencies. The conversion of MOS scores into preference judgments and the adoption of adaptive reward strategies enable the development of models with superior sensitivity to fine-grained perceptual quality.

Planned or suggested future research directions include:

  • Expanding MOS-RMBench to encompass additional languages and more varied acoustic conditions.
  • Incorporating unmodeled speech dimensions (e.g., prosody, emotion, style) to further generalize assessment capabilities.
  • Refining dataset conversion methodologies to better retain absolute perceptual information.
  • Designing new reward models and training regimes that increase discrimination sensitivity, especially for closely matched synthetic and human speech samples.

7. Comparative Table of Modeling Paradigms

| Paradigm | Input / Output | Training Objectives |
|---|---|---|
| Scalar Model | Single sample / Scalar score | BT loss, MSE loss |
| Semi-Scalar Model | Single sample / Scalar + text | BT loss, MSE loss, text critique |
| Generative RM (GRM) | Sample pair / Text descriptions | SFT, RL (GRPO/DAPO), adaptive reward |

References and Data

All methodologies, datasets, and source code for MOS-RMBench are documented and available as specified in (Cao et al., 1 Oct 2025).
