Generative Speech Reward Model (GSRM)

Updated 4 July 2026

GSRM is a framework for speech generation that replaces opaque MOS scores with structured, interpretable feedback through comparative critiques and per-sample ratings.
It leverages diverse supervision regimes—from MOS-derived pairwise labels to ASR cross-attention and expert ratings—to create scalable reward signals in TTS, speech restoration, and dialogue systems.
GSRMs enable refined evaluation and reinforcement learning by generating detailed evidence logs and reasoning traces, thereby improving model robustness and interpretability.

Generative Speech Reward Model (GSRM) denotes a class of reward modeling approaches for speech generation in which the evaluator is itself generative, structured, or both: it may generate comparative critiques and scores for speech pairs, derive fine-grained rewards from an understanding model’s internal signals, or emit explicit reasoning traces before a final naturalness or interaction-quality judgment. Across recent work, GSRMs are used to replace or supplement Mean Opinion Score (MOS) annotation, construct reward signals for reinforcement learning and preference optimization, and provide more interpretable evaluation than black-box scalar regressors. The term is not fully standardized: in speech quality modeling it refers to generative reward models over audio pairs; in speech RLHF it refers to a reasoning-centric naturalness verifier; in interactive spoken dialogue it refers to a dual-axis evaluator of semantics and turn-taking; and in at least one non-speech paper, the same acronym instead means “Generative Structure Reward Model,” a textual reasoning model unrelated to audio (Cao et al., 1 Oct 2025, Shen et al., 14 Feb 2026, Chen et al., 16 Apr 2026, Xu et al., 22 May 2025).

1. Definition and conceptual scope

In speech research, the central problem addressed by GSRMs is the mismatch between expensive, inconsistent human evaluation and the need for scalable reward signals during model development. "From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling" reformulates speech quality assessment as preference modeling: given a chosen sample $x_c$ and a rejected sample $x_r$, a reward model should produce outputs consistent with $S(c) > S(r)$ (Cao et al., 1 Oct 2025). "GSRM: Generative Speech Reward Model for Speech RLHF" frames the same problem around aesthetic naturalness in speech LLMs, arguing that existing naturalness evaluators usually regress raw audio to scalar scores, offer limited interpretability, and fail to generalize across different speech taxonomies (Shen et al., 14 Feb 2026). "Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models" extends the reward-modeling problem from utterance quality to full-duplex interaction quality, where semantic appropriateness and timing quality must both be evaluated from dual-channel audio (Chen et al., 16 Apr 2026).

A common property of these systems is that the reward is not treated as a single opaque regressed scalar. Instead, the evaluator produces structured evidence: comparative textual critiques and per-sample scores in MOS-RMBench-style speech reward modeling; word-level rewards from ASR cross-attention in W3AR; multi-metric preference judgments in speech restoration; dual-axis chain-of-thought analyses for spoken dialogue; or evidence logs plus final ratings in speech RLHF (Cao et al., 1 Oct 2025, Wang et al., 12 Nov 2025, Zhang et al., 24 Aug 2025, Shen et al., 14 Feb 2026). This suggests that, in current usage, a GSRM is best understood as a reward-modeling pattern rather than a single fixed architecture.

The acronym is also overloaded. In "Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning," GSRM stands for Generative Structure Reward Model and is explicitly a text-only reasoning structure evaluator, not a speech or audio model (Xu et al., 22 May 2025). That distinction is necessary because the speech literature uses the same acronym for materially different objectives and modalities.

2. Supervision regimes and benchmark construction

Recent GSRM work uses several distinct supervision regimes, ranging from human MOS reformulation to model-internal signals and metric-derived preferences. The benchmark most directly tied to speech quality reward modeling is MOS-RMBench, which is built from six public MOS datasets: BVCC, NISQA, SingMOS, SOMOS, TMHINT-QI, and VMC’23. These cover natural and synthetic speech, singing voice, noisy and enhanced speech, and multiple languages. The final benchmark contains 55,333 training pairs, 9,905 development pairs, 6,240 in-domain test pairs, and 3,000 OOD test pairs, with all samples converted to 16 kHz WAV (Cao et al., 1 Oct 2025). Pairwise labels are created by grouping samples within each dataset, keeping only pairs with different MOS values, and assigning chosen versus rejected labels according to the higher MOS. The test-set difficulty is deliberate: $\Delta MOS = |MOS_c - MOS_r|$ is mostly within 1.5 points, and many pairs have very small gaps (Cao et al., 1 Oct 2025).
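The pair-construction rule described above (group within a dataset, drop ties, label the higher-MOS sample as chosen) can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline: the grouping key is assumed to be the dataset name alone, all within-group pairs are enumerated exhaustively, and the sample IDs and MOS values are hypothetical.

```python
from itertools import combinations


def build_preference_pairs(samples):
    """Turn (dataset, sample_id, mos) records into preference pairs.

    Returns a list of (chosen_id, rejected_id, delta_mos) tuples.
    Assumption: samples are grouped by dataset name only.
    """
    by_dataset = {}
    for dataset, sample_id, mos in samples:
        by_dataset.setdefault(dataset, []).append((sample_id, mos))

    pairs = []
    for items in by_dataset.values():
        for (id_a, mos_a), (id_b, mos_b) in combinations(items, 2):
            if mos_a == mos_b:
                continue  # equal MOS carries no preference signal
            if mos_a > mos_b:
                chosen, rejected = id_a, id_b
            else:
                chosen, rejected = id_b, id_a
            pairs.append((chosen, rejected, abs(mos_a - mos_b)))
    return pairs


# Hypothetical records: (dataset, sample_id, MOS)
samples = [
    ("BVCC", "a", 3.5),
    ("BVCC", "b", 4.0),
    ("BVCC", "c", 4.0),
    ("NISQA", "d", 2.0),
]
pairs = build_preference_pairs(samples)
```

Pairs are never formed across datasets, so MOS scales from different listening tests are not compared directly.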

Speech restoration adopts a different construction. "Multi-Metric Preference Alignment for Generative Speech Restoration" introduces GenSR-Pref, comprising 80K preference pairs. A pair is admitted only when a complementary suite of metrics unanimously prefers one restoration over another: NISQA for perceptual quality, DNSMOS for signal-level fidelity, SpeechBERTScore for content alignment, and speaker similarity for timbre preservation (Zhang et al., 24 Aug 2025). The design explicitly avoids a scalar linear combination and uses hard dominance to mitigate reward hacking.
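The hard-dominance admission rule can be illustrated with a small filter. The sketch assumes every metric is oriented so that higher is better and that "unanimous" means a strict improvement on all four; the metric keys and scores are hypothetical placeholders, not values from the paper.

```python
# Assumed metric keys; all are treated as higher-is-better.
METRICS = ("nisqa", "dnsmos", "speechbertscore", "speaker_sim")


def unanimous_preference(a, b, metrics=METRICS):
    """Return "a" or "b" if one candidate wins on every metric, else None."""
    if all(a[m] > b[m] for m in metrics):
        return "a"
    if all(b[m] > a[m] for m in metrics):
        return "b"
    return None  # metrics disagree: the pair is discarded


# Hypothetical metric scores for three restorations of the same input.
cand_a = {"nisqa": 4.1, "dnsmos": 3.6, "speechbertscore": 0.92, "speaker_sim": 0.88}
cand_b = {"nisqa": 3.7, "dnsmos": 3.2, "speechbertscore": 0.90, "speaker_sim": 0.81}
cand_c = {"nisqa": 4.5, "dnsmos": 3.0, "speechbertscore": 0.95, "speaker_sim": 0.70}
```

Because no weighted sum is formed, a candidate cannot be admitted as preferred by inflating one metric at the expense of another.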

Speech RLHF naturalness modeling relies on human expert judgments rather than pairwise MOS conversion. "GSRM: Generative Speech Reward Model for Speech RLHF" reports a large-scale human feedback dataset comprising 31k expert ratings together with an out-of-domain benchmark of real-world user-assistant speech interactions (Shen et al., 14 Feb 2026). The in-domain ConvTTS corpus contains 6,579 dialogues, while the OOD FDX-Conv benchmark contains 490 dialogues from an internal full-duplex speech LLM (Shen et al., 14 Feb 2026). For interactive dialogue reward modeling, the Dual-Axis Generative Reward Model uses a hybrid of synthetic and real data: approximately 146 hours and 6,361 synthetic dual-track samples, plus manually annotated real human-human and human-machine dialogues with timestamps, transcripts, event labels, error labels, and binary quality labels (Chen et al., 16 Apr 2026).

W3AR occupies a separate point in this design space. It uses no explicit reward annotations at the RL stage: a frozen Whisper-like ASR provides word-level feedback through cross-attention when teacher-forced on the ground-truth text of generated speech (Wang et al., 12 Nov 2025).

Formulation	Supervision source	Target task
MOS-aware GRM	MOS-derived pairwise preferences in MOS-RMBench	Speech quality reward modeling
W3AR	Frozen ASR cross-attention, no explicit reward annotations	TTS optimization
GenSR-Pref + DPO	Unanimous multi-metric preference pairs	Generative speech restoration
Dual-Axis GRM	Synthetic and real interaction annotations	Spoken dialogue interaction quality
Speech RLHF GSRM	31k expert ratings and OOD user-assistant interactions	Speech naturalness reward

3. Architectural forms and generated judgments

The most literal generative speech reward model appears in MOS-RMBench-style GRMs. These models are built on Qwen2-Audio and take a pair of audios $(x_c, x_r)$ simultaneously in pairwise comparative mode. The output is a textual comparative critique along four dimensions—Noise, Distortion, Continuity, and Naturalness—together with two scalar scores $S(c)$ and $S(r)$ in the range 1–10 (Cao et al., 1 Oct 2025). The same study also defines scalar reward models, which score a single audio with a scalar head, and semi-scalar reward models, which first generate a textual critique and then attach a scalar reward head. Within that taxonomy, the GSRM corresponds to the fully generative pairwise comparative model.

W3AR uses a different architecture entirely. The generator is an autoregressive TTS policy $\mathcal{M}_\theta$ conditioned on text $\mathbf{y}$ and an acoustic prompt $\mathbf{a}_{\text{prompt}}$, while the evaluator is a frozen Whisper-like encoder-decoder ASR model. The ASR is never fine-tuned; instead, its decoder cross-attention over encoder states is interpreted as a soft alignment between text tokens and acoustic frames. From this attention matrix, W3AR computes per-token reward components reflecting attention concentration and forward temporal progression (Wang et al., 12 Nov 2025). The formulation is model-agnostic and is demonstrated on CosyVoice, VoiceCraft, and MaskGCT (Wang et al., 12 Nov 2025).

The Dual-Axis Generative Reward Model is a post-trained Qwen2.5-Omni-7B. Its input is dual-track audio containing both sides of a spoken interaction, including overlaps and silences. Its output is structured text of the form <response think> ... </response think>, <fluency think> ... </fluency think>, and <overall score> ... </overall score> (Chen et al., 16 Apr 2026). The semantic axis corresponds to Response Relevance, and the timing axis corresponds to Interactional Fluency. The model therefore generates two distinct reasoning channels before a final binary reward.

The reasoning-centric GSRM for speech RLHF is again built on Qwen2.5-Omni-7B, but its teacher supervision is feature-grounded. Offline preprocessing extracts vowel-level prosodic features—pitch level, pitch variation, pitch slope, intensity level, intensity variation, and duration—after forced alignment, and discretizes them into interpretable categories such as high, low, or very high (Shen et al., 14 Feb 2026). GPT-4o then synthesizes utterance-level evidence logs, a global summary explaining the ratings, and final ratings for expressive intensity, expressive correctness, intonation, NSVs and fillers, mispronunciation, pacing, and overall human-likeness (Shen et al., 14 Feb 2026). At deployment, the trained model maps raw speech to the same style of evidence log, summary, and ratings without needing the offline teacher.

4. Reward functions and optimization procedures

MOS-RMBench compares three paradigms—scalar, semi-scalar, and generative—and several optimization objectives. Scalar models can be trained with Bradley–Terry pairwise loss,

$\mathcal{L}_{\text{BT}} = -\log \sigma\big(S(c) - S(r)\big)$, with $\sigma$ the logistic sigmoid,

or with MOS regression,

$\mathcal{L}_{\text{MSE}} = \big(S(x) - MOS_x\big)^2$ for a single sample $x$ with human rating $MOS_x$.
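The two scalar objectives can be compared numerically. The sketch below uses the standard per-pair Bradley–Terry loss, $-\log \sigma(S(c) - S(r))$, and a per-sample squared error; the function names and example scores are illustrative assumptions rather than code from the benchmark.

```python
import math


def bt_loss(s_chosen, s_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(s_chosen - s_rejected)).

    Written as softplus(-margin) in a numerically stable form.
    """
    margin = s_chosen - s_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mse_loss(score, mos):
    """Squared error between a predicted score and the human MOS."""
    return (score - mos) ** 2


# A correctly ordered pair incurs a small loss, a reversed pair a large one.
loss_correct = bt_loss(2.0, 0.0)
loss_reversed = bt_loss(0.0, 2.0)
```

The pairwise loss depends only on the score difference, so it does not require predicted scores to match the MOS scale, whereas the regression loss does.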

GRMs undergo supervised fine-tuning and then RL-style refinement with GRPO or DAPO. Their base reward is an accuracy reward that checks whether the generated scores order the pair correctly, that is, whether $S(c) > S(r)$.

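The MOS-aware GRM adds a difficulty-sensitive shaping term based on the normalized MOS difference. The gap $\Delta MOS = |MOS_c - MOS_r|$ is normalized by its 90th percentile and clamped into $[0, 1]$; the shaping term is a function of this normalized gap, and the final reward combines it with the accuracy reward.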

This reward emphasizes correct fine-grained distinctions on hard pairs and heavily penalizes mistakes on easy pairs (Cao et al., 1 Oct 2025).
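The normalization step that feeds the shaping term can be sketched on its own. The code below covers only the percentile normalization and clamping; it does not reproduce the paper's shaping formula. The clamp range $[0, 1]$, the linear-interpolation percentile, and the example gaps are assumptions made for illustration.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def normalized_gaps(delta_mos):
    """Divide each MOS gap by the 90th-percentile gap, then clamp to [0, 1]."""
    p90 = percentile(delta_mos, 90)
    if p90 <= 0:
        raise ValueError("expected strictly positive MOS gaps")
    return [min(max(d / p90, 0.0), 1.0) for d in delta_mos]


# Hypothetical gaps: small values are hard pairs, large values easy pairs.
gaps = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
difficulty = normalized_gaps(gaps)
```

Normalizing by a high percentile rather than the maximum keeps a few outlier gaps from compressing the rest of the range.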

W3AR defines reward directly from ASR cross-attention. For each text token, it locates the peak of that token's attention over acoustic frames and scores two properties: attention purity, which measures how concentrated the attention is around the peak, and alignment monotonicity, which measures whether the peak advances forward in time relative to earlier tokens. The two components are combined into a single token reward under default hyperparameter settings. Group-relative normalization then converts these rewards into word-level advantages, which weight acoustic-token log-probabilities in a policy-gradient-style RL loss with KL regularization to a frozen reference policy (Wang et al., 12 Nov 2025).
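To make the idea of attention-derived token rewards concrete, the sketch below computes stand-in versions of the two components from a token-by-frame attention matrix. These definitions are assumptions chosen for illustration and are not W3AR's published formulas: purity is taken as the attention mass within a window around the peak, monotonicity as an indicator that the peak does not move backward, and the window size and weights are hypothetical.

```python
def token_rewards(attn, window=1, w_purity=0.5, w_mono=0.5):
    """Per-token rewards from a (tokens x frames) attention matrix.

    Assumed definitions (not the paper's):
      purity = share of a token's attention within `window` frames of its peak
      mono   = 1.0 if the peak is at or after the previous token's peak, else 0.0
    """
    rewards = []
    prev_peak = None
    for row in attn:
        peak = max(range(len(row)), key=row.__getitem__)
        lo = max(0, peak - window)
        hi = min(len(row), peak + window + 1)
        total = sum(row)
        purity = sum(row[lo:hi]) / total if total > 0 else 0.0
        mono = 1.0 if prev_peak is None or peak >= prev_peak else 0.0
        rewards.append(w_purity * purity + w_mono * mono)
        prev_peak = peak
    return rewards


# Hypothetical attention rows for three tokens over four frames.
attn = [
    [0.9, 0.1, 0.0, 0.0],  # sharp peak at frame 0
    [0.1, 0.8, 0.1, 0.0],  # sharp peak at frame 1, moves forward
    [0.7, 0.1, 0.1, 0.1],  # peak jumps back to frame 0
]
rewards = token_rewards(attn, window=0)
```

In this toy example the third token receives the lowest reward because its attention peak moves backward, which is the kind of misalignment a word-level signal is meant to expose.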

In generative speech restoration, the reward signal is externalized into preference construction rather than a learned parametric evaluator. Preference pairs are created only if all four metrics agree on the ordering, and DPO is then applied. For autoregressive models, the core objective is

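$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$,

where $x$ is the degraded input, $y_w$ and $y_l$ are the preferred and rejected restorations, $\pi_\theta$ is the model being aligned, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ controls how far the aligned model may drift from the reference.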

Masked generative models and flow-matching models use analogous DPO adaptations in token space or via squared-error differences under a Gaussian-likelihood interpretation (Zhang et al., 24 Aug 2025).
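For the autoregressive case, the standard DPO loss reduces to a function of four sequence log-probabilities. The sketch below evaluates it for a single preference pair; the value of `beta` and the log-probabilities are hypothetical, and in practice the log-probabilities would come from summing token log-likelihoods under the policy and reference models.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_* are sequence log-probabilities under the policy being trained,
    ref_logp_* under the frozen reference model. beta=0.1 is an assumed value.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss equals log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the preferred sample and lowers the rejected one: loss drops.
loss_after = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

The loss depends only on how the policy's log-probability ratios move relative to the reference, which is why no separately trained reward model is needed at this stage.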

The Dual-Axis Generative Reward Model uses GRPO with a reward on generated judgments that combines two terms: a format term, which checks whether the generated output matches the required reasoning-and-score format, and an accuracy term, which checks whether the extracted overall score matches the ground-truth binary label. Rewards are normalized within a group, and GRPO optimizes the clipped surrogate objective with KL regularization (Chen et al., 16 Apr 2026).
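A rule-based reward of this kind, with one format check and one accuracy check, can be sketched with a regular expression. The tag spellings below are copied from the output format quoted earlier in this article. The sketch also assumes the overall score is emitted as the digit 0 or 1 and that the two terms are simply summed; the paper's exact combination may differ.

```python
import re

# Expected layout: two reasoning blocks followed by a binary overall score.
JUDGMENT = re.compile(
    r"\s*<response think>(.*?)</response think>\s*"
    r"<fluency think>(.*?)</fluency think>\s*"
    r"<overall score>\s*([01])\s*</overall score>\s*",
    re.DOTALL,
)


def judgment_reward(output, label):
    """Format reward plus accuracy reward (summing is an assumption)."""
    match = JUDGMENT.fullmatch(output)
    r_format = 1.0 if match else 0.0
    r_acc = 1.0 if match and int(match.group(3)) == label else 0.0
    return r_format + r_acc


# Hypothetical model output for one dual-track dialogue.
good = (
    "<response think>The reply answers the question.</response think>\n"
    "<fluency think>No long gaps or interruptions.</fluency think>\n"
    "<overall score>1</overall score>"
)
```

An output that cannot be parsed earns neither term, so the policy is pushed toward the required structure before accuracy can be rewarded.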

Speech RLHF uses yet another aggregation. The acoustic GSRM predicts multiple acoustic sub-metrics, while a transcript-based semantic GSRM predicts language complexity and related semantic ratings. The final scalar reward is the uniform average of the sub-metric scores,

$R = \frac{1}{K} \sum_{k=1}^{K} s_k$,

where $s_k$ denotes the $k$-th of $K$ sub-metric scores. GRPO is run on groups of four candidate speech responses, with repeated GSRM inference used for more stable reward estimation (Shen et al., 14 Feb 2026).
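The aggregation and group normalization can be sketched together. The code below takes a uniform average over sub-metric scores, averages repeated reward-model runs, and standardizes rewards within a group of four candidates. Averaging the repeated runs, using the population standard deviation, and the `eps` constant are assumptions, and all scores are hypothetical.

```python
from statistics import mean, pstdev


def scalar_reward(sub_scores):
    """Uniform average of sub-metric scores (dict of name -> score)."""
    return mean(list(sub_scores.values()))


def stable_reward(runs):
    """Average the scalar reward over repeated reward-model runs (assumed)."""
    return mean([scalar_reward(run) for run in runs])


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical sub-metric ratings from two runs on one candidate response.
runs = [
    {"intonation": 4.0, "pacing": 3.0, "human_likeness": 5.0},
    {"intonation": 4.0, "pacing": 4.0, "human_likeness": 4.0},
]
reward = stable_reward(runs)

# Hypothetical rewards for a group of four candidate responses.
advantages = group_advantages([1.0, 2.0, 3.0, 4.0])
```

Because advantages are relative to the group mean, a candidate is rewarded for being better than its siblings rather than for reaching an absolute score.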

5. Applications across speech generation and interaction

The first major application is automatic speech quality assessment. MOS-RMBench explicitly targets TTS, VC, SVS, SVC, speech enhancement, distorted speech, and singing voice, and reframes quality judgment as pairwise reward modeling rather than absolute MOS prediction (Cao et al., 1 Oct 2025). In this setting, the GSRM is an evaluator rather than the speech generator itself.

TTS optimization is the second major application. W3AR treats a frozen ASR model as an evaluator and uses its cross-attention to deliver fine-grained, effectively word-level reward signals during RL. The method improves existing TTS systems and strengthens zero-shot robustness on unseen speakers, while remaining independent of the internal TTS architecture (Wang et al., 12 Nov 2025).

Speech restoration provides a third application. GenSR-Pref shows that preference-based post-training can be applied across autoregressive, masked generative, and flow-matching restoration models. The aligned models do not only improve restoration quality; they can also act as "data annotators," producing pseudo-clean targets for training discriminative models in real-world singing voice restoration, where clean references are unavailable (Zhang et al., 24 Aug 2025).

Interactive spoken dialogue introduces a fourth setting in which the reward target is not utterance naturalness alone but the quality of an interaction trajectory. The Dual-Axis Generative Reward Model listens to dual-channel audio, reasons about semantic coherence and timing behavior, and outputs a binary interaction-quality score suitable for future online RL of full-duplex spoken dialogue models (Chen et al., 16 Apr 2026).

Speech RLHF extends the reward-modeling role further. The reasoning-centric GSRM is used as a verifier inside an online RLHF loop for a full-duplex speech LLM: generated audio is evaluated acoustically, its ASR transcript is evaluated semantically, and the aggregated reward drives GRPO updates (Shen et al., 14 Feb 2026). A plausible implication is that speech GSRMs are becoming increasingly multi-component, with separate evaluators for acoustic naturalness, semantic appropriateness, and interactional timing.

6. Empirical findings, limitations, and naming ambiguities

The most direct cross-paradigm comparison comes from MOS-RMBench. Scalar reward models currently lead in overall pairwise accuracy: Scalar (Classic + BT) reaches 80.04%, Semi-scalar (Cloud + BT) reaches 78.82%, GRM + DAPO reaches 77.60%, GRM + GRPO reaches 76.94%, Gemini as judge reaches 69.26%, and UTMOS reaches 68.18% (Cao et al., 1 Oct 2025). The same study finds that BT loss outperforms MSE, that most models perform worse on synthetic speech than on human speech, and that all paradigms struggle on very small MOS gaps; even the best scalar model has at least 40% error in the smallest $\Delta MOS$ bins (Cao et al., 1 Oct 2025). MOS-aware GRMs partly close the gap, improving overall accuracy to 78.00% with GRPO and 78.08% with DAPO, with especially visible gains on hard pairs and in noisy or speech-enhancement conditions (Cao et al., 1 Oct 2025).
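Pairwise accuracy broken down by MOS gap, the analysis behind the small-gap finding, can be computed as follows. The bin edges and example scores are hypothetical and are not the bins used in the benchmark.

```python
def accuracy_by_gap(pairs, edges=(0.25, 0.5, 1.0)):
    """Pairwise accuracy per MOS-gap bin.

    pairs: iterable of (score_chosen, score_rejected, delta_mos).
    A pair counts as correct when score_chosen > score_rejected.
    Bin index i counts how many edges the gap exceeds (0 = smallest gaps).
    """
    bins = {}
    for s_c, s_r, gap in pairs:
        idx = sum(gap > e for e in edges)
        correct, total = bins.get(idx, (0, 0))
        bins[idx] = (correct + (s_c > s_r), total + 1)
    return {idx: c / t for idx, (c, t) in sorted(bins.items())}


# Hypothetical model scores: two hard pairs (one wrong) and one easy pair.
scored = [
    (7.0, 6.0, 0.1),
    (5.0, 6.0, 0.2),
    (8.0, 3.0, 1.4),
]
acc = accuracy_by_gap(scored)
```

Reporting accuracy per bin rather than as one aggregate makes it visible when a model's headline number is carried mostly by easy, large-gap pairs.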

W3AR reports large WER reductions and improvements in subjective quality. In-domain WER drops from 5.25 to 3.21, and out-of-domain WER drops from 8.92 to 4.54; MOS-N improves from 3.81 to 4.15 in the OOD setting, and AB tests favor W3AR outputs, especially out of domain (Wang et al., 12 Nov 2025). These results support the claim that internal signals from a frozen understanding model can act as informative fine-grained rewards.

Multi-metric preference alignment adds a complementary lesson: reward quality depends strongly on reward construction. Single-metric alignment can improve the selected metric while degrading others, whereas the unanimous multi-metric construction improves almost all reported axes simultaneously and is explicitly presented as a defense against reward hacking (Zhang et al., 24 Aug 2025). The approach has a structural limitation, however: the aligned model can only be as faithful as the underlying metric suite.

For interactive spoken dialogue, the Dual-Axis Generative Reward Model reports state-of-the-art interaction-quality assessment across synthetic and real-world datasets, including 0.9853 accuracy and 0.9852 macro-F1 in-distribution, 0.8679 and 0.6931 on real-world human-human data, and 0.7727 and 0.7647 on real-world human-machine data (Chen et al., 16 Apr 2026). Yet the current reward is binary, real-world human-machine robustness depends strongly on including real data during RL-stage training, and the paper does not yet integrate the model into a deployed online RL loop (Chen et al., 16 Apr 2026).

The reasoning-centric speech RLHF GSRM reports Pearson correlation of 0.401 on in-domain ConvTTS and 0.465 on OOD FDX-Conv, compared with human inter-rater consistency of approximately 0.533 on both datasets (Shen et al., 14 Feb 2026). In online RLHF, human A/B evaluation shows RLHF wins over the base SFT model on tone in 74% of pairs, pacing in 60%, intonation in 66%, and overall naturalness in 82% (Shen et al., 14 Feb 2026). At the same time, the model’s acoustic evidence pipeline is vowel-centric, mispronunciation is not explicitly phonetic, and naive RL for the reward model itself underperforms full-data supervised fine-tuning (Shen et al., 14 Feb 2026).

Two broader methodological caveats follow. First, the acronym GSRM is not unique to speech: RLKD uses it for Generative Structure Reward Model in text reasoning (Xu et al., 22 May 2025). Second, a plausible future direction is to import foundation-reward-model ideas into speech, such as large-scale unlabeled pretraining followed by generative preference fine-tuning. "GRAM: A Generative Foundation Reward Model for Reward Generalization" is text-only, but it shows that label smoothing in a generative preference model yields a regularized Bradley–Terry loss and that such models can generalize across tasks with limited additional supervision (Wang et al., 17 Jun 2025). This suggests a possible convergence between speech-native GSRMs and broader generative reward modeling, although that transfer remains a projection rather than an established result.
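The connection between label smoothing and a regularized Bradley–Terry loss can be checked with a short derivation. The sketch assumes a smoothing rate $\varepsilon$ and a score margin $\Delta = S(c) - S(r)$; this notation is introduced here for illustration and is not taken from the GRAM paper.

```latex
\begin{align*}
\mathcal{L}_{\varepsilon}
  &= -(1-\varepsilon)\,\log\sigma(\Delta) \;-\; \varepsilon\,\log\sigma(-\Delta) \\
  &= -(1-\varepsilon)\,\log\sigma(\Delta) \;-\; \varepsilon\,\bigl(\log\sigma(\Delta) - \Delta\bigr)
     && \text{since } \sigma(-\Delta) = e^{-\Delta}\,\sigma(\Delta) \\
  &= \underbrace{-\log\sigma(\Delta)}_{\text{Bradley--Terry loss}}
     \;+\; \underbrace{\varepsilon\,\Delta}_{\text{regularizer}}
\end{align*}
```

The extra term $\varepsilon\,\Delta$ grows with the margin, so smoothing discourages the model from pushing the chosen and rejected scores arbitrarily far apart.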