
Speech Quality Assessment (SQA)

Updated 12 April 2026
  • Speech Quality Assessment (SQA) is the quantitative evaluation of audio quality in speech signals, merging subjective human ratings with machine predictions.
  • Modern SQA leverages non-intrusive deep learning models and robust datasets, enabling scalable evaluation in telephony, conferencing, and generative AI applications.
  • Two-stage training that combines LLM-based pseudo-labeling with human fine-tuning enhances model generalization across languages, impairments, and out-of-distribution conditions.

Speech Quality Assessment (SQA) refers to the quantitative evaluation of perceived audio quality in speech signals, typically aligning model predictions with subjective human ratings such as the Mean Opinion Score (MOS) on a 1–5 scale. SQA is a critical component in evaluating and optimizing telephony, conferencing, VoIP, speech synthesis, enhancement, and diverse generative AI systems. Modern SQA frameworks address the limitations of intrusive metrics (which require a reference signal) and the scalability constraints of subjective human rating through advanced supervised, semi-supervised, and self-supervised methodologies, as well as through the synthesis of large, diversified datasets. Recent research has focused extensively on scalable, non-intrusive SQA models that generalize across languages, degradation types, and domains, leveraging deep learning, LLMs, and novel training paradigms.

1. Problem Formulation and Classical Foundations

Speech Quality Assessment targets the prediction of perceived signal fidelity using machine learning models that bridge the gap between objective signal characteristics and subjective listening experience. The objective is to approximate human perceptual MOS—defined by ITU-T P.800’s Absolute Category Rating (ACR)—without requiring repeated, expensive listening tests (Cumlin et al., 8 Aug 2025).

Intrusive vs. Non-Intrusive Methods

  • Intrusive metrics (e.g., PESQ, POLQA) compare a degraded signal with a clean reference and quantify perceptual distance. These are unsuitable for real-world scenarios where no reference is available.
  • Non-intrusive methods predict MOS or similar perceptual scores using only the degraded (or synthesized) speech signal, enabling scalability and real-time deployment in practical systems; the sketch below contrasts the two paradigms.
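As a concrete contrast, the following sketch scores the same clip both ways. It assumes the open-source `pesq` package (an ITU-T P.862 implementation on PyPI); the non-intrusive predictor is a hypothetical placeholder, and the synthetic signals stand in for real recordings, which are needed for meaningful scores.

```python
# Minimal contrast of intrusive vs. non-intrusive scoring.
# Assumes `pip install pesq`; signals here are synthetic stand-ins.
import numpy as np
from pesq import pesq  # intrusive, reference-based P.862 metric

fs = 16000
reference = np.random.randn(3 * fs).astype(np.float32)            # clean reference (stand-in)
degraded = reference + 0.05 * np.random.randn(3 * fs).astype(np.float32)

# Intrusive: needs both the clean reference and the degraded signal.
score = pesq(fs, reference, degraded, "wb")                       # wideband mode

# Non-intrusive: only the degraded signal is available at inference.
def predict_mos(signal: np.ndarray, sample_rate: int) -> float:
    """Hypothetical trained predictor (e.g., a DNSMOS Pro-style model)."""
    raise NotImplementedError

print(f"Intrusive PESQ: {score:.2f}")
```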

Challenges

  • Labeled datasets are limited due to the high cost of human annotation; open-source corpora usually offer only 3–8 subjective ratings per clip, introducing significant statistical uncertainty (illustrated in the sketch after this list).
  • Most datasets inadequately cover realistic impairment types, restricting generalization to out-of-distribution (OOD) degradations and languages.
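The per-clip uncertainty is easy to quantify: with only a handful of ACR ratings, the clip-level MOS carries a wide confidence interval. A minimal illustration, with hypothetical ratings:

```python
# Statistical uncertainty of a clip-level MOS from few ratings.
import numpy as np
from scipy import stats

ratings = np.array([3, 4, 4, 2, 5])       # five hypothetical ACR ratings for one clip
mos = ratings.mean()
sem = stats.sem(ratings)                  # standard error of the mean
ci_lo, ci_hi = stats.t.interval(0.95, df=len(ratings) - 1, loc=mos, scale=sem)

print(f"MOS = {mos:.2f}, 95% CI = [{ci_lo:.2f}, {ci_hi:.2f}]")
# With n = 5, the interval typically spans well over a full MOS point.
```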

2. Model Architectures and Dataset Construction

Recent SQA advances emphasize architectures that learn robust mappings from spectro-temporal representations to MOS (or related) scores, and scalable dataset creation via large synthetic or pseudo-labeled corpora (Cumlin et al., 8 Aug 2025).

Dataset Synthesis and Pseudo-Raters

To address annotation bottlenecks, the "LibriAugmented" dataset (101,129 speech clips) was generated by applying 15 single and 6 combined degradations—including additive noise (SNR –10 to 15 dB), clipping, gain transition, low-pass filtering, MP3 compression, pitch variation, reverberation (RT60 0.8–1.5 s), time stretch, and masking—to LibriSpeech utterances using audiomentations. A fine-tuned LLM (Vicuna-7b-v1.5, with Whisper encoder) labeled these examples via prompts such as "Please evaluate the quality of the speech sample and only answer me with a score." Label distributions were distortion-balanced (~5% per single impairment; ~8.3% per paired).
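A minimal sketch of such a degradation pipeline, using the audiomentations library named above. Transform and parameter names follow recent audiomentations releases and may differ in older versions; some transforms need optional dependencies (e.g., Mp3Compression), and the exact 15 + 6 degradation recipe of LibriAugmented is not reproduced here.

```python
# Illustrative degradation pipeline in the spirit of LibriAugmented.
import numpy as np
from audiomentations import (
    AddGaussianSNR, ClippingDistortion, Compose, GainTransition,
    LowPassFilter, Mp3Compression, PitchShift, TimeStretch,
)

augment = Compose([
    AddGaussianSNR(min_snr_db=-10.0, max_snr_db=15.0, p=0.3),  # SNR range from the text
    ClippingDistortion(p=0.3),
    GainTransition(p=0.3),
    LowPassFilter(p=0.3),
    Mp3Compression(p=0.3),
    PitchShift(p=0.3),
    TimeStretch(p=0.3),
    # Reverberation (RT60 0.8-1.5 s) could be added with RoomSimulator,
    # which requires the optional pyroomacoustics dependency.
])

clip = np.random.randn(16000 * 3).astype(np.float32)   # stand-in for a LibriSpeech utterance
degraded = augment(samples=clip, sample_rate=16000)
```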

Supervised Model Designs

  • Convolutional/Recurrent Predictors: Models such as DNSMOS Pro and DeePMOS accept log-magnitude spectrograms. DNSMOS Pro uses 2D convolutions with batch normalization and dropout, global max pooling, and a Gaussian distributional predictor (mean/variance); DeePMOS adds a bi-LSTM for frame-wise Gaussian modeling (Cumlin et al., 8 Aug 2025). A minimal sketch follows this list.
  • Optimization: Both use the Adam optimizer (learning rate 1e-4, batch size 64), with model selection based on validation PCC. Training regimens typically span 60–500 epochs.
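The sketch below renders the DNSMOS Pro-style design described above in PyTorch: convolution + batch norm + dropout blocks over a log-magnitude spectrogram, global max pooling, and a Gaussian head emitting a per-clip MOS mean and variance. Layer sizes are illustrative assumptions, not the published configuration.

```python
# Minimal DNSMOS Pro-style Gaussian MOS predictor (illustrative sizes).
import torch
import torch.nn as nn

class SpectrogramMOSNet(nn.Module):
    def __init__(self, channels=(1, 32, 64)):
        super().__init__()
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.Dropout(0.2),
            ]
        self.encoder = nn.Sequential(*blocks)
        self.head = nn.Linear(channels[-1], 2)        # -> (mean, log-variance)

    def forward(self, log_mag_spec):                  # (batch, 1, freq, time)
        h = self.encoder(log_mag_spec)
        h = torch.amax(h, dim=(2, 3))                 # global max pooling
        mean, log_var = self.head(h).unbind(dim=-1)
        return mean, log_var.exp()                    # Gaussian mean and variance

model = SpectrogramMOSNet()
mean, var = model(torch.randn(8, 1, 257, 300))        # dummy spectrogram batch
loss = nn.GaussianNLLLoss()(mean, torch.full_like(mean, 3.5), var)
```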

3. Training Algorithms and Supervision Paradigms

Three primary SQA model training paradigms have been systematically compared using human and pseudo-labeled data (Cumlin et al., 8 Aug 2025):

| Strategy | Data Source | Regimen |
|---|---|---|
| Human-only | Human-rated sets | Train directly on, e.g., NISQA_TRAIN_SIM (~10k clips, ~5 ratings each) |
| LLM-only | LLM-labeled data | Train solely on LibriAugmented (~100k pseudo-rated clips) |
| Two-stage | Both | Pretrain on LibriAugmented, then fine-tune on human-rated data |
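Operationally, the two-stage regimen reduces to two supervised passes over different label sources. A schematic sketch, with hypothetical data loaders and the model from the earlier sketch; optimizer settings follow the text (Adam, lr 1e-4, batch size 64):

```python
# Two-stage training skeleton: LLM-pseudo-label pretraining, human fine-tuning.
import torch

def run_epochs(model, loader, epochs=60, lr=1e-4):
    """One supervised stage: Gaussian NLL regression to MOS labels."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # settings from the text
    nll = torch.nn.GaussianNLLLoss()
    for _ in range(epochs):
        for spec, mos in loader:           # batches of (spectrogram, MOS label)
            mean, var = model(spec)
            loss = nll(mean, mos, var)
            opt.zero_grad()
            loss.backward()
            opt.step()

def two_stage(model, pseudo_loader, human_loader):
    run_epochs(model, pseudo_loader)       # stage 1: LLM-pseudo-labeled LibriAugmented
    run_epochs(model, human_loader)        # stage 2: human-rated fine-tuning
    return model                           # in practice, select checkpoint by validation PCC
```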

Findings:

  • LLM-only training matches or exceeds human-only training on many test sets, particularly when the human-rated corpus is small or biased (e.g., TMHINT-QI).
  • Two-stage (LLM pretraining + human fine-tuning) consistently outperforms all single-stage approaches, especially on OOD degradations and languages.

Model Evaluation Metrics

SQA employs the following metrics (computed in the sketch after this list):

  • Pearson Correlation Coefficient (PCC): linear score agreement
  • Spearman’s Rank Correlation Coefficient (SRCC): monotonic order agreement
  • Root-Mean-Square Error (RMSE): absolute score deviation
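All three metrics are a few lines of SciPy/NumPy; the scores below are hypothetical.

```python
# PCC, SRCC, and RMSE between human MOS and model predictions.
import numpy as np
from scipy.stats import pearsonr, spearmanr

y_true = np.array([3.2, 4.1, 2.5, 4.8, 3.0])    # hypothetical human MOS
y_pred = np.array([3.0, 4.3, 2.9, 4.5, 3.2])    # hypothetical model predictions

pcc, _ = pearsonr(y_true, y_pred)                # linear agreement
srcc, _ = spearmanr(y_true, y_pred)              # rank-order agreement
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # absolute deviation

print(f"PCC={pcc:.3f}  SRCC={srcc:.3f}  RMSE={rmse:.3f}")
```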

4. Experimental Evaluation Across Corpora and Degradations

Generalization and robustness are primary benchmarks for SQA systems (Cumlin et al., 8 Aug 2025). The following results were observed on a diverse benchmark suite:

DNSMOS Pro PCC by training strategy:

| Test Set | Human-only | LLM-only | Two-stage |
|---|---|---|---|
| NISQA_TEST_LIVETALK | 0.55 ± 0.05 | 0.46 ± 0.07 | 0.63 ± 0.01 |
| Tencent w/ reverb | 0.65 ± 0.06 | 0.60 ± 0.03 | 0.73 ± 0.01 |

Similar uplift was observed with DeePMOS. Two-stage training consistently achieves the highest generalization, especially on test sets with unseen languages or distortion mixes.

Interpretation: Pretraining on LLM-pseudo-labeled, distortion-balanced corpora enables models to learn generic mappings robust to unseen artifacts; human-label fine-tuning then calibrates predictions onto the human perceptual scale, mitigating LLM-induced distributional biases.

5. Limitations, Error Analysis, and Methodological Implications

A series of limitations and technical insights have emerged:

  • LLM-generated pseudo-labels: While correlating strongly with human MOS, LLM scores inherit biases from their pretraining and may inadequately represent edge-case artifacts, e.g., under- or over-rating reverberation, unnatural codecs, or context-specific degradations.
  • Coverage and Representativeness: The LibriAugmented corpus is synthetic, English-only, and lacks real in-situ channel artifacts (e.g., packet loss, nonlinear hardware).
  • Rating Granularity: LLM response protocols, shaped by prompts, can lead to coarser, less-predictive MOS distributions compared to granular ACR scales.

These findings suggest that continual refinement of pseudo-rating fidelity is necessary, including multilingual, realistic, and in-the-wild degradation simulation.

References

  • Cumlin et al. (8 Aug 2025).
