
Speech Quality Assessment (SQA)

Updated 12 April 2026
  • Speech Quality Assessment (SQA) is the quantitative evaluation of audio quality in speech signals, merging subjective human ratings with machine predictions.
  • Modern SQA leverages non-intrusive deep learning models and robust datasets, enabling scalable evaluation in telephony, conferencing, and generative AI applications.
  • Two-stage training that combines LLM-based pseudo-labeling with human fine-tuning enhances model generalization across languages, impairments, and out-of-distribution conditions.

Speech Quality Assessment (SQA) refers to the quantitative evaluation of perceived audio quality in speech signals, typically aligning model predictions with subjective human ratings such as the Mean Opinion Score (MOS) on a 1–5 scale. SQA is a critical component in evaluating and optimizing telephony, conferencing, VoIP, speech synthesis, enhancement, and diverse generative AI systems. Modern SQA frameworks address the limitations of intrusive metrics (which require a reference signal) and the scalability constraints of subjective human rating through advanced supervised, semi-supervised, and self-supervised methodologies, as well as through the synthesis of large, diversified datasets. Recent research has focused extensively on scalable, non-intrusive SQA models that generalize across languages, degradation types, and domains, leveraging deep learning, LLMs, and novel training paradigms.

1. Problem Formulation and Classical Foundations

Speech Quality Assessment targets the prediction of perceived signal fidelity using machine learning models that bridge the gap between objective signal characteristics and subjective listening experience. The objective is to approximate human perceptual MOS—defined by ITU-T P.800’s Absolute Category Rating (ACR)—without requiring repeated, expensive listening tests (Cumlin et al., 8 Aug 2025).

Intrusive vs. Non-Intrusive Methods

  • Intrusive metrics (e.g., PESQ, POLQA) compare a degraded signal with a clean reference and quantify perceptual distance. These are unsuitable for real-world scenarios where no reference is available.
  • Non-intrusive methods predict MOS or similar perceptual scores using only the degraded (or synthesized) speech signal, enabling scalability and real-time deployment in practical systems; the sketch below contrasts the two paradigms.
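As a concrete contrast, the following sketch scores the same clip both ways. It assumes the open-source `pesq` package (an ITU-T P.862 implementation on PyPI); the non-intrusive predictor is a hypothetical placeholder, and the synthetic signals stand in for real recordings, which are needed for meaningful scores.

```python
# Minimal contrast of intrusive vs. non-intrusive scoring.
# Assumes `pip install pesq`; signals here are synthetic stand-ins.
import numpy as np
from pesq import pesq  # intrusive, reference-based P.862 metric

fs = 16000
reference = np.random.randn(3 * fs).astype(np.float32)            # clean reference (stand-in)
degraded = reference + 0.05 * np.random.randn(3 * fs).astype(np.float32)

# Intrusive: needs both the clean reference and the degraded signal.
score = pesq(fs, reference, degraded, "wb")                       # wideband mode

# Non-intrusive: only the degraded signal is available at inference.
def predict_mos(signal: np.ndarray, sample_rate: int) -> float:
    """Hypothetical trained predictor (e.g., a DNSMOS Pro-style model)."""
    raise NotImplementedError

print(f"Intrusive PESQ: {score:.2f}")
```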

Challenges

  • Labeled datasets are limited due to the high cost of human annotation; open-source corpora usually offer only 3–8 subjective ratings per clip, introducing significant statistical uncertainty (illustrated in the sketch after this list).
  • Most datasets inadequately cover realistic impairment types, restricting generalization to out-of-distribution (OOD) degradations and languages.
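The per-clip uncertainty is easy to quantify: with only a handful of ACR ratings, the clip-level MOS carries a wide confidence interval. A minimal illustration, with hypothetical ratings:

```python
# Statistical uncertainty of a clip-level MOS from few ratings.
import numpy as np
from scipy import stats

ratings = np.array([3, 4, 4, 2, 5])       # five hypothetical ACR ratings for one clip
mos = ratings.mean()
sem = stats.sem(ratings)                  # standard error of the mean
ci_lo, ci_hi = stats.t.interval(0.95, df=len(ratings) - 1, loc=mos, scale=sem)

print(f"MOS = {mos:.2f}, 95% CI = [{ci_lo:.2f}, {ci_hi:.2f}]")
# With n = 5, the interval typically spans well over a full MOS point.
```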

2. Model Architectures and Dataset Construction

Recent SQA advances emphasize architectures that learn robust mappings from spectro-temporal representations to MOS (or related) scores, and scalable dataset creation via large synthetic or pseudo-labeled corpora (Cumlin et al., 8 Aug 2025).

Dataset Synthesis and Pseudo-Raters

To address annotation bottlenecks, the "LibriAugmented" dataset (101,129 speech clips) was generated by applying 15 single and 6 combined degradations—including additive noise (SNR –10 to 15 dB), clipping, gain transition, low-pass filtering, MP3 compression, pitch variation, reverberation (RT60 0.8–1.5 s), time stretch, and masking—to LibriSpeech utterances using audiomentations. A fine-tuned LLM (Vicuna-7b-v1.5, with Whisper encoder) labeled these examples via prompts such as "Please evaluate the quality of the speech sample and only answer me with a score." Label distributions were distortion-balanced (~5% per single impairment; ~8.3% per paired).
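A minimal sketch of such a degradation pipeline, using the audiomentations library named above. Transform and parameter names follow recent audiomentations releases and may differ in older versions; some transforms need optional dependencies (e.g., Mp3Compression), and the exact 15 + 6 degradation recipe of LibriAugmented is not reproduced here.

```python
# Illustrative degradation pipeline in the spirit of LibriAugmented.
import numpy as np
from audiomentations import (
    AddGaussianSNR, ClippingDistortion, Compose, GainTransition,
    LowPassFilter, Mp3Compression, PitchShift, TimeStretch,
)

augment = Compose([
    AddGaussianSNR(min_snr_db=-10.0, max_snr_db=15.0, p=0.3),  # SNR range from the text
    ClippingDistortion(p=0.3),
    GainTransition(p=0.3),
    LowPassFilter(p=0.3),
    Mp3Compression(p=0.3),
    PitchShift(p=0.3),
    TimeStretch(p=0.3),
    # Reverberation (RT60 0.8-1.5 s) could be added with RoomSimulator,
    # which requires the optional pyroomacoustics dependency.
])

clip = np.random.randn(16000 * 3).astype(np.float32)   # stand-in for a LibriSpeech utterance
degraded = augment(samples=clip, sample_rate=16000)
```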

Supervised Model Designs

  • Convolutional/Recurrent Predictors: Models such as DNSMOS Pro and DeePMOS accept log-magnitude spectrograms. DNSMOS Pro uses 2D convolutions with batch normalization and dropout, global max pooling, and a Gaussian distributional predictor (mean/variance); DeePMOS adds a bi-LSTM for frame-wise Gaussian modeling (Cumlin et al., 8 Aug 2025). A minimal sketch follows this list.
  • Optimization: Both use the Adam optimizer (learning rate 1e-4, batch size 64), with model selection based on validation PCC. Training regimens typically span 60–500 epochs.
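The sketch below renders the DNSMOS Pro-style design described above in PyTorch: convolution + batch norm + dropout blocks over a log-magnitude spectrogram, global max pooling, and a Gaussian head emitting a per-clip MOS mean and variance. Layer sizes are illustrative assumptions, not the published configuration.

```python
# Minimal DNSMOS Pro-style Gaussian MOS predictor (illustrative sizes).
import torch
import torch.nn as nn

class SpectrogramMOSNet(nn.Module):
    def __init__(self, channels=(1, 32, 64)):
        super().__init__()
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.Dropout(0.2),
            ]
        self.encoder = nn.Sequential(*blocks)
        self.head = nn.Linear(channels[-1], 2)        # -> (mean, log-variance)

    def forward(self, log_mag_spec):                  # (batch, 1, freq, time)
        h = self.encoder(log_mag_spec)
        h = torch.amax(h, dim=(2, 3))                 # global max pooling
        mean, log_var = self.head(h).unbind(dim=-1)
        return mean, log_var.exp()                    # Gaussian mean and variance

model = SpectrogramMOSNet()
mean, var = model(torch.randn(8, 1, 257, 300))        # dummy spectrogram batch
loss = nn.GaussianNLLLoss()(mean, torch.full_like(mean, 3.5), var)
```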

3. Training Algorithms and Supervision Paradigms

Three primary SQA model training paradigms have been systematically compared using human and pseudo-labeled data (Cumlin et al., 8 Aug 2025):

| Strategy | Data Source | Regimen |
|---|---|---|
| Human-only | Human-rated sets | Train directly on, e.g., NISQA_TRAIN_SIM (~10k clips, ~5 ratings each) |
| LLM-only | LLM-labeled data | Train solely on LibriAugmented (~100k pseudo-rated clips) |
| Two-stage | Both | Pretrain on LibriAugmented, then fine-tune on human-rated data |
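Operationally, the two-stage regimen reduces to two supervised passes over different label sources. A schematic sketch, with hypothetical data loaders and the model from the earlier sketch; optimizer settings follow the text (Adam, lr 1e-4, batch size 64):

```python
# Two-stage training skeleton: LLM-pseudo-label pretraining, human fine-tuning.
import torch

def run_epochs(model, loader, epochs=60, lr=1e-4):
    """One supervised stage: Gaussian NLL regression to MOS labels."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # settings from the text
    nll = torch.nn.GaussianNLLLoss()
    for _ in range(epochs):
        for spec, mos in loader:           # batches of (spectrogram, MOS label)
            mean, var = model(spec)
            loss = nll(mean, mos, var)
            opt.zero_grad()
            loss.backward()
            opt.step()

def two_stage(model, pseudo_loader, human_loader):
    run_epochs(model, pseudo_loader)       # stage 1: LLM-pseudo-labeled LibriAugmented
    run_epochs(model, human_loader)        # stage 2: human-rated fine-tuning
    return model                           # in practice, select checkpoint by validation PCC
```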

Findings:

  • LLM-only training matches or exceeds human-only training on many test sets, particularly when the human-rated corpus is small or biased (e.g., TMHINT-QI).
  • Two-stage (LLM pretraining + human fine-tuning) consistently outperforms all single-stage approaches, especially on OOD degradations and languages.

Model Evaluation Metrics

SQA employs the following metrics (computed in the sketch after this list):

  • Pearson Correlation Coefficient (PCC): linear score agreement
  • Spearman’s Rank Correlation Coefficient (SRCC): monotonic order agreement
  • Root-Mean-Square Error (RMSE): absolute score deviation
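All three metrics are a few lines of SciPy/NumPy; the scores below are hypothetical.

```python
# PCC, SRCC, and RMSE between human MOS and model predictions.
import numpy as np
from scipy.stats import pearsonr, spearmanr

y_true = np.array([3.2, 4.1, 2.5, 4.8, 3.0])    # hypothetical human MOS
y_pred = np.array([3.0, 4.3, 2.9, 4.5, 3.2])    # hypothetical model predictions

pcc, _ = pearsonr(y_true, y_pred)                # linear agreement
srcc, _ = spearmanr(y_true, y_pred)              # rank-order agreement
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # absolute deviation

print(f"PCC={pcc:.3f}  SRCC={srcc:.3f}  RMSE={rmse:.3f}")
```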

4. Experimental Evaluation Across Corpora and Degradations

Generalization and robustness are primary benchmarks for SQA systems (Cumlin et al., 8 Aug 2025). The following results were observed on a diverse benchmark suite:

DNSMOS Pro PCC by training strategy:

| Test Set | Human-only | LLM-only | Two-stage |
|---|---|---|---|
| NISQA_TEST_LIVETALK | 0.55 ± 0.05 | 0.46 ± 0.07 | 0.63 ± 0.01 |
| Tencent w/ reverb | 0.65 ± 0.06 | 0.60 ± 0.03 | 0.73 ± 0.01 |

Similar uplift was observed with DeePMOS. Two-stage training consistently achieves the highest generalization, especially on test sets with unseen languages or distortion mixes.

Interpretation: Pretraining on LLM-pseudo-labeled, distortion-balanced corpora enables models to learn generic mappings robust to unseen artifacts; human-label fine-tuning then calibrates predictions onto the human perceptual scale, mitigating LLM-induced distributional biases.

5. Limitations, Error Analysis, and Methodological Implications

A series of limitations and technical insights have emerged:

  • LLM-generated pseudo-labels: While correlating strongly with human MOS, LLM scores inherit biases from their pretraining and may inadequately represent edge-case artifacts, e.g., under- or over-rating reverberation, unnatural codecs, or context-specific degradations.
  • Coverage and Representativeness: The LibriAugmented corpus is synthetic, English-only, and lacks real in-situ channel artifacts (e.g., packet loss, nonlinear hardware).
  • Rating Granularity: LLM response protocols, shaped by prompts, can lead to coarser, less-predictive MOS distributions compared to granular ACR scales.

These findings suggest that continual refinement of pseudo-rating fidelity is necessary, including multilingual, realistic, and in-the-wild degradation simulation.

References

  • Cumlin et al. (8 Aug 2025).
