Neural Network Speech Quality Assessor

Updated 12 April 2026

The paper introduces NISQA, a non-intrusive, end-to-end model combining convolutional feature extraction and self-attention to predict subjective MOS and diagnostic quality dimensions.
It leverages a CNN–Self-Attention–Attention-Pooling architecture with multi-task regression heads to analyze acoustic and network-induced impairments from degraded speech.
Empirical evaluations show high validation PCC and low RMSE, demonstrating real-time feasibility and establishing NISQA as a robust open-source standard for speech quality assessment.

NISQA (Non-Intrusive Speech Quality Assessment) is a neural, non-intrusive, end-to-end model designed to predict subjective Mean Opinion Scores (MOS) and related quality dimensions for transmitted speech under realistic network and channel conditions. It does not require a clean reference signal and relies on the degraded speech alone. It combines convolutional feature extractors with self-attention and attention-pooling to regress overall quality and diagnostic dimensions across a broad range of acoustic and network-induced impairments. Its evolution from CNN+LSTM to CNN+Self-Attention architectures, together with its emphasis on explainability and generalizability, has positioned NISQA as a widely adopted open-source reference in speech quality assessment for communication and synthesis applications (Yi et al., 2022, Mittag et al., 2021, Tilkorn et al., 2021, Ragano et al., 2022).

1. Algorithmic Architecture and Model Variants

The NISQA architecture is fundamentally composed of sequential modules for local feature extraction and global temporal modeling, followed by regression heads for quality scores. The widely used CNN–Self-Attention–Attention-Pooling (CNN–SA–AP) version consists of four principal components (Mittag et al., 2021, Yi et al., 2022):

Convolutional Feature Extractor: Input waveforms are resampled (if needed) and converted to 48-band mel spectrograms using 20–25 ms windows with a 10 ms hop and an upper frequency limit of 16–20 kHz. These spectrograms are sliced into overlapping segments (typically 48 mel bands × 15 frames, i.e. 150 ms), producing a sequence of patches. A stack of 2D convolutional layers (e.g. 6 layers, 64 filters each, 3×3 kernels) with ReLU and batch normalization transforms each segment into a feature vector (typically 384 dimensions).
Self-Attention Temporal Modeling: Patch-level features are passed into Transformer-style self-attention blocks (e.g. 2 blocks, dimension 64) to capture long-range context, enabling the model to represent distortion patterns spanning arbitrary time scales.
Attention-Based Time Pooling: For each task (MOS and diagnostic dimensions), an attention-pooling network reweights time steps, computing a weighted sum to form a fixed-size utterance embedding, with attention weights derived from per-frame feedforward nets.
Multi-Task Regression Heads: Five parallel heads (for MOS, noisiness, coloration, discontinuity, loudness) independently regress real-valued scores.
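
The segmentation step of the feature extractor can be illustrated with a minimal sketch. It assumes a mel spectrogram stored as a list of frames (each a list of 48 band values) and a segment width of 15 frames. The hop of 4 frames between segments, the zero-padding of the final segment, and the function name `slice_segments` are illustrative assumptions, not details stated above.

```python
def slice_segments(frames, seg_len=15, seg_hop=4):
    """Split a mel spectrogram (list of frames) into overlapping segments.

    frames  : list of T frames, each a list of n_mels values
    seg_len : frames per segment (15 frames = 150 ms at a 10 ms hop)
    seg_hop : frames between segment starts (assumed value)
    """
    if not frames:
        return []
    n_mels = len(frames[0])
    pad_frame = [0.0] * n_mels
    segments = []
    for start in range(0, len(frames), seg_hop):
        seg = frames[start:start + seg_len]
        # Zero-pad the tail so every segment has exactly seg_len frames.
        seg = seg + [pad_frame] * (seg_len - len(seg))
        segments.append(seg)
        if start + seg_len >= len(frames):
            break
    return segments


# One second of audio at a 10 ms hop gives roughly 100 frames of 48 bands.
spectrogram = [[0.0] * 48 for _ in range(100)]
segments = slice_segments(spectrogram)
print(len(segments), len(segments[0]), len(segments[0][0]))
```

Each resulting segment would then be passed through the convolutional stack independently, so the number of segments sets the sequence length seen by the self-attention blocks.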

Earlier NISQA variants employed bidirectional LSTM layers for temporal modeling. In current state-of-the-art systems, self-attention followed by attention-pooling has supplanted this recurrent stage, affording better global context modeling and parameter efficiency (Mittag et al., 2021, Yi et al., 2022).
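
Attention pooling reduces a variable-length sequence of features to one fixed-size vector. The following plain-Python sketch shows the idea. The scoring function is a single linear layer with made-up weights, standing in for the per-frame feedforward network, and all names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, score_weights, score_bias=0.0):
    """Pool T feature vectors into one vector using attention weights.

    features      : list of T vectors, each of dimension D
    score_weights : D weights of a linear scoring layer (a stand-in for
                    the per-frame feedforward net; values are assumed)
    """
    # One scalar relevance score per time step.
    scores = [
        sum(w * x for w, x in zip(score_weights, frame)) + score_bias
        for frame in features
    ]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum over time gives a fixed-size utterance embedding.
    pooled = [
        sum(a * frame[d] for a, frame in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights


features = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]  # T=3 time steps, D=2
pooled, weights = attention_pool(features, score_weights=[2.0, 2.0])
print(pooled, weights)
```

In a multi-task setting, one such pooling layer per output head would be instantiated, so each quality dimension can attend to different time steps.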

2. Training Protocols and Datasets

NISQA models are trained explicitly for regression on subjective MOS, using large-scale corpora curated via ITU-T P.808 or similar protocols. The dataset strategy emphasizes coverage over language, channel, degradation type, and SNR:

Data Composition: Benchmarks such as the ConferencingSpeech 2022 Challenge aggregate up to 86,000 clips (>200 h), including simulated/real impairments: white and non-stationary noise, packet loss, bandwidth limitation, amplitude clipping, speech codecs, and network effects. MOS scores derive from both laboratory and crowdsourced listening tests, ensuring consistent ground truth (Yi et al., 2022).
Loss Objectives: The canonical NISQA loss is mean squared error (MSE) on MOS:

$L_{MSE} = \frac{1}{N} \sum_{i=1}^N \left[\mathrm{MOS}(i) - \mathrm{MOS}_p(i)\right]^2$

Here MOS(i) and MOS_p(i) denote the subjective and predicted scores of the i-th of N clips. A bias-aware, dataset-balanced MSE is employed in multi-dataset settings to avoid domain bias (Mittag et al., 2021).

Augmentation and Preprocessing: Some challenge submissions incorporated synthetic augmentation (e.g., extra packet loss, codec artifacts) to improve robustness. However, the core NISQA framework does not perform online augmentation (Yi et al., 2022, Mittag et al., 2021).
Validation: Early stopping is governed by validation PCC. Model selection is typically done via per-dataset average correlation.
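
The loss and validation criteria above can be sketched in plain Python. The patience value, the per-epoch validation predictions, and the helper names are hypothetical. The sketch uses plain MSE and omits the bias-aware, dataset-balanced variant.

```python
import math


def mse(targets, preds):
    """Mean squared error between subjective and predicted MOS."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_epoch(val_targets, preds_per_epoch, patience=2):
    """Return (epoch index, PCC) of the best epoch under early stopping.

    Iteration stops once validation PCC has failed to improve for
    `patience` consecutive epochs (patience is an assumed setting).
    """
    best_idx, best_pcc, stale = -1, float("-inf"), 0
    for epoch, preds in enumerate(preds_per_epoch):
        pcc = pearson(val_targets, preds)
        if pcc > best_pcc:
            best_idx, best_pcc, stale = epoch, pcc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx, best_pcc


# Made-up validation labels and per-epoch predictions for illustration.
val_targets = [1.5, 2.5, 3.5, 4.5]
preds_per_epoch = [
    [3.0, 2.9, 3.1, 3.0],
    [2.5, 2.4, 3.4, 3.6],
    [1.7, 2.6, 3.4, 4.3],
    [1.8, 2.2, 3.9, 4.1],
    [2.0, 2.0, 4.0, 4.0],
]
best_idx, best_pcc = best_epoch(val_targets, preds_per_epoch)
print(best_idx, round(best_pcc, 4), mse(val_targets, preds_per_epoch[best_idx]))
```

Selecting on correlation instead of loss keeps model selection aligned with the metric used for reporting, at the cost of ignoring any constant offset in the predictions.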

3. Quality Dimensions and Interpretability

A distinguishing feature of modern NISQA architectures is the regression of not only overall MOS, but also secondary perceptual quality dimensions:

Dimensions: Noisiness (background/additive noise), Coloration (spectral artefacts/codecs), Discontinuity (packet loss, dropouts), Loudness (level issues). Each is regressed via a dedicated output head.
Diagnostic Utility: These dimensions facilitate root-cause analysis of quality impairments, allowing operators to link high discontinuity with packet loss, or high coloration with codec bandwidth restrictions (Mittag et al., 2021).
Interpretability: Occlusion sensitivity, DeepLIFT, Integrated Gradients, and Conductance methods have been used to attribute MOS predictions to spectrotemporal regions and convolutional features. NISQA’s CNN filters specialize on interpretable patterns: silence (horizontal bands), noise/interruption (vertical spikes), and frequency-localized features. However, substantial feature redundancy is present, with some CNN channels essentially inactive, indicating over-capacity (Tilkorn et al., 2021).
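
Occlusion sensitivity, the simplest of the attribution methods listed, can be sketched as follows. The `toy_predictor` is a purely hypothetical stand-in for a trained model, and the window size and fill value are assumed. The exact procedure of the cited study is not reproduced here.

```python
def occlusion_sensitivity(spec, predict, win=5, fill=0.0):
    """Score change caused by masking each time window of a spectrogram.

    spec    : list of T frames, each a list of band values
    predict : callable mapping a spectrogram to a scalar quality score
    win     : width of the occluded window in frames (assumed value)
    fill    : value written into occluded frames (assumed value)
    """
    baseline = predict(spec)
    n_bands = len(spec[0])
    deltas = []
    for start in range(0, len(spec), win):
        occluded = [list(frame) for frame in spec]  # copy every frame
        for t in range(start, min(start + win, len(spec))):
            occluded[t] = [fill] * n_bands
        # Positive delta: masking this region raised the predicted score,
        # so the region was pulling the prediction down.
        deltas.append((start, predict(occluded) - baseline))
    return baseline, deltas


def toy_predictor(spec):
    """Hypothetical stand-in model: penalises high-energy frames."""
    bursts = sum(1 for frame in spec if sum(frame) / len(frame) > 0.5)
    return 5.0 - 0.4 * bursts


# 20 quiet frames with a 3-frame noise burst at frames 10 to 12.
spec = [[0.1] * 8 for _ in range(20)]
for t in (10, 11, 12):
    spec[t] = [1.0] * 8

baseline, deltas = occlusion_sensitivity(spec, toy_predictor)
most_influential = max(deltas, key=lambda d: abs(d[1]))
print(baseline, most_influential)
```

The same loop can be run over frequency bands instead of time windows to obtain a spectrotemporal attribution map.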

4. Performance Evaluation and Benchmarking

NISQA’s benchmarking relies on standardized protocols and diverse evaluation metrics:

Primary Metrics: Root Mean Squared Error (RMSE), Pearson’s correlation coefficient (PCC), and Outlier Ratio (OR), with predictions mapped via monotonic third-order polynomial fitting to address scale mismatches (per ITU-T P.1401) (Yi et al., 2022, Ragano et al., 2022).
Empirical Results: On held-out test sets, the full NISQA model achieves validation/test PCCs between 0.87 and 0.97 and RMSE as low as 0.23, outperforming earlier single-ended predictors (P.563, ANIQUE+). Double-ended models like POLQA retain an advantage on simple laboratory speech but are surpassed by NISQA on conversational, live, or highly degraded data (Mittag et al., 2021, Yi et al., 2022).
Challenge Outcomes: In the ConferencingSpeech 2022 challenge, 11 of 18 systems outperformed the NISQA-based baseline, with top models achieving RMSE_MAP ≈ 0.32–0.34 and PCC > 0.90. Advances included deeper CNNs, multi-task objectives, ensembling, and explicit feature fusion. Reliably lower ORs at extremes indicate better handling of very poor/excellent conditions (Yi et al., 2022).
Speech Synthesis: On the VoiceMOS challenge, NISQA yielded utterance-level MSE ≈ 0.30 and LCC/SRCC ≈ 0.80, but was outperformed by SSL MOS predictors (wav2vec 2.0 finetuned, LCC/SRCC ≈ 0.87). This suggests limitations due to lack of self-supervised pretraining and challenges generalizing to synthetic speech types (Ragano et al., 2022).
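
The third-order mapping applied before computing RMSE can be sketched in plain Python. This is an unconstrained least-squares cubic fit solved through the normal equations. It does not enforce monotonicity, and the score values and function names are illustrative only.

```python
import math


def fit_cubic(preds, targets):
    """Least-squares cubic mapping from predicted to subjective MOS.

    Solves the 4x4 normal equations by Gaussian elimination. Needs at
    least four distinct prediction values. Monotonicity is NOT enforced.
    """
    # Normal equations: sum_k c_k * S[j + k] = sum(t * p**j), j = 0..3
    power_sums = [sum(p ** k for p in preds) for k in range(7)]
    a = [[power_sums[j + k] for k in range(4)] for j in range(4)]
    b = [sum(t * p ** j for p, t in zip(preds, targets)) for j in range(4)]

    for col in range(4):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, 4), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, 4):
            factor = a[row][col] / a[col][col]
            for k in range(col, 4):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]

    # Back substitution.
    coeffs = [0.0] * 4
    for row in range(3, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, 4))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs  # c0 + c1*x + c2*x**2 + c3*x**3


def apply_cubic(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def rmse(targets, preds):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)
    )


# Illustrative values only: predictions compressed toward mid-scale.
subjective = [1.2, 1.9, 2.6, 3.1, 3.8, 4.4, 4.7]
predicted = [2.0, 2.3, 2.7, 3.0, 3.4, 3.7, 3.9]

coeffs = fit_cubic(predicted, subjective)
mapped = [apply_cubic(coeffs, p) for p in predicted]
print(rmse(subjective, predicted), rmse(subjective, mapped))
```

Because the identity mapping is itself a cubic, the RMSE after mapping cannot exceed the raw RMSE on the data used for the fit, which is why mapped and unmapped figures should not be compared directly.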

5. Practical Considerations and Deployment

Deployment of NISQA as an instrumental speech quality predictor has several practical affordances:

Throughput: Inference is real-time or faster on modern GPUs (e.g., 10× faster than real time on GTX 1080 Ti), with memory footprint ≈ 50 MB for parameters (Mittag et al., 2021).
Scalability: The model generalizes robustly to unseen speaker, language, and impairment types, attributed to the diversity of the training corpus (Mittag et al., 2021).
Accessibility: Open-source code and pretrained weights are available, with simple Python APIs. Inputs are mono .wav files sampled at ≥8 kHz, and outputs are returned as JSON or a dict containing the five scalar predictions.
Limitations: NISQA exhibits degraded performance on conditions outside its training distribution (heavy reverberation, TTS artefacts, extreme jitter), and the orthogonality of Loudness prediction is not perfect, requiring caution in interpretation (Mittag et al., 2021, Ragano et al., 2022).
Model Size: At ≈218 K parameters, NISQA remains relatively lightweight compared to contemporary SSL-based MOS predictors, making it suitable for embedded and on-device assessment (Ragano et al., 2022).
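
The input constraints mentioned above (mono .wav, sampling rate of at least 8 kHz) can be checked before inference with Python's standard `wave` module. This is a minimal pre-flight sketch. The function name and the temporary test file are illustrative and not part of the NISQA code base, and the `wave` module reads only uncompressed PCM files.

```python
import os
import tempfile
import wave


def check_input(path, min_rate=8000):
    """Return (ok, reason) for a candidate input file.

    Checks only the two constraints named in the text: mono audio and a
    sampling rate of at least `min_rate` Hz.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if channels != 1:
        return False, f"expected mono, got {channels} channels"
    if rate < min_rate:
        return False, f"sampling rate {rate} Hz is below {min_rate} Hz"
    return True, f"ok: {duration:.2f} s at {rate} Hz"


# Write one second of 16-bit mono silence at 16 kHz to try the check.
tmp_path = os.path.join(tempfile.mkdtemp(), "example.wav")
with wave.open(tmp_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

ok, reason = check_input(tmp_path)
print(ok, reason)
```

Rejecting unsuitable files up front avoids silent resampling or channel mixing that could itself alter the predicted quality.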

6. Extensions, Comparisons, and Outlook

Architectural Extensions: Competitive challenge submissions extend NISQA with multi-head attention, residual blocks, multi-task heads for non-MOS targets, and feature-fusion with hand-crafted audio descriptors (Yi et al., 2022).
Comparison with SSL Approaches: Self-supervised learning (SSL) models such as wav2vec 2.0 surpass NISQA in direct MOS correlation, benefiting from pretraining on large raw audio corpora. No gain was observed from fusing NISQA’s spectral features with SSL embeddings; a well-trained SSL frontend alone suffices (Ragano et al., 2022).
Dataset and Bias Considerations: The importance of rigorous, bias-minimized, and sufficiently stratified crowdsourced evaluation is highlighted, as statistical performance can be confounded by imbalanced system types or hidden artifacts in data splits (Ragano et al., 2022).
Open Issues: While advances in SSL have redefined the state-of-the-art, NISQA remains a strong, interpretable, and efficient baseline. Future improvements may rely on hybridizing end-to-end and SSL frameworks, and on the curation of more comprehensive, distortion-diverse evaluation corpora.

7. Reference Table: NISQA Model Variants and Key Results

Variant	Temporal Module	Evaluation Corpora	PCC (MOS)	RMSE	Multi-Dim Output
NISQA CNN+LSTM	BiLSTM	SwissQual P.OLQA, etc.	~0.90	~0.15	No (MOS only)
NISQA CNN+SelfAtt+AP	2-block Self-Att	Test P501, LiveTalk, etc.	0.90–0.97	0.23–0.35	Yes
Challenge Baseline 1	Dense + BiLSTM	ConfSpeech 2022 blind sets	~0.78	~0.46	No
Challenge Baseline 2	CNN+SelfAtt+AP	ConfSpeech 2022 blind sets	~0.89	~0.36	No
Top Challenge Systems	Deep CNN+SA/MTL/ens.	ConfSpeech 2022 blind sets	>0.90	~0.32	Some

References

"ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications" (Yi et al., 2022)
"NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets" (Mittag et al., 2021)
"Visualising and Explaining Deep Learning Models for Speech Quality Prediction" (Tilkorn et al., 2021)
"A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality" (Ragano et al., 2022)