NISQA Test Sets Overview
- NISQA Test Sets are curated evaluation datasets for non-intrusive speech quality assessment, featuring both simulated impairments and live communication conditions.
- They combine engineered degradations with real-world recordings, offering multi-dimensional subjective ratings including MOS and perceptual measures like noisiness and coloration.
- The test sets support robust benchmarking and evaluation of model generalization in telecommunications through rigorous annotation protocols and established evaluation metrics.
NISQA Test Sets are a suite of curated evaluation and benchmarking datasets fundamental to research and development in non-intrusive speech quality assessment, particularly as implemented in the NISQA (Non-Intrusive Speech Quality Assessment) model family. These test sets span a comprehensive range of simulated and live speech conditions, enriched with multi-dimensional subjective quality annotations, and are used for training, validation, and robust assessment of algorithmic models intended to estimate perceived speech quality in real-world telecommunication scenarios. The core attributes of the NISQA test suite are its diversity—encompassing distortions representative of both laboratory-controlled and in-the-wild conditions—and its meticulous annotation protocols, which are harmonized with international standards such as ITU-T P.800 and P.808.
1. Overall Dataset Composition and Structure
The NISQA data pool, as reported in "NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets" (Mittag et al., 2021), aggregates 81 distinct datasets. Of these, 59 datasets (72,903 files) are used for model training and 18 datasets (9,567 files) are reserved for validation. For final evaluation and generalization testing, four independent test sets encompassing 952 files are employed.
The dataset composition includes:
- 55 datasets from the POLQA Pool, covering standardized transmission impairments.
- 7 datasets from ITU-T P Suppl. 23, targeting various experiment-driven degradations.
- 11 legacy internal speech quality datasets.
- 8 newly created datasets, spanning controlled simulation and live capture (see Sections 2 and 3 for details).
The data encompasses a wide spectrum of conditions, including codec impairments, packet loss, bandpass filtering, clipping, and the addition of authentic background noise derived from DNS-Challenge, Audioset, freesound, and DEMAND resources.
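A minimal sketch of how degradations of this kind can be simulated is shown below. The filter band, clipping level, SNR, and the `degrade` helper are illustrative assumptions and do not reproduce the exact NISQA processing chain.

```python
# Illustrative simulation of the degradation types listed above:
# bandpass filtering, amplitude clipping, and additive background noise.
import numpy as np
from scipy.signal import butter, sosfilt

def degrade(clean, noise, sr=48000, snr_db=15.0, clip_level=0.3):
    # Bandpass filter roughly emulating a narrowband telephone channel.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    x = sosfilt(sos, clean)

    # Amplitude clipping.
    x = np.clip(x, -clip_level, clip_level)

    # Scale the noise so that the result has the requested SNR, then add it.
    noise = noise[: len(x)]
    gain = np.sqrt(np.mean(x**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return x + gain * noise
```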
2. Newly Created Speech Quality Datasets
To address generalization and ensure robustness against both known and unforeseen distortions, eight new datasets were specifically constructed with granular, multi-dimensional quality ratings:
Simulated Datasets:
- NISQA_TRAIN_SIM: 10,000 samples, 2,322 speakers, distorted with programmatically applied impairments and diverse noise sources.
- NISQA_VAL_SIM: 2,500 samples, 938 speakers, similarly constructed.
Live Datasets:
- NISQA_TRAIN_LIVE: 1,020 samples, 486 speakers, derived by playing clean speech from sources such as LibriVox into real telephony devices and re-recording the output.
- NISQA_VAL_LIVE: 200 samples, 102 speakers, constructed analogously.
The live datasets uniquely capture uncontrolled, real-world degradations and spontaneous conversational phenomena through ambient environmental noise (e.g., typing, traffic) and genuine network variability. Ratings were crowdsourced (five raters per sample) according to ITU-T P.808, yielding MOS and four perceptual dimensions per file: noisiness, coloration, discontinuity, and loudness.
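As a concrete illustration of how such per-file labels can be derived, the sketch below averages individual crowdsourced votes into per-dimension scores. The file name, column names, and per-rating table layout are assumptions for illustration; the released corpus ships its own metadata format.

```python
# Hypothetical per-rating table with one row per (file, rater) pair and the
# five rated quantities; column and file names are illustrative only.
import pandas as pd

ratings = pd.read_csv("nisqa_train_live_ratings.csv")
per_file = (
    ratings.groupby("filename")[["mos", "noi", "col", "dis", "loud"]]
    .mean()          # average over the ~5 crowdsourced votes per file
    .round(2)
)
print(per_file.head())
```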
3. Independent Test Sets and the NISQA_LIVETALK Dataset
Four test sets assess model generalization to unseen conditions:
- NISQA_TEST_P501, NISQA_TEST_FOR, NISQA_TEST_NSC: These combine simulated distortions and live VoIP recording degradations through platforms such as Zoom, Skype, Google Meet, WhatsApp, and Discord. Network-induced artifacts include packet loss and bandwidth reduction.
- NISQA_TEST_LIVETALK: This "live-talking" set targets the realistic deployment use case: 232 recordings from 8 talkers (balanced gender) across 58 distinct conditions covering environments such as cafés, highways, poor-reception zones, and shopping centres, with 4 files per condition. Each file was rated in a laboratory listening test under ITU-T P.800, with 24 ratings per file.
Table 1 summarizes the structure of the newly created training, validation, and live-talking datasets:

| Dataset | Source Type | Size (files) | Annotation Protocol |
|---|---|---|---|
| NISQA_TRAIN_SIM | Simulated | 10,000 | Crowdsourced (P.808) |
| NISQA_VAL_SIM | Simulated | 2,500 | Crowdsourced (P.808) |
| NISQA_TRAIN_LIVE | Live re-recorded | 1,020 | Crowdsourced (P.808) |
| NISQA_VAL_LIVE | Live re-recorded | 200 | Crowdsourced (P.808) |
| NISQA_TEST_LIVETALK | Real telephony, spontaneous | 232 | Laboratory (P.800) |
This design exposes NISQA models to diverse phonetic content, recording devices, spontaneous conversational styles, and unpredictable environmental challenges, ensuring robust evaluation and deployment relevance.
4. Annotation and Evaluation Methodology
Subjective ratings employ MOS as well as four detailed perceptual dimensions (NOI, COL, DIS, LOUD) per clip, scored on a 1–5 scale. The annotation protocol depends on the dataset: simulated and live training/validation sets use ITU-T P.808 crowdsourced ratings (five per file), while the NISQA_LIVETALK set benefits from laboratory-controlled ITU-T P.800 assessments with much higher rater agreement (24 per file).
In the context of the ConferencingSpeech 2022 Challenge (Yi et al., 2022), additional test corpora derived from NISQA signals supplement the evaluation. The TUB set, for instance, is created from 865 pristine conversational segments degraded under 62 defined synthetic conditions; mean opinion scores for these test sets are collected with more than 18 ratings per clip.
Objective assessment metrics follow established conventions:
- Root mean squared error (RMSE): $\mathrm{RMSE}=\sqrt{\tfrac{1}{N}\sum_{i=1}^{N}\big(\mathrm{MOS}_i-\widehat{\mathrm{MOS}}_i\big)^{2}}$ between subjective and predicted MOS.
- Pearson correlation coefficient (PCC) between subjective and predicted MOS.
- Outlier ratio (OR): Frequency of predictions deviating from human ratings beyond a threshold.
To address cross-corpus scale variation, regression-based mapping (third-order monotonic polynomial) aligns predicted and reference MOS.
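The sketch below computes these metrics under simple assumptions: the outlier threshold is a fixed illustrative value (standards such as ITU-T P.1401 define it relative to rating confidence intervals), and the third-order polynomial fit is unconstrained, i.e., monotonicity is not enforced here.

```python
# Evaluation sketch: PCC, plus RMSE and outlier ratio computed after a
# third-order polynomial mapping of predictions onto the subjective scale.
import numpy as np
from scipy.stats import pearsonr

def evaluate(mos_true, mos_pred, outlier_threshold=0.5):
    mos_true = np.asarray(mos_true, dtype=float)
    mos_pred = np.asarray(mos_pred, dtype=float)

    pcc = pearsonr(mos_true, mos_pred)[0]

    # Third-order polynomial mapping to compensate for cross-corpus scale offsets.
    coeffs = np.polyfit(mos_pred, mos_true, deg=3)
    mapped = np.polyval(coeffs, mos_pred)

    rmse = np.sqrt(np.mean((mos_true - mapped) ** 2))
    outlier_ratio = np.mean(np.abs(mos_true - mapped) > outlier_threshold)
    return {"PCC": pcc, "RMSE": rmse, "OR": outlier_ratio}
```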
5. Role in Model Development and Generalization
The design of the NISQA test sets is grounded in the demonstrated need for broad generalization in non-intrusive speech quality estimation. Simulated datasets train models to recognize engineered and deterministic impairments, whereas live datasets, especially NISQA_LIVETALK, present complex, context-dependent distortions. This combination enforces broad coverage and model resilience.
Performance metrics reported in (Mittag et al., 2021) show that including diverse test sets, both synthetic and live, improves generalization to unknown samples. The attention-pooling mechanism allows models to adaptively weight relevant time segments, further enhancing predictive reliability on variable-length real-world samples.
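A minimal PyTorch-style sketch of such attention pooling is given below; the single-layer scoring function and dimensions are assumptions for illustration rather than the exact NISQA architecture.

```python
# Attention pooling: a learned scoring layer produces per-frame weights that
# are softmax-normalized over time and used to average the frame features.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feature_dim)
        weights = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        return (weights * frames).sum(dim=1)                 # (batch, feature_dim)

pooled = AttentionPooling(64)(torch.randn(2, 150, 64))  # -> shape (2, 64)
```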
6. Impact on Research Benchmarks and Multidimensional Assessment
NISQA test sets facilitate multidimensional speech quality research, enabling both point estimation and the development of advanced probabilistic models as illustrated in "Multivariate Probabilistic Assessment of Speech Quality" (Cumlin et al., 2025). With per-file ratings for overall MOS and four specific quality dimensions, the data supports multivariate Gaussian modeling:
- Predictive distribution over the quality dimensions: $p(\mathbf{y}\mid x)=\mathcal{N}\big(\mathbf{y};\,\boldsymbol{\mu}(x),\,\boldsymbol{\Sigma}(x)\big)$.
- Cholesky-based construction of the covariance: $\boldsymbol{\Sigma}(x)=\mathbf{L}(x)\mathbf{L}(x)^{\top}$, with $\mathbf{L}(x)$ lower triangular and positive on the diagonal.
- Affine transformation for calibration of the predictions, e.g., $\mathbf{y}\mapsto\mathbf{A}\mathbf{y}+\mathbf{b}$.
These approaches permit uncertainty quantification as well as diagnosis of correlations across the noisiness, discontinuity, coloration, and loudness dimensions, capabilities not available in datasets that provide only scalar MOS labels.
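As a hedged sketch of how such a model head can be built (not the exact parameterization of Cumlin et al.), the example below splits a network output into a mean vector and a lower-triangular Cholesky factor with a softplus-positive diagonal, forms a multivariate Gaussian over the five quality dimensions, and evaluates its negative log-likelihood.

```python
# Multivariate Gaussian quality head: mean plus Cholesky-parameterized covariance.
import torch

D = 5  # MOS plus the four perceptual dimensions (NOI, COL, DIS, LOUD)

def build_distribution(raw: torch.Tensor) -> torch.distributions.MultivariateNormal:
    # raw: (batch, D + D*(D+1)//2) output of some regression network.
    mean, tril_flat = raw[:, :D], raw[:, D:]
    tril = torch.zeros(raw.shape[0], D, D)
    rows, cols = torch.tril_indices(D, D)
    tril[:, rows, cols] = tril_flat
    # Make the diagonal strictly positive so tril is a valid Cholesky factor.
    diag = torch.nn.functional.softplus(torch.diagonal(tril, dim1=-2, dim2=-1))
    tril = torch.tril(tril, diagonal=-1) + torch.diag_embed(diag)
    return torch.distributions.MultivariateNormal(loc=mean, scale_tril=tril)

raw = torch.randn(8, D + D * (D + 1) // 2)   # stand-in for network output
dist = build_distribution(raw)
targets = torch.rand(8, D) * 4 + 1           # ratings on the 1-5 scale
nll = -dist.log_prob(targets).mean()         # training objective
print(dist.mean.shape, dist.covariance_matrix.shape, nll.item())
```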
7. Implications for Future Research and Applications
The extensive, meticulously annotated NISQA test sets have become a cornerstone for replicable and reliable evaluation in non-intrusive speech quality assessment. They provide the empirical foundation for benchmarking, model selection, and advancement of algorithms intended for network monitoring, VoIP, and online conferencing, as evidenced by their central role in the ConferencingSpeech 2022 Challenge (Yi et al., 2022). The multidimensional annotation schema allows researchers to transcend scalar MOS prediction toward nuanced, explainable, and uncertainty-aware assessment. A plausible implication is that future test set construction and evaluation protocols in this research area will require similarly diverse, environment-rich, and finely annotated resources to advance diagnostic reliability and practical deployment.