NKI-SpeechRT Dataset Overview
- NKI-SpeechRT is a multi-modal dataset featuring synchronized RT-MRI videos, audio, and static MRI for detailed analysis of speech articulation and severity.
- It includes high-resolution imaging and acoustic recordings enriched with perceptual intelligibility and noise ratings to support robust speech diagnostics.
- The dataset enables applications in articulatory phonetics, clinical research, and machine learning-driven automatic speech severity metrics and reconstruction methods.
The NKI-SpeechRT Dataset encompasses two distinct but related corpora: one centered on real-time magnetic resonance imaging (RT-MRI) of speech articulation with aligned audio (Lim et al., 2021), the other focused on speech recordings, perceptual intelligibility ratings, and noise annotations for evaluating speech severity using acoustic and linguistic modeling (Halpern et al., 2025). Both have had significant impact on research in articulatory phonetics, clinical speech analysis, and the development and evaluation of automatic speech severity metrics.
1. Corpus Composition and Modalities
NKI-SpeechRT (RT-MRI Articulatory Dataset)
This component comprises multimodal data from 75 subjects performing linguistically motivated speech tasks. The main data modalities are:
- 2D RT-MRI Videos (Sagittal View): Captured at 83 frames per second, these dynamic mid-sagittal images visualize rapid vocal tract movements during both scripted and spontaneous speech. Each sequence is time-synchronized with co-recorded audio.
- Raw acquisition: Provided in vendor-agnostic MRD format (formerly ISMRMRD).
- Reconstructions: HDF5 format image stacks and MPEG-4 videos.
- 3D Volumetric Static MRI: Acquired during sustained production of vowels, continuant consonants, and specific vocal tract postures. Each scan lasts 7 seconds per stimulus, yielding full vocal-tract 3D volumes (1.25 mm isotropic) and mid-sagittal PNG snapshots.
- High-resolution Static T2-weighted MRI: Axial, coronal, and sagittal scans for detailed anatomical reference (DICOM format).
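As a minimal sketch of working with the reconstructed RT-MRI image stacks, the following loads an HDF5 video with h5py; the file path and dataset name are assumptions, so inspect the released folder layout (sub001, sub002, ...) for the actual structure:

```python
import h5py
import numpy as np

# Hypothetical path and dataset key; check f.keys() against the release layout.
path = "sub001/2drt/recon/sub001_video.h5"

with h5py.File(path, "r") as f:
    frames = np.asarray(f["recon"])  # expected shape ~(n_frames, 84, 84)

# At 83 fps, frame_index / 83 gives the timestamp for alignment with the audio.
print(frames.shape, frames.dtype)
```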
NKI-SpeechRT (Speech Severity Evaluation Dataset)
This dataset is derived from the earlier NKI-CCRT corpus but extends its utility for speech severity modeling:
- Speech Recordings: 55 speakers (47 native speakers of Dutch), each with up to five sessions across treatment stages (pre-treatment, then post-treatment from 10 weeks up to 12 months). Speech was elicited by reading the Dutch text "De vijvervrouw" by Godfried Bomans, recorded with Sennheiser MD421 microphones and Edirol Roland R-1 recorders, and downsampled to 16 kHz, 16-bit PCM.
- Perceptual Intelligibility Ratings: Each recording was rated by 14 recent Dutch SLP graduates on a 7-point Likert scale. High inter-rater reliability (ICC(2,k) = 0.9174) substantiates the consistency of these scores.
- Noise Annotations: A linguist (non-SLP) provided a subjective noise rating per recording (0: minimal, 1: moderate, 2: severe disturbances including background voices or ringing), enabling noise robustness analysis for downstream models.
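An ICC of this type can be reproduced from a long-format ratings table with the pingouin package; a minimal sketch follows, where the CSV filename and column names are assumptions about the released layout:

```python
import pandas as pd
import pingouin as pg

# Long format: one row per (recording, rater) pair with the 7-point score.
ratings = pd.read_csv("intelligibility_ratings.csv")  # hypothetical filename

icc = pg.intraclass_corr(
    data=ratings, targets="recording", raters="rater", ratings="score"
)
# ICC(2,k): two-way random effects, absolute agreement, average of k raters.
print(icc.loc[icc["Type"] == "ICC2k", ["Type", "ICC", "CI95%"]])
```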
2. Data Acquisition and Technical Parameters
RT-MRI: Imaging and Synchronization
MR data were collected on a 1.5 T GE Signa Excite system with 40 mT/m gradients and a custom 8-channel upper-airway receiver array. The dynamic acquisition employed a 13-interleaf spiral-out spoiled gradient-echo sequence (bit-reversed ordering), with principal parameters:
- TR = 6.004 ms, TE = 0.8 ms
- FOV = 200 × 200 mm², 84 × 84 matrix (2.4 × 2.4 mm² in-plane res.)
- 6 mm slice thickness, flip angle = 15°
- Receiver bandwidth = ±125 kHz
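If each reconstructed frame aggregates two spiral interleaves (an assumption about the view-sharing scheme, but one consistent with the frame rate quoted in Section 1), the nominal frame rate follows directly from the TR:

$$f_{\text{frame}} = \frac{1}{2 \times \mathrm{TR}} = \frac{1}{2 \times 6.004\ \text{ms}} \approx 83.28\ \text{fps}$$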
Stimulus presentation and real-time progress feedback used the RT-Hawk platform (projector-mirror arrangement), and dual hardware clocks synchronized audio (fiber-optic microphone) and imaging.
3D Volumetric MRI relied on a 3D gradient-echo sequence with 7-fold Cartesian sparse sampling. T2-weighted imaging used fast spin echo protocols.
Speech Severity: Recording and Annotation
- All acoustic data standardized to 16 kHz/16-bit PCM.
- Multi-stage longitudinal design (treatment time points).
- Perceptual intelligibility and noise ratings were sourced via independent listening experiments.
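A minimal resampling sketch with librosa and soundfile, matching the 16 kHz/16-bit target (the filename is hypothetical):

```python
import librosa
import soundfile as sf

# Hypothetical per-session filename; resample to the corpus standard of 16 kHz.
audio, sr = librosa.load("speaker01_session1.wav", sr=16000)
sf.write("speaker01_session1_16k.wav", audio, sr, subtype="PCM_16")  # 16-bit PCM
```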
3. Intended Applications and Research Use Cases
Speech Science & Linguistics
The synchronized RT-MRI and audio corpus allows in-depth analysis of speech articulation, including:
- Dynamic tongue, lip, and velum movement patterns.
- Inter-speaker variability studies.
- Phonetic/phonological phenomena related to complex speech tasks.
Clinical Research and Diagnostics
- Enables visual and auditory assessment of speech in health and disease.
- High-resolution volumetric and T2-weighted images provide an anatomical basis for understanding and quantifying morpho-functional changes due to treatment (e.g., head and neck cancer interventions).
Speech Severity Modeling
The speech dataset with associated intelligibility and noise scores supports the evaluation of:
- Reference-free automatic speech severity metrics (e.g., SpeechLMScore).
- Traditional acoustic features (shimmer, jitter, F0 SD, WADA SNR, HNR, etc.).
- Reference-based phoneme error rates (PER).
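PER is an edit-distance rate over phoneme tokens; since jiwer's word error rate operates on whitespace-separated tokens, feeding it space-separated phoneme strings yields PER directly. The sequences below are illustrative, not drawn from the corpus:

```python
import jiwer

# Reference from a canonical transcript, hypothesis from a phoneme recognizer.
ref = "d @ v Ei v @ r v r Au"  # illustrative X-SAMPA-style phonemes
hyp = "d @ f Ei v @ r r Au"

per = jiwer.wer(ref, hyp)  # WER over phoneme tokens == PER
print(f"PER: {per:.2%}")
```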
Imaging and Algorithmic Innovation
- Public provision of raw multi-coil MRI facilitates benchmarking and development of dynamic image reconstruction, artifact correction, and feature extraction methods.
- The datasets serve as training and evaluation platforms for machine learning and deep learning-based approaches in medical imaging and speech analysis.
4. Analysis Methods and Statistical Summary
MRI Reconstruction
Dynamic RT-MRI images are reconstructed by minimizing the objective

$$\hat{m} = \arg\min_{m}\; \lVert A m - d \rVert_2^2 + \lambda \lVert \nabla_t m \rVert_1,$$

where $A$ is the encoding matrix (incorporating the non-uniform FFT and coil sensitivities), $m$ is the image series, $d$ the acquired data, $\nabla_t$ the temporal finite-difference operator, and $\lambda$ the regularization factor. Coil sensitivities are estimated with the Walsh method; a nonlinear conjugate gradient solver with Fletcher-Reeves updates yields ~83.28 fps image series at ~160 ms of reconstruction time per frame. For 3D data, sparse-SENSE with total-variation (TV) constraints is implemented with GPU acceleration (e.g., via the BART toolbox).
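A toy NumPy rendering of this objective is sketched below, with a generic linear operator `A` standing in for the NUFFT-plus-coil-sensitivities encoding used in the actual pipeline:

```python
import numpy as np

def recon_objective(m, A, d, lam):
    """Evaluate ||A m - d||_2^2 + lam * ||D_t m||_1 for an image series m.

    m: complex array of shape (T, Ny, Nx); A: generic linear operator
    (matrix) mapping the flattened series to k-space samples d.
    """
    residual = A @ m.ravel() - d
    data_term = np.vdot(residual, residual).real
    dt = np.diff(m, axis=0)  # temporal finite differences D_t m
    return data_term + lam * np.abs(dt).sum()
```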
Speech Severity Metrics
- Shimmer (local): $\text{Shimmer} = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1} \lvert A_i - A_{i+1} \rvert}{\frac{1}{N}\sum_{i=1}^{N} A_i}$, where $A_i$ is the peak amplitude of the $i$-th glottal period.
- Jitter (local): $\text{Jitter} = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1} \lvert T_i - T_{i+1} \rvert}{\frac{1}{N}\sum_{i=1}^{N} T_i}$, where $T_i$ is the duration of the $i$-th glottal period.
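In practice these measures are commonly extracted with Praat; a minimal sketch using the parselmouth bindings follows (the source does not name an extraction tool, the filename is hypothetical, and the numeric parameters are Praat's defaults):

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("speaker01_session1.wav")  # hypothetical filename
pp = call(snd, "To PointProcess (periodic, cc)", 75, 500)  # pitch floor/ceiling (Hz)

jitter = call(pp, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer = call([snd, pp], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
print(f"jitter={jitter:.4f}, shimmer={shimmer:.4f}")
```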
SpeechLMScore: Acoustic representations from a pretrained HuBERT-Base model (LibriSpeech 960 h) are quantized into discrete units and modeled with an LSTM language model pretrained on LibriLight. Perplexity is computed for each token sequence; lower values indicate more natural speech and lower estimated severity.
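A minimal sketch of the perplexity step, assuming next-unit logits from any autoregressive unit language model (the corpus paper's choice is the LSTM described above):

```python
import torch

def unit_perplexity(logits: torch.Tensor, units: torch.Tensor) -> float:
    """Perplexity of a discrete-unit sequence under a unit LM.

    logits: (T, V) next-unit predictions for each position;
    units:  (T,) observed unit indices (e.g., quantized HuBERT units).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs[torch.arange(units.numel()), units].mean()
    return torch.exp(nll).item()  # lower = more natural, lower estimated severity
```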
Statistical Evaluation: Model outputs, averaged at the speaker level, are correlated with perceptual intelligibility using Pearson's r. On NKI-SpeechRT, both SpeechLMScore and the reference-based PER correlate significantly with intelligibility, with PER showing the stronger correlation.
Noise Robustness: SpeechLMScore correlates only weakly with the noise ratings, demonstrating insensitivity to recording artifacts; this contrasts with noise-sensitive features such as WADA SNR.
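A minimal sketch of the speaker-level correlation with pandas and SciPy (column names are assumptions about the evaluation tables):

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("scores.csv")  # hypothetical: speaker, metric_score, intelligibility
per_speaker = df.groupby("speaker")[["metric_score", "intelligibility"]].mean()

r, p = pearsonr(per_speaker["metric_score"], per_speaker["intelligibility"])
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```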
5. Data Structure, Access, and Licensing
| Component | Format(s) / Structure | Notes |
|---|---|---|
| 2D RT-MRI + audio | MRD, HDF5, MPEG-4 | Synchronized, per subject/session |
| 3D volumetric static MRI | MAT, PNG | Snapshots and volumes per stimulus |
| Static T2-weighted anatomical MRI | DICOM | Multiplanar |
| Speech severity recordings + ratings | WAV (16 kHz, 16-bit), CSV (ratings) | With noise annotation columns |
- Total corpus size: ~966 GB across subject-organized folders (e.g., sub001, sub002, ...).
- Associated scripts and tools (MATLAB, Python) are available on GitHub under the MIT license.
- Distribution is via figshare, with public access under open (citable) terms.
6. Limitations and Research Implications
The NKI-SpeechRT corpus is distinguished by its breadth—dynamic imaging, high-quality speech/audio, aligned anatomical reference, and rigorous perceptual metadata. However, several considerations are salient:
- The RT-MRI dataset is hampered by the intrinsic constraints of 2D imaging (mid-sagittal only) for dynamic studies, though the inclusion of 3D and static anatomical scans partly mitigates this.
- For speech severity evaluation, reliance on read speech (the “De vijvervrouw” text) limits ecological generalizability, though this design is common for longitudinal intelligibility research.
- While reference-based PER demonstrates higher score correlations, SpeechLMScore’s reference-free approach offers practical advantages in clinical and spontaneous speech applications where transcripts or pristine references are unavailable.
- The rich noise annotations in NKI-SpeechRT enable systematic investigation of model robustness—an increasingly relevant property for real-world deployment.
A plausible implication is that future research, leveraging the multi-modality and rigorous annotation of NKI-SpeechRT, will catalyze advances in deep learning-based speech biomarker extraction, robust severity prediction, and the clinical evaluation of speech disorders. The open access model and alignment with standard data formats further encourage widespread adoption and cross-domain benchmarking.