UltraSuite-Tal80: UTS Corpus for SSI Research

Updated 13 May 2026

UltraSuite-Tal80 is a multi-speaker dataset providing ultrasound tongue imaging and matched mel-spectrogram pairs for ultrasound-to-speech conversion research.
The corpus uses a controlled, speaker-dependent protocol with precise temporal alignment and defined training, validation, and test splits for benchmarking.
Experiments comparing a 2D-CNN baseline with Conformer and Conformer + bi-LSTM models show comparable objective metrics, while subjective listening tests rate the Conformer + bi-LSTM output as more natural.

The UltraSuite-Tal80 dataset is a multi-speaker corpus designed for silent speech interface (SSI) research, primarily targeting the task of ultrasound-to-speech (UTS) conversion. It provides synchronized pairs of ultrasound tongue imaging frames and corresponding audio-derived mel-spectrograms, enabling the development and benchmarking of deep learning models that infer audible speech from non-acoustic articulatory data. UltraSuite-Tal80 has been used to evaluate sequence-to-sequence architectures for UTS through both objective and perceptual comparisons (Ibrahimov et al., 4 Jun 2025).

1. Corpus Structure and Acquisition Protocol

UltraSuite-Tal80 comprises speaker-specific data for four individuals, denoted as “01fi,” “02fe,” “03mn,” and “04me.” For each speaker, the corpus includes a series of utterances recorded in a controlled environment. Ultrasound tongue imaging is rendered in scanline format, with each frame having an original shape of 64 scanlines by 842 beam-columns. Frames are temporally synchronized with speech: each frame represents approximately 12 ms of articulatory movement. For computational processing, frames are resized via bicubic interpolation to 64 × 128 and linearly normalized to the [-1, 1] interval (Ibrahimov et al., 4 Jun 2025).
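
The preprocessing described above can be sketched with NumPy alone. This is a minimal illustration rather than the authors' pipeline: the function names, the Keys cubic kernel with a = -0.5, the edge replication, and the per-frame min-max scaling are all assumptions, since the source only states bicubic resizing to 64 × 128 followed by linear normalization to [-1, 1].

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel (an assumed variant of 'bicubic')."""
    x = np.abs(x)
    near = (a + 2) * x**3 - (a + 3) * x**2 + 1
    far = a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return np.where(x <= 1, near, np.where(x < 2, far, 0.0))

def resize_weights(n_in, n_out):
    """Return an (n_out, n_in) matrix that resamples one axis."""
    centers = (np.arange(n_out) + 0.5) * (n_in / n_out) - 0.5
    base = np.floor(centers).astype(int)
    weights = np.zeros((n_out, n_in))
    rows = np.arange(n_out)
    for offset in range(-1, 3):  # four neighbouring taps per output sample
        idx = base + offset
        w = cubic_kernel(centers - idx)
        # Out-of-range taps are folded onto the nearest edge sample.
        np.add.at(weights, (rows, np.clip(idx, 0, n_in - 1)), w)
    return weights

def preprocess_frame(frame, out_cols=128):
    """Resize a (64, 842) frame to (64, out_cols) and scale it to [-1, 1]."""
    frame = np.asarray(frame, dtype=np.float64)
    resized = frame @ resize_weights(frame.shape[1], out_cols).T
    lo, hi = resized.min(), resized.max()
    if hi == lo:  # constant frame: avoid dividing by zero
        return np.zeros_like(resized)
    return 2.0 * (resized - lo) / (hi - lo) - 1.0

# Random stand-in for one raw ultrasound frame (not real corpus data).
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(64, 842))
frame = preprocess_frame(raw)
print(frame.shape, frame.min(), frame.max())
```

Only the beam-column axis is resampled here because the scanline count is already 64. The sketch applies no low-pass filtering before downsampling, which a production resizer would normally add.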

The corresponding target audio is represented as 80-dimensional mel-spectrogram vectors, computed with a 12 ms hop to maintain alignment with ultrasound frames. The dataset organization supports speaker-dependent machine learning protocols, as all sequence mappings and evaluation splits are defined per individual speaker.
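
As a rough illustration of the frame-level alignment, the sketch below converts the 12 ms hop into audio samples and trims an ultrasound sequence and a mel sequence to a common length. The function names are hypothetical, the 16 kHz rate is only an example value, and truncating to the shorter sequence is an assumed convention that the source does not describe.

```python
def hop_in_samples(sample_rate_hz, hop_ms=12.0):
    """Number of audio samples spanned by one hop of hop_ms milliseconds."""
    return round(sample_rate_hz * hop_ms / 1000.0)

def pair_frames(ultrasound_frames, mel_frames):
    """Truncate both sequences so that index t refers to the same instant."""
    n = min(len(ultrasound_frames), len(mel_frames))
    return ultrasound_frames[:n], mel_frames[:n]

hop = hop_in_samples(16_000)  # example audio rate, assumed for illustration
# Integer lists stand in for real frame sequences of slightly different length.
us, mel = pair_frames(list(range(250)), list(range(252)))
print(hop, len(us), len(mel))
```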

2. Data Representation and Preparation

Input tensors are constructed as (batch_size, T, 64, 128), where T denotes the temporal dimension. Each ultrasound frame is reshaped into a flat vector $x_t \in \mathbb{R}^{8192}$, resulting in an input sequence $(x_1, \ldots, x_T)$ for sequence-to-sequence models. Mel-spectrograms are extracted using established signal processing pipelines and serve as targets for every input time step.
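
The shapes involved can be checked with a few lines of NumPy. The batch size and sequence length below are arbitrary illustrative values, and the zero-filled arrays are placeholders rather than corpus data.

```python
import numpy as np

batch_size, T = 2, 50  # illustrative sizes only
frames = np.zeros((batch_size, T, 64, 128), dtype=np.float32)

# Flatten each 64 x 128 frame into one 8192-dimensional vector x_t.
flat = frames.reshape(batch_size, T, -1)

# One 80-dimensional mel vector is the target for every input time step.
targets = np.zeros((batch_size, T, 80), dtype=np.float32)

print(flat.shape, targets.shape)
```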

For training and evaluation, the dataset is split as follows for each speaker:

Test set: 10 shared read sentences (segments 005_xaud–014_xaud)
Development and training: Of the remaining utterances, 90% are used for training, 10% for validation
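
A per-speaker split along these lines might look as follows. This is a sketch under stated assumptions: the function name, the seeded random shuffle, and the naming of the non-test utterances are hypothetical; only the test segment names 005_xaud to 014_xaud and the 90/10 ratio come from the description above.

```python
import random

def split_speaker(utterance_ids, seed=0):
    """Return (train, validation, test) lists for one speaker."""
    test_ids = {f"{i:03d}_xaud" for i in range(5, 15)}  # 005_xaud .. 014_xaud
    test = [u for u in utterance_ids if u in test_ids]
    rest = [u for u in utterance_ids if u not in test_ids]
    random.Random(seed).shuffle(rest)  # shuffling is an assumption
    n_train = int(round(0.9 * len(rest)))
    return rest[:n_train], rest[n_train:], test

# Hypothetical IDs: only the *_xaud test names follow the source.
ids = [f"{i:03d}_xaud" for i in range(5, 15)]
ids += [f"{i:03d}_aud" for i in range(15, 115)]

train, val, test = split_speaker(ids)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the training and validation partitions reproducible across runs, which matters when models are compared per speaker.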

Speaker utterance distributions are as follows:

01fi: 204 utterances
02fe: 141 utterances
03mn: 193 utterances
04me: 190 utterances

3. Benchmarking Experimental Protocols

UltraSuite-Tal80 has been used to evaluate novel deep neural network architectures for UTS, notably Conformer-based models compared to standard 2D-CNN baselines. Experiments employ mean squared error (MSE) loss between predicted and reference mel-spectrogram frames. Additional objective evaluation employs mel-cepstral distortion (MCD), defined as

$\mathrm{MCD\ [dB]} = \frac{10}{\ln 10} \sqrt{2 \sum_{n=1}^K (mc_n - \hat{mc}_n)^2},$

where $mc_n$ and $\hat{mc}_n$ denote the reference and predicted mel cepstral coefficients, respectively, and $K$ is the number of coefficients compared.
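
Both objective measures are short to compute. The sketch below follows the MCD formula above for each frame and then averages over frames; the averaging, the function names, and the (T, K) input layout are assumptions, and the extraction of mel cepstral coefficients from audio is not shown.

```python
import numpy as np

def mse(pred, ref):
    """Mean squared error between predicted and reference mel frames."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.mean((pred - ref) ** 2))

def mcd_db(mc_ref, mc_pred):
    """Mel-cepstral distortion in dB for inputs of shape (T, K).

    The per-frame value follows the formula in the text; taking the mean
    over the T frames is an assumed aggregation.
    """
    diff = np.asarray(mc_ref, dtype=float) - np.asarray(mc_pred, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

# Toy example: one frame, K = 3, a single coefficient off by 1.
ref = np.array([[0.0, 0.0, 0.0]])
pred = np.array([[1.0, 0.0, 0.0]])
print(mse(pred, ref), mcd_db(ref, pred))
```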

4. Objective and Subjective Evaluation

Experiments conducted with UltraSuite-Tal80 report both objective and subjective results (Ibrahimov et al., 4 Jun 2025). Objective metrics aggregate MSE and MCD values per speaker. A Mann–Whitney U test ($\alpha = 0.05$) is used to evaluate statistical significance of differences among architectures. Results reveal no systematic objective benefit of Conformer-based models over baseline 2D-CNNs; differences are speaker-dependent.
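
To make the significance test concrete, here is a dependency-free sketch of a two-sided Mann–Whitney U test. It uses the normal approximation without continuity or tie correction, which is a simplification that may differ from the procedure used in the cited experiments, and the per-utterance scores are invented for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u1, p

# Invented per-utterance scores for two hypothetical systems.
system_a = [0.41, 0.45, 0.47, 0.52, 0.44, 0.49, 0.46, 0.50, 0.43, 0.48]
system_b = [0.42, 0.46, 0.45, 0.53, 0.47, 0.48, 0.44, 0.51, 0.40, 0.49]
u, p = mann_whitney_u(system_a, system_b)
print(u, p, "significant" if p < 0.05 else "not significant")
```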

A representative summary (per speaker, on the test set):

Neural Network	MSE: 01fi	MSE: 02fe	MSE: 03mn	MSE: 04me
Baseline (2D-CNN)	0.464	0.623	0.395	0.484
Conformer Base	0.511	0.618	0.462	0.524
Conformer + bi-LSTM	0.482	0.581	0.378	0.449
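
The speaker dependence can be read off the table directly. The short script below uses only the MSE values listed above to find the lowest-MSE model for each speaker; the unweighted mean across speakers is a descriptive convenience added here, not a figure reported in the source.

```python
mse_table = {
    "Baseline (2D-CNN)":   {"01fi": 0.464, "02fe": 0.623, "03mn": 0.395, "04me": 0.484},
    "Conformer Base":      {"01fi": 0.511, "02fe": 0.618, "03mn": 0.462, "04me": 0.524},
    "Conformer + bi-LSTM": {"01fi": 0.482, "02fe": 0.581, "03mn": 0.378, "04me": 0.449},
}
speakers = ["01fi", "02fe", "03mn", "04me"]

# Model with the lowest test-set MSE for each speaker.
best = {s: min(mse_table, key=lambda m: mse_table[m][s]) for s in speakers}

# Unweighted mean over the four speakers (illustrative summary only).
mean_mse = {m: sum(v.values()) / len(v) for m, v in mse_table.items()}

for s in speakers:
    print(s, best[s])
for m, value in mean_mse.items():
    print(f"{m}: {value:.4f}")
```

The baseline has the lowest MSE for 01fi, while Conformer + bi-LSTM has the lowest MSE for the other three speakers, so no single architecture wins for every speaker.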

In contrast, subjective evaluation via MUSHRA listening tests (with 27 listeners, 0–100 naturalness scale) finds that the Conformer + bi-LSTM model yields significantly more natural speech than both the baseline and the Conformer Base models (average p-values < 0.05). These findings highlight a discrepancy between objective audio similarity metrics and listener-perceived naturalness.

5. Model Architectures and Signal Synthesis

The dataset enables the training of advanced sequence modeling architectures. In recent experiments, two principal models are employed:

Conformer Base: Linear projection of input vectors (8192 to 256), additive positional encoding, 4 stacked Conformer blocks (d=256, h=32, kernel=31, expansion e=3), LayerNorm, and output projection (256 to 80). Total parameters: 2.66M. Training time: ≈30% of the baseline.
Conformer + bi-LSTM: As above, with two bidirectional LSTM layers (hidden size 128 per direction) after LayerNorm. Output is projected to 80 dimensions. Total parameters: 5.35M. Training time: ≈80% of the baseline.
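
The input and output stages shared by both models can be traced as a shape walkthrough in NumPy. This is not a trainable model: the Conformer blocks, LayerNorm, and bi-LSTM layers are deliberately left out, the weights are random placeholders, and the sinusoidal form of the positional encoding is an assumption, since the description only says the encoding is additive.

```python
import numpy as np

def sinusoidal_encoding(T, d_model):
    """Standard sinusoidal positional encoding of shape (T, d_model)."""
    pos = np.arange(T)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
T, d_in, d_model, n_mels = 50, 8192, 256, 80

# Random placeholder weights; a real model would learn these.
W_in = rng.normal(0.0, 0.01, size=(d_in, d_model))
b_in = np.zeros(d_model)
W_out = rng.normal(0.0, 0.01, size=(d_model, n_mels))
b_out = np.zeros(n_mels)

x = rng.normal(size=(T, d_in))                           # flattened frames
h = x @ W_in + b_in + sinusoidal_encoding(T, d_model)    # (T, 256)
# The Conformer blocks, LayerNorm, and optional bi-LSTM layers would act on h here.
y = h @ W_out + b_out                                    # (T, 80) mel frames

print(h.shape, y.shape)
```

If the two LSTM directions are concatenated, as is usual for bidirectional layers, 128 hidden units per direction again give 256 features per frame, so the 256-to-80 output projection keeps the same shape in both models.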

Both models generate framewise mel-spectrograms, which are synthesized into speech waveforms via a pre-trained HiFi-GAN vocoder (VCTK_V1 variant). The vocoder uses a stack of four transposed convolutional layers for upsampling and multiple residual blocks per stage. The vocoder is used as a fixed, pre-trained component: only its generator runs at inference, since HiFi-GAN's discriminators are needed only when the vocoder itself is trained. The output sampling rate is 16 kHz.

6. Limitations and Challenges

UltraSuite-Tal80 is primarily a speaker-dependent corpus. Trained models are specific to individual speakers; generalization to unseen speakers remains an open issue. Notably, the fixed test set (the same ten read sentences for every speaker) supports direct comparison but does not capture broader linguistic or demographic variability.

A consistent challenge involves the inadequacy of objective metrics (MSE, MCD) for capturing perceptual quality, particularly regarding formant structure and silent segment modeling. The dataset’s structure, with perfect framewise alignment, may not reflect real-world temporal variability in articulatory-to-acoustic correspondence. Silent-segment generation remains problematic, as highlighted in experimental visualizations.

7. Impact and Future Directions

UltraSuite-Tal80 serves as a benchmark for deep articulatory-to-acoustic mapping in SSI and UTS research. Proposed directions for future work using the UltraSuite-Tal80 setup include:

Multi-speaker and multi-style training, potentially via speaker embeddings
Perceptually informed or adversarial loss functions to bridge the objective–subjective quality gap
Integration of additional articulatory modalities (e.g., lip video, surface EMG)
Lightweight and real-time inference methodologies (e.g., model quantization, pruning)
Improved modeling of silent-segment mapping and explicit voicing control

Such developments are anticipated to enhance the generalizability, efficiency, and practical applicability of SSI architectures trained on UltraSuite-Tal80, informing both algorithmic innovation and future corpus design (Ibrahimov et al., 4 Jun 2025).


References

Ibrahimov et al. (2025). Conformer-based Ultrasound-to-Speech Conversion.
