USC Long Single-Speaker (LSS) Dataset
- The USC LSS dataset is a multimodal speech-production resource featuring approximately one hour of synchronized rtMRI video and high-fidelity audio of continuous speech.
- It provides detailed, processed representations including cropped vocal tract images, ROI time series, and time-aligned phonemic annotations that support diverse speech analysis.
- The dataset underpins benchmark tasks such as articulatory synthesis and phoneme recognition, offering reproducible baseline results for advanced speech technology research.
The USC Long Single-Speaker (LSS) dataset is a multimodal speech-production resource featuring approximately one hour of synchronized real-time magnetic resonance imaging (rtMRI) video and high-fidelity audio of continuous speech by a single native speaker of American English. It compiles both raw and processed representations suitable for articulatory and acoustic research, encompassing video cropped to the vocal tract, restored and denoised audio, detailed regions-of-interest (ROI) time series, and sentence-level splits with linguistically informed alignment. This dataset is unique in offering extended, continuous speech material in both read and spontaneous forms, with in-depth derived annotations, supporting a range of benchmark tasks and downstream investigations in speech science and technology.
1. Data Composition and Structure
The core data consist of simultaneous rtMRI and audio recordings acquired during speech production. The rtMRI delivers dynamic imaging of the speaker’s vocal tract, capturing the fine temporal evolution of articulatory gestures. Audio is collected under MRI environmental constraints with subsequent denoising or restoration to minimize scanner noise while preserving speech intelligibility.
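As a rough illustration of the kind of scanner-noise suppression involved, the sketch below applies classical spectral subtraction with a noise profile estimated from an assumed-silent leading segment. It is an illustrative stand-in, not the dataset's actual restoration pipeline; the STFT settings, sample rate handling, and location of the noise segment are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, sr, noise_seconds=0.5, floor=0.05, nperseg=1024):
    """Suppress stationary scanner noise by subtracting an average noise
    spectrum estimated from an assumed-silent leading segment."""
    _, _, Z = stft(noisy, fs=sr, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)

    # Average noise magnitude over the first `noise_seconds` of the file
    hop = nperseg // 2
    noise_frames = max(1, int(noise_seconds * sr / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate, clamping to a spectral floor to limit
    # the "musical noise" artifacts typical of hard subtraction
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    _, clean = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=nperseg)
    return clean
```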
Key derived representations include:
- Cropped MRI video: Post-processing removes extraneous regions, focusing exclusively on the articulatory zones (lips, tongue tip, tongue body, velum, larynx).
- ROI time series: Quantitative signals are extracted from anatomically significant regions, facilitating model input of high-dimensional articulatory features.
- Sentence-level splits: Continuous streams are segmented into individual sentences, each accompanied by time-aligned phonemic labels derived from forced alignment and partial manual verification.
- Restored and denoised audio: High-fidelity representations intended for both perceptual and automatic speech processing tasks.
Alignment information includes phoneme-level annotation, essential for benchmarking recognition and synthesis tasks, and constructed through verified forced alignment to ensure temporal accuracy between modalities.
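As a minimal sketch of how the derived representations might be consumed together, the snippet below pairs an ROI time series with time-aligned phoneme labels for one sentence. The file names, array layout, label format, and frame rate are assumptions for illustration, not the dataset's documented schema.

```python
import numpy as np

# Hypothetical per-sentence files (names and formats are assumptions):
#   sent_0001_roi.npy    -> float array of shape (num_frames, num_rois)
#   sent_0001_phones.tsv -> lines of "start_sec<TAB>end_sec<TAB>phoneme"
FPS = 83.0  # assumed rtMRI frame rate; check the dataset documentation

roi = np.load("sent_0001_roi.npy")  # articulatory ROI trajectories

phones = []
with open("sent_0001_phones.tsv") as fh:
    for line in fh:
        start, end, label = line.rstrip("\n").split("\t")
        phones.append((float(start), float(end), label))

# Map each phoneme interval onto the MRI frames it spans
for start, end, label in phones:
    first, last = int(start * FPS), int(end * FPS)
    segment = roi[first:last + 1]  # ROI frames covering this phoneme
    print(label, segment.shape)
```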
2. Distinctive Dataset Characteristics
The LSS dataset is distinguished by its duration, comprising nearly one hour of uninterrupted speech data, surpassing prior publicly available single-speaker rtMRI corpora in length. The inclusion of both read and spontaneous/conversational speech broadens its coverage, accommodating analyses of controlled versus naturalistic articulatory patterns.
Preprocessing protocols are meticulously documented: MRI video is cropped to retain only speech-relevant anatomy; audio streams undergo denoising via contemporary restoration models to counteract the inherent MRI scanner noise. Sentence-based segmentation and annotation further facilitate targeted linguistic analyses.
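The sentence-level segmentation can be pictured as a cut of the two synchronized streams at aligned boundaries, as in the sketch below; the function signature and the assumption that both streams share one time axis are illustrative, not the dataset's released tooling.

```python
import numpy as np

def split_by_sentences(audio, sr, frames, fps, boundaries):
    """Cut synchronized audio and rtMRI streams into sentence-level pairs.

    audio:      1-D waveform array sampled at `sr` Hz
    frames:     rtMRI video as an array of shape (num_frames, H, W) at `fps`
    boundaries: iterable of (start_sec, end_sec) sentence boundaries
    """
    pairs = []
    for start, end in boundaries:
        wav = audio[int(start * sr):int(end * sr)]
        vid = frames[int(start * fps):int(end * fps)]
        pairs.append((wav, vid))
    return pairs

# Toy usage with synthetic data and made-up boundaries
audio = np.zeros(16000 * 10)                     # 10 s at 16 kHz (assumed)
frames = np.zeros((830, 84, 84))                 # 10 s at 83 fps (assumed)
sentences = split_by_sentences(audio, 16000, frames, 83,
                               [(0.5, 3.2), (3.4, 9.8)])
print(len(sentences))                            # 2
```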
This design supports investigations of long-form speech, enabling study of within-speaker variability, temporal coarticulation, and dynamic changes across speech styles. A plausible implication is enhanced suitability for probing phenomena that require continuous or variable articulatory context.
3. Benchmark Tasks and Applications
The dataset is accompanied by baseline benchmark results for two principal task domains:
- Articulatory synthesis: Leveraging paired MRI and audio, models are trained to map articulatory movements to intelligible speech. Baseline results are provided using neural vocoder frameworks (e.g., a variant of HiFi-GAN), with hyperparameters and upsampling rates specified in the original publication; a minimal sketch of the mapping follows this list.
- Phoneme recognition: Benchmarks are reported for models trained on unimodal (audio-only or video-only) and multimodal (combined acoustic and articulatory) inputs, implemented with conformer-based architectures. Phoneme error rates (PER) serve as quantitative baselines for comparative evaluation.
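As referenced in the articulatory-synthesis item above, the block below is a drastically simplified, hedged sketch of the articulatory-to-waveform mapping: a stack of transposed convolutions upsamples frame-rate ROI features to audio-rate samples, in the spirit of a HiFi-GAN-style generator but omitting its residual blocks, multi-receptive-field fusion, and adversarial training. All dimensions and upsampling rates are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class TinyArtToWav(nn.Module):
    """Toy articulatory-to-waveform generator: transposed convolutions
    upsample frame-rate ROI features to audio-rate samples."""

    def __init__(self, num_rois=8, channels=128, upsample_rates=(8, 8, 4)):
        super().__init__()
        layers = [nn.Conv1d(num_rois, channels, kernel_size=7, padding=3)]
        ch = channels
        for r in upsample_rates:  # total upsampling factor: 8 * 8 * 4 = 256
            layers += [
                nn.LeakyReLU(0.1),
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                   stride=r, padding=r // 2),
            ]
            ch //= 2
        layers += [nn.LeakyReLU(0.1),
                   nn.Conv1d(ch, 1, kernel_size=7, padding=3),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, roi):        # roi: (batch, num_rois, num_frames)
        return self.net(roi)       # (batch, 1, num_frames * 256)

# 8 ROI channels over 100 articulatory frames -> 25,600 waveform samples
wav = TinyArtToWav()(torch.randn(1, 8, 100))
print(wav.shape)                   # torch.Size([1, 1, 25600])
```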
The LSS offers an ideal test bed for:
- Articulatory inversion: model development for mapping acoustic signals to estimated articulatory gestures.
- Studies of coarticulation and intra-speaker variability: analysis of subtle, temporally distributed changes not accessible in shorter datasets.
- Investigation of region-of-interest trajectories in speech dynamics.
Performance metrics and implementation details are documented to ensure experimental reproducibility and robust benchmarking.
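Since phoneme error rate is the headline metric for the recognition baselines, a self-contained reference implementation is sketched below; the exact label set and scoring conventions behind the published numbers should be taken from the original publication.

```python
def phoneme_error_rate(reference, hypothesis):
    """PER = (substitutions + insertions + deletions) / len(reference),
    computed with standard Levenshtein alignment over phoneme symbols."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over four reference phonemes -> PER = 0.25
print(phoneme_error_rate("HH AH L OW".split(), "HH EH L OW".split()))
```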
4. Imaging and Signal Processing Technicalities
Real-time MRI captures detailed articulatory signals at high temporal resolution; after acquisition, images are cropped to speech-relevant anatomy. Audio restoration utilizes state-of-the-art algorithms specifically designed to mitigate MRI suite noise artifacts.
The articulatory synthesis framework adapts neural vocoder architectures (e.g., HiFi-GAN) for direct articulatory-to-speech mapping, with hyperparameters (such as learning rate and upsampling configuration) specified for replicability. For phoneme recognition, conformer models are parameterized with detailed input feature dimensionality and training schedules, forming a replicable, robust baseline.
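A hedged sketch of a conformer-based recognizer over fused acoustic-articulatory frame features is given below; it uses torchaudio's Conformer implementation with a CTC-style output head as a stand-in for the baseline architecture, and the layer counts, feature dimensionality, and phoneme inventory size are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class PhonemeRecognizer(nn.Module):
    """Minimal CTC-style phoneme recognizer over multimodal frame features
    (e.g., log-mel energies concatenated with ROI trajectories)."""

    def __init__(self, input_dim=96, num_phonemes=40):
        super().__init__()
        self.encoder = torchaudio.models.Conformer(
            input_dim=input_dim, num_heads=4, ffn_dim=256,
            num_layers=4, depthwise_conv_kernel_size=31, dropout=0.1)
        self.head = nn.Linear(input_dim, num_phonemes + 1)  # +1 for CTC blank

    def forward(self, feats, lengths):  # feats: (batch, frames, input_dim)
        enc, enc_lengths = self.encoder(feats, lengths)
        return self.head(enc).log_softmax(dim=-1), enc_lengths

model = PhonemeRecognizer()
feats = torch.randn(2, 120, 96)         # assumed 96-dim fused features
lengths = torch.tensor([120, 100])
log_probs, out_lengths = model(feats, lengths)
print(log_probs.shape)                  # torch.Size([2, 120, 41])
```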
Technical specifications for imaging resolution, audio sampling, and synchronization protocols are made available for downstream system optimization, though specific numeric parameters should be taken directly from the original publication.
5. Research Prospects and Methodological Extensions
The documented benchmarks serve as foundational baselines, with current methods yielding "out-of-the-box" performance that leaves clear room for improvement. The authors suggest several directions for future exploration:
- Multimodal modeling: Enhanced fusion of acoustic and articulatory modalities, potentially via more sophisticated attention mechanisms or graph-based integration.
- Articulatory inversion: Research into mapping from the acoustic domain back to articulatory gestures, particularly in iterative or cascaded model pipelines (a minimal sketch follows this list).
- Variability and coarticulation analyses: Using long-form data to examine intra-speaker adaptation, stylistic fluctuations, and extended coarticulatory sequences.
- Application synthesis: Integration into advanced speech therapy, brain–computer interface development, and robust speech recognition systems that are sensitive to underlying physical articulatory processes.
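For the articulatory-inversion direction flagged above, a minimal sketch is a sequence model regressing frame-synchronous ROI trajectories from acoustic features; the bidirectional GRU, feature dimensions, and MSE objective below are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn as nn

class InversionGRU(nn.Module):
    """Acoustic-to-articulatory inversion sketch: a bidirectional GRU
    regresses ROI trajectories from frame-level acoustic features."""

    def __init__(self, acoustic_dim=80, num_rois=8, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(acoustic_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_rois)

    def forward(self, acoustics):       # (batch, frames, acoustic_dim)
        hidden, _ = self.rnn(acoustics)
        return self.proj(hidden)        # (batch, frames, num_rois)

# Single training step on synthetic tensors, regressing ROI targets
model, loss_fn = InversionGRU(), nn.MSELoss()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
acoustics, roi_target = torch.randn(4, 200, 80), torch.randn(4, 200, 8)
loss = loss_fn(model(acoustics), roi_target)
loss.backward()
optim.step()
```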
A plausible implication is that the LSS dataset, by virtue of its length and degree of annotation, facilitates advances in speech technology that account for physical speech production mechanisms, filling a gap in available research corpora for modeling articulatory–acoustic relationships.
6. Significance in Speech Science and Technology
The USC LSS dataset serves as a unique resource bridging gaps in multimodal speech production research. Its nearly hour-long pairing of MRI and audio data from a single native American English speaker, enriched by detailed derived representations and methodologically sound preprocessing, provides a benchmark for articulatory synthesis and phoneme recognition tasks.
The dataset’s documentation ensures reproducibility and comparative analysis, offering a standardized platform for model evaluation. It establishes a reference for future work in multimodal integration, articulatory inversion, and fine-grained analysis of within-speaker variability.
This suggests its role as a cornerstone for developing physically grounded, data-driven speech models, suitable for both foundational research and applied engineering in speech perception, production, and human–machine interaction.