- The paper introduces a masked autoencoder model that reconstructs mistracked articulatory data by masking up to three of eight articulators.
- The authors leverage time-dilated dense layers and Bi-GRUs, achieving high Pearson correlation scores and recovering 3.28 out of 3.4 hours of data.
- The study’s findings expand usable dataset sizes for improved speech synthesis and inversion systems without requiring speaker-specific tuning.
Masked Autoencoders Are Articulatory Learners: A Summary
The paper, "Masked Autoencoders Are Articulatory Learners," authored by Ahmed Adel Attia and Carol Y. Espy-Wilson, presents a compelling approach to enhancing the usability of articulatory datasets by addressing data corruption issues prevalent in recordings. Utilizing the University of Wisconsin X-Ray Microbeam (XRMB) dataset, which provides articulatory recordings synced with audio data, the authors focus on reconstructing mistracked articulatory data using deep learning techniques.
Methodology
The authors propose a novel approach using Masked Autoencoders (MAEs), inspired by recent advances in masked image modeling. By framing the recovery of mistracked articulatory data as a masked reconstruction task, they build a deep network of time-dilated dense layers and Bi-GRUs. The model is trained on variably masked inputs and thereby learns to reconstruct missing articulatory trajectories even when up to three of the eight articulators are masked concurrently.
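The paper does not include code, but the general training setup can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the layer sizes, the use of dilated 1-D convolutions to stand in for the paper's "time-dilated" layers, and the channel layout are all assumptions rather than the authors' exact architecture. What it shows is the core idea: whole articulators are randomly masked (up to three of eight), and the network is trained to reconstruct the complete trajectories.

```python
# Hypothetical sketch of a masked articulatory autoencoder (not the authors' code).
# Input shape: (batch, time, 16) -- 8 articulators x (x, y) coordinates (assumed layout).
import torch
import torch.nn as nn

NUM_ARTICULATORS = 8
FEATS_PER_ARTICULATOR = 2                     # x and y pellet coordinates
INPUT_DIM = NUM_ARTICULATORS * FEATS_PER_ARTICULATOR

class MaskedArticulatoryAutoencoder(nn.Module):
    def __init__(self, hidden=256, gru_hidden=128):
        super().__init__()
        # Dilated temporal convolutions approximate the paper's time-dilated layers
        # (an assumption about the exact mechanism).
        self.encoder = nn.Sequential(
            nn.Conv1d(INPUT_DIM, hidden, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        self.bigru = nn.GRU(hidden, gru_hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * gru_hidden, INPUT_DIM)

    def forward(self, x):
        # x: (batch, time, INPUT_DIM)
        h = self.encoder(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, hidden)
        h, _ = self.bigru(h)
        return self.decoder(h)                                 # reconstruct all channels

def mask_articulators(x, max_masked=3):
    """Zero out whole articulators (both coordinates) per utterance, simulating mistracking."""
    x = x.clone()
    for b in range(x.size(0)):
        k = torch.randint(1, max_masked + 1, (1,)).item()
        for a in torch.randperm(NUM_ARTICULATORS)[:k].tolist():
            x[b, :, a * FEATS_PER_ARTICULATOR:(a + 1) * FEATS_PER_ARTICULATOR] = 0.0
    return x

# One training step: loss over all channels, with the masked ones driving the signal.
model = MaskedArticulatoryAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.randn(4, 200, INPUT_DIM)          # stand-in for real XRMB frames
recon = model(mask_articulators(clean))
loss = nn.functional.mse_loss(recon, clean)
loss.backward()
optim.step()
```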
Dataset and Application
The XRMB dataset comprises recordings from 47 speakers, of which this paper uses 41, selected for the quality and completeness of their data. Each of the eight articulators is tracked by a pellet whose x and y coordinates are recorded over time. Notably, the approach does not require speaker-specific hyperparameter tuning, making it broadly applicable across the dataset. Reconstructing mistracked data matters because mistrackings render substantial amounts of otherwise valuable data unusable.
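As a concrete illustration of how mistracked segments might be located in such data, the sketch below builds a per-articulator "mistracked" flag from the pellet trajectories. The sentinel value used to mark mistracked samples and the 16-channel layout are assumptions about the data format, not details confirmed by the paper.

```python
# Hypothetical helper for locating mistracked articulators in an XRMB-style recording.
import numpy as np

SENTINEL = 1_000_000          # assumed marker value for mistracked samples
NUM_ARTICULATORS = 8

def mistracked_articulators(traj):
    """traj: (time, 16) array of 8 articulators x (x, y) coordinates.
    Returns a boolean array of shape (8,): True if that articulator is
    mistracked anywhere in the utterance."""
    per_channel = (np.abs(traj) >= SENTINEL).any(axis=0)          # (16,)
    return per_channel.reshape(NUM_ARTICULATORS, 2).any(axis=1)   # (8,)

def recoverable(traj, max_masked=3):
    """An utterance is a candidate for MAE reconstruction if no more than
    three articulators are mistracked."""
    return mistracked_articulators(traj).sum() <= max_masked
```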
Results and Performance
The MAE model demonstrated robust performance, retrieving 3.28 of the 3.4 hours of previously unusable articulatory recordings. Pearson Product-Moment Correlation scores between reconstructed and ground-truth pellet trajectories (PTs) confirmed the reliability of the reconstructions. Performance improved further when training used overlapping frames and when hyperparameters were optimized for specific masking levels.
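For reference, a per-channel Pearson correlation between reconstructed and ground-truth trajectories can be computed as below. This is an illustrative evaluation snippet, not the authors' scoring script; function and variable names are placeholders.

```python
# Per-channel Pearson correlation between reconstructed and ground-truth trajectories.
import numpy as np

def channel_correlations(recon, truth):
    """recon, truth: (time, channels) arrays; returns one Pearson r per channel."""
    return np.array([np.corrcoef(recon[:, c], truth[:, c])[0, 1]
                     for c in range(truth.shape[1])])

# Example usage: mean correlation over the coordinate channels of one utterance.
# r_mean = channel_correlations(reconstructed_traj, ground_truth_traj).mean()
```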
Challenges remain, primarily reconstructing data when multiple related articulators are mistracked at the same time. Excluding these less frequent cases still leaves a substantial amount of recovered data, significantly expanding the usable dataset.
Implications and Future Work
Practically, the reconstructed data has profound implications for developing more robust speech synthesis and inversion systems, potentially leading to improved tools in both academic and clinical settings. Theoretically, the paper sets a precedent for employing masked autoencoder architectures in domains beyond image processing.
Future research should focus on developing speaker-independent models and exploring inter-articulator dependencies more comprehensively. Such advancements could further improve the approach's versatility and effectiveness across various datasets and conditions. Moreover, data augmentation techniques and the inclusion of corrupted samples in training could enhance model robustness and applicability.
This paper contributes significantly to the field of speech technology by providing a method to recover extensive data previously deemed unusable, thus advancing the utility of articulatory datasets in research and application.