- The paper introduces a masked autoencoder model that reconstructs mistracked articulatory data by masking up to three of eight articulators.
- The authors leverage time-dilated dense layers and Bi-GRUs, achieving high Pearson correlation scores and recovering 3.28 out of 3.4 hours of data.
- The study’s findings expand usable dataset sizes for improved speech synthesis and inversion systems without requiring speaker-specific tuning.
Masked Autoencoders Are Articulatory Learners: A Summary
The paper, "Masked Autoencoders Are Articulatory Learners," authored by Ahmed Adel Attia and Carol Y. Espy-Wilson, presents a compelling approach to enhancing the usability of articulatory datasets by addressing data corruption issues prevalent in recordings. Utilizing the University of Wisconsin X-Ray Microbeam (XRMB) dataset, which provides articulatory recordings synced with audio data, the authors focus on reconstructing mistracked articulatory data using deep learning techniques.
Methodology
The authors propose a novel approach using Masked Autoencoders (MAEs), inspired by recent advances in masked image modeling. By framing the recovery of mistracked articulatory data as a masked reconstruction task, they build a deep network of time-dilated dense layers and Bi-GRUs. The model is trained on variably masked inputs and thereby learns to reconstruct missing articulatory trajectories even when up to three of the eight articulators are masked concurrently.
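The paper does not include code, but the general training setup can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the layer sizes, the use of dilated 1-D convolutions to stand in for the paper's "time-dilated" layers, and the channel layout are all assumptions rather than the authors' exact architecture. What it shows is the core idea: whole articulators are randomly masked (up to three of eight), and the network is trained to reconstruct the complete trajectories.

```python
# Hypothetical sketch of a masked articulatory autoencoder (not the authors' code).
# Input shape: (batch, time, 16) -- 8 articulators x (x, y) coordinates (assumed layout).
import torch
import torch.nn as nn

NUM_ARTICULATORS = 8
FEATS_PER_ARTICULATOR = 2                     # x and y pellet coordinates
INPUT_DIM = NUM_ARTICULATORS * FEATS_PER_ARTICULATOR

class MaskedArticulatoryAutoencoder(nn.Module):
    def __init__(self, hidden=256, gru_hidden=128):
        super().__init__()
        # Dilated temporal convolutions approximate the paper's time-dilated layers
        # (an assumption about the exact mechanism).
        self.encoder = nn.Sequential(
            nn.Conv1d(INPUT_DIM, hidden, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        self.bigru = nn.GRU(hidden, gru_hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * gru_hidden, INPUT_DIM)

    def forward(self, x):
        # x: (batch, time, INPUT_DIM)
        h = self.encoder(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, hidden)
        h, _ = self.bigru(h)
        return self.decoder(h)                                 # reconstruct all channels

def mask_articulators(x, max_masked=3):
    """Zero out whole articulators (both coordinates) per utterance, simulating mistracking."""
    x = x.clone()
    for b in range(x.size(0)):
        k = torch.randint(1, max_masked + 1, (1,)).item()
        for a in torch.randperm(NUM_ARTICULATORS)[:k].tolist():
            x[b, :, a * FEATS_PER_ARTICULATOR:(a + 1) * FEATS_PER_ARTICULATOR] = 0.0
    return x

# One training step: loss over all channels, with the masked ones driving the signal.
model = MaskedArticulatoryAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.randn(4, 200, INPUT_DIM)          # stand-in for real XRMB frames
recon = model(mask_articulators(clean))
loss = nn.functional.mse_loss(recon, clean)
loss.backward()
optim.step()
```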
Dataset and Application
The XRMB dataset comprises recordings from 47 speakers, of which this paper uses 41, selected for the quality and completeness of their data. Each of the eight articulators is tracked by a pellet whose x and y coordinates are recorded over time. Notably, the approach does not require speaker-specific hyperparameter tuning, making it broadly applicable across the dataset. Reconstructing mistracked data matters because mistrackings render substantial amounts of otherwise valuable data unusable.
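As a concrete illustration of how mistracked segments might be located in such data, the sketch below builds a per-articulator "mistracked" flag from the pellet trajectories. The sentinel value used to mark mistracked samples and the 16-channel layout are assumptions about the data format, not details confirmed by the paper.

```python
# Hypothetical helper for locating mistracked articulators in an XRMB-style recording.
import numpy as np

SENTINEL = 1_000_000          # assumed marker value for mistracked samples
NUM_ARTICULATORS = 8

def mistracked_articulators(traj):
    """traj: (time, 16) array of 8 articulators x (x, y) coordinates.
    Returns a boolean array of shape (8,): True if that articulator is
    mistracked anywhere in the utterance."""
    per_channel = (np.abs(traj) >= SENTINEL).any(axis=0)          # (16,)
    return per_channel.reshape(NUM_ARTICULATORS, 2).any(axis=1)   # (8,)

def recoverable(traj, max_masked=3):
    """An utterance is a candidate for MAE reconstruction if no more than
    three articulators are mistracked."""
    return mistracked_articulators(traj).sum() <= max_masked
```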
Results and Performance
The MAE model demonstrated robust performance, retrieving 3.28 of the 3.4 hours of previously unusable articulatory recordings. Pearson Product-Moment Correlation scores between reconstructed and ground-truth pellet trajectories (PTs) confirmed the reliability of the reconstructions. Performance improved further when training used overlapping frames and when hyperparameters were optimized for specific masking levels.
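For reference, a per-channel Pearson correlation between reconstructed and ground-truth trajectories can be computed as below. This is an illustrative evaluation snippet, not the authors' scoring script; function and variable names are placeholders.

```python
# Per-channel Pearson correlation between reconstructed and ground-truth trajectories.
import numpy as np

def channel_correlations(recon, truth):
    """recon, truth: (time, channels) arrays; returns one Pearson r per channel."""
    return np.array([np.corrcoef(recon[:, c], truth[:, c])[0, 1]
                     for c in range(truth.shape[1])])

# Example usage: mean correlation over the coordinate channels of one utterance.
# r_mean = channel_correlations(reconstructed_traj, ground_truth_traj).mean()
```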
Challenges remain, primarily reconstructing data when multiple related articulators are mistracked at the same time. Excluding these less frequent cases still leaves a substantial amount of recovered data, significantly expanding the usable dataset.
Implications and Future Work
Practically, the reconstructed data has profound implications for developing more robust speech synthesis and inversion systems, potentially leading to improved tools in both academic and clinical settings. Theoretically, the paper sets a precedent for employing masked autoencoder architectures in domains beyond image processing.
Future research should focus on developing speaker-independent models and exploring inter-articulator dependencies more comprehensively. Such advancements could further improve the approach's versatility and effectiveness across various datasets and conditions. Moreover, data augmentation techniques and the inclusion of corrupted samples in training could enhance model robustness and applicability.
This paper contributes significantly to the field of speech technology by providing a method to recover extensive data previously deemed unusable, thus advancing the utility of articulatory datasets in research and application.