Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning (1910.12729v2)

Published 28 Oct 2019 in cs.CL, cs.SD, and eess.AS

Abstract: In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to phoneme sequences of speech utterances. This is achieved by proper temporal segmentation to make the representations phoneme-synchronized, and proper phonetic clustering to have total number of distinct representations close to the number of phonemes. Mapping between the distinct representations and phonemes is learned from a small amount of annotated paired data. Preliminary experiments on LJSpeech demonstrated the learned representations for vowels have relative locations in latent space in good parallel to that shown in the IPA vowel chart defined by linguistics experts. With less than 20 minutes of annotated speech, our method outperformed existing methods on phoneme recognition and is able to synthesize intelligible speech that beats our baseline model.

Citations (50)

Summary

  • The paper proposes SeqRQ-AE, a framework for unsupervised speech recognition and synthesis using quantized representations to improve interpretability.
  • SeqRQ-AE processes unpaired audio using a sequential autoencoder and vector quantization, mapping representations to phonemes with minimal paired data.
  • Experiments show SeqRQ-AE improves phoneme recognition and text-to-speech synthesis with limited data, learning representations aligned with phonetic features.

Unsupervised Speech Recognition and Synthesis Through Quantized Speech Representation Learning

The paper proposes Sequential Representation Quantization AutoEncoder (SeqRQ-AE), a framework aimed at advancing unsupervised speech recognition and synthesis by utilizing quantized speech representation learning. This approach addresses the challenges of temporal segmentation and phonetic clustering in speech signals, facilitating the conversion of continuous speech waveforms into phoneme sequences. By doing so, SeqRQ-AE improves the interpretability and usability of learned representations for speech tasks.

Contributions and Methodology

SeqRQ-AE is designed to work primarily on unpaired audio data, producing representations that are synchronized with phonemes. The framework includes several key components:

  1. Sequential AutoEncoder: An encoder network processes the input speech data, generating latent vector sequences which serve as the foundational representations of the speech signal.
  2. Vector Quantization and Temporal Segmentation: Leveraging techniques from the Vector Quantised Variational AutoEncoder (VQ-VAE), SeqRQ-AE quantizes the encoder's output into discrete phonetic units. The temporal segmentation process groups runs of consecutive, identical codewords to establish segment boundaries corresponding to individual phonemes.
  3. Quantized Representation Mapping: A small amount of paired data is used to map the quantized representations to human-defined phonemes. This mapping allows the learned representations to be interpreted in linguistic terms, aiding both speech recognition and synthesis (see the sketch after this list).
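
The three steps above can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the encoder output is replaced by random vectors, the codebook size and the `phoneme_map` lookup are stand-ins, and in SeqRQ-AE the codebook is learned jointly with the autoencoder while the mapping is derived from the small paired set.

```python
# Minimal NumPy sketch of quantization, temporal segmentation, and phoneme mapping.
# All names and sizes here are illustrative assumptions, not the paper's API.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder output: one latent vector per input frame.
latents = rng.normal(size=(120, 8))          # (frames, latent_dim)

# Codebook with one vector per discrete unit; the paper keeps the number
# of codewords close to the number of phonemes.
CODEBOOK = rng.normal(size=(40, 8))          # (num_codewords, latent_dim)

# 1) Vector quantization: replace each latent vector by its nearest codeword.
dists = np.linalg.norm(latents[:, None, :] - CODEBOOK[None, :, :], axis=-1)
codes = dists.argmin(axis=1)                 # (frames,) codeword index per frame

# 2) Temporal segmentation: merge runs of identical codewords so that each
#    resulting segment is intended to cover one phoneme-like unit.
boundaries = np.flatnonzero(np.diff(codes)) + 1
segments = [run[0] for run in np.split(codes, boundaries)]   # one code per segment

# 3) Codeword-to-phoneme mapping (here a hypothetical lookup table; the paper
#    learns this correspondence from a small amount of paired data).
phoneme_map = {i: f"ph{i}" for i in range(len(CODEBOOK))}
phoneme_seq = [phoneme_map[c] for c in segments]
print(phoneme_seq[:10])
```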

Experimental Results

The experiments demonstrate improvements in phoneme recognition and text-to-speech synthesis over baseline models. With less than 20 minutes of annotated speech data, SeqRQ-AE outperformed existing methods on phoneme recognition, achieving a larger reduction in phoneme error rate (PER). The learned vowel representations were shown to mirror the relative positions of vowels in the International Phonetic Alphabet (IPA) vowel chart, indicating meaningful clustering of phonetic features in the latent space.
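
For reference, PER is the edit (Levenshtein) distance between the predicted and reference phoneme sequences, normalized by the reference length. The snippet below is a generic implementation of that metric for illustration, not code from the paper.

```python
# Phoneme error rate: Levenshtein distance over phoneme sequences,
# normalized by the length of the reference sequence.
def phoneme_error_rate(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub / del / ins
    return d[len(ref)][len(hyp)] / len(ref)

print(phoneme_error_rate("h eh l ow".split(), "h ah l ow".split()))  # 0.25
```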

For text-to-speech synthesis, SeqRQ-AE exhibited enhanced robustness, particularly when generating intelligible speech from limited paired data. Mean Opinion Scores (MOS) reflected these improvements, with SeqRQ-AE surpassing Speech Chain models in generating more complete and natural speech outputs.

Implications and Future Directions

SeqRQ-AE's ability to learn representations that align with linguistic classifications while using minimal paired data has pragmatic implications. It suggests a path towards scalable unsupervised speech processing systems, potentially reducing the need for extensive annotated datasets.

The development of SeqRQ-AE brings theoretical insights into how deep learning models can effectively handle unsupervised tasks in speech processing. It provides evidence that meaningful phonetic representations can be extracted from audio signals without relying heavily on supervised data.

Future work could explore the integration of unpaired text data into the SeqRQ-AE framework, steering towards fully unsupervised models that encompass both speech recognition and synthesis. This advancement would significantly broaden the applicability of such models, fostering developments in human-machine interactions and multilingual speech processing systems.

In conclusion, this paper presents a notable advancement in unsupervised speech recognition and synthesis, offering a methodology that harnesses quantized speech representations for improved phoneme mapping and synthesis tasks. The proposed SeqRQ-AE framework underscores the potential for further innovations in the field of unsupervised learning within AI.
