Unsupervised Speech Representation Learning Using WaveNet Autoencoders
This paper explores unsupervised representation learning for speech, focusing on autoencoding neural networks applied to speech waveforms. The central objective is to learn a representation that captures high-level semantic content, such as phoneme identity, while remaining invariant to confounding low-level signal features such as pitch contour and background noise.
Methodology and Models
The research evaluates different constraints on the latent representation by comparing three autoencoder variants: a simple dimensionality-reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). Each variant is paired with a WaveNet decoder, known for producing high-quality audio in synthesis tasks. The autoencoding approach aims to sidestep the computational and labeled-data requirements of purely supervised methods while yielding representations of phonetic content that are robust to speaker-specific attributes.
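To make the discrete bottleneck concrete, the following minimal PyTorch sketch implements a standard VQ-VAE quantization layer with a codebook lookup, commitment loss, and straight-through gradient. The codebook size, latent dimension, and commitment weight are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Sketch of a VQ-VAE bottleneck; hyperparameters are illustrative."""

    def __init__(self, codebook_size=512, latent_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z_e):
        # z_e: (batch, time, latent_dim) continuous encoder outputs.
        codebook = self.codebook.weight.unsqueeze(0).expand(z_e.size(0), -1, -1)
        distances = torch.cdist(z_e, codebook)          # (batch, time, codebook_size)
        tokens = distances.argmin(dim=-1)               # discrete code indices
        z_q = self.codebook(tokens)                     # quantized latents

        # Codebook and commitment terms of the standard VQ-VAE objective.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: gradients flow around the argmin.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, tokens, vq_loss
```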
Within this framework, the authors argue that the WaveNet autoencoder promotes a disentangled, phoneme-focused representation by offloading the reconstruction of low-level signal detail onto the high-capacity decoder, leaving the latent code free to encode higher-level content. The VQ-VAE variant, in particular, yields the clearest separation between phonetic content and speaker information, making it the focal point of the paper.
Quantitative Evaluation
The paper provides detailed quantitative evaluations of learned representations across several dimensions:
- Speaker independence: the degree to which the representation is free of speaker-specific information (a probing-style check for this and the next criterion is sketched after the list).
- Phonetic prediction: how accurately phonetic content can be predicted from the representation.
- Reconstruction quality: how well individual spectrogram frames can be reconstructed from the representation.
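The first two criteria are commonly assessed with simple probing classifiers fitted on frozen latent frames. The sketch below assumes frame-level latents paired with phoneme and speaker labels, and uses a scikit-learn logistic-regression probe as an illustrative stand-in rather than the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(latents, labels, test_fraction=0.2, seed=0):
    """latents: (num_frames, latent_dim) array; labels: (num_frames,) integer ids."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(latents))
    split = int(len(latents) * (1 - test_fraction))
    train, test = order[:split], order[split:]
    clf = LogisticRegression(max_iter=1000).fit(latents[train], labels[train])
    return clf.score(latents[test], labels[test])

# High phoneme-probe accuracy together with near-chance speaker-probe accuracy
# suggests the representation keeps phonetic content while discarding speaker identity.
# phoneme_acc = probe_accuracy(latents, phoneme_labels)
# speaker_acc = probe_accuracy(latents, speaker_labels)
```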
For the discrete encodings of the VQ-VAE, the paper adds a token interpretability analysis that maps discrete codes to phonemes, achieving results competitive with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task. This indicates that the discrete latents capture fine phonetic distinctions.
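One simple way to realize such a token-to-phoneme mapping, sketched below under the assumption of frame-aligned token and phoneme sequences, is a majority-vote assignment scored by frame accuracy; the helper names and the mapping rule are illustrative, not the paper's exact analysis.

```python
from collections import Counter, defaultdict

def map_tokens_to_phonemes(token_seq, phoneme_seq):
    """Map each discrete token to the phoneme it most often co-occurs with.

    token_seq and phoneme_seq are equal-length, frame-aligned sequences.
    """
    counts = defaultdict(Counter)
    for tok, ph in zip(token_seq, phoneme_seq):
        counts[tok][ph] += 1
    return {tok: ctr.most_common(1)[0][0] for tok, ctr in counts.items()}

def frame_accuracy(token_seq, phoneme_seq, mapping):
    """Fraction of frames whose mapped token matches the reference phoneme."""
    hits = sum(mapping.get(tok) == ph for tok, ph in zip(token_seq, phoneme_seq))
    return hits / len(token_seq)
```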
Implications and Future Directions
The findings underscore the potential of autoencoding architectures with powerful decoders such as WaveNet to enable more resource-efficient solutions for ASR in low-resource settings. The measurable improvement in phonetic representation quality, especially with discrete latent variables, suggests promising pathways for acoustic unit discovery and speaker-invariant ASR systems.
Additionally, the paper introduces a time-jitter regularization that further improves the learned representation by discouraging the latent codes from relying on precise temporal alignment; a minimal sketch of the idea follows.
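The sketch below shows one way to implement the time-jitter idea on a batch of latent sequences: during training, each timestep may be replaced by one of its temporal neighbors. The replacement probability and the exact replacement rule are illustrative assumptions rather than the paper's precise recipe.

```python
import torch

def time_jitter(z, replace_prob=0.12):
    """z: (batch, time, latent_dim) latent sequence; returns a jittered copy."""
    batch, time, _ = z.shape
    idx = torch.arange(time, device=z.device).unsqueeze(0).expand(batch, -1)
    # For each timestep, pick a neighbor offset of -1 or +1 at random.
    shift = torch.randint(0, 2, (batch, time), device=z.device) * 2 - 1
    # Replace a timestep's index with its neighbor's index with probability replace_prob.
    mask = torch.rand(batch, time, device=z.device) < replace_prob
    idx = torch.where(mask, (idx + shift).clamp(0, time - 1), idx)
    # Gather the (possibly jittered) latent vectors along the time axis.
    return torch.gather(z, 1, idx.unsqueeze(-1).expand(-1, -1, z.size(-1)))
```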
Future work may enrich these unsupervised models with hybrid approaches that integrate probabilistic frameworks or exploit larger corpora to generalize the learned representations across languages and dialects. Overall, this research shows how unsupervised learning methodologies can complement existing supervised speech models, reducing their dependence on extensive labeled datasets while advancing the state of the art in speech representation learning. As unsupervised methods evolve, particularly those combining generative and autoencoding paradigms, their application to speech technology may advance both theoretical understanding and practical systems.