- The paper proposes Autoregressive Predictive Coding (APC) as a self-supervised generative pre-training method to learn versatile speech representations from unlabeled data, evaluated across ASR, speaker identification, and speech translation tasks.
- Experimental results show APC-derived representations significantly outperform conventional features, with the transformer-based model achieving a 13.7% WER in ASR and doubling baseline accuracy in one-shot speaker identification.
- APC's success suggests its potential for scalability and reducing the need for labeled data and large models, offering advantages for low-resource scenarios and broader applicability.
Generative Pre-training for Speech with Autoregressive Predictive Coding: A Technical Evaluation
The paper "Generative Pre-training for Speech with Autoregressive Predictive Coding" authored by Yu-An Chung and James Glass from the Computer Science and Artificial Intelligence Laboratory at MIT, offers insight into representation learning for speech processing using autoregressive predictive coding (APC). APC is proposed as a self-supervised, generative pre-training approach designed to extract versatile and meaningful representations from unlabelled speech data. The paper further evaluates the implicit transferability of these representations across diverse speech-related tasks including automatic speech recognition (ASR), speaker identification, and speech translation.
Methodology and Experimentation
APC is pre-trained on large-scale unlabeled speech from the LibriSpeech corpus, specifically the train-clean-360 subset, which comprises 360 hours of audio. The method trains an autoregressive model to predict future frames of a speech signal given past frames, in direct analogy to the language models used for sequence prediction in NLP. The authors implement two encoder architectures for APC, one based on recurrent neural networks (RNNs) and one based on transformers, and compare their efficacy across three heavily researched speech tasks; a minimal sketch of the pre-training objective follows.
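As a concrete illustration, here is a minimal PyTorch sketch of the APC pre-training objective: an autoregressive encoder consumes log Mel frames and is trained with an L1 loss to predict the frame n steps ahead, as the paper describes. The layer sizes, the value of n, and all names below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the APC pre-training objective (assumed hyperparameters).
import torch
import torch.nn as nn

class APCEncoder(nn.Module):
    """Unidirectional RNN encoder that predicts the frame n steps ahead."""

    def __init__(self, n_mels: int = 80, hidden: int = 512, layers: int = 3):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)  # map hidden states back to frame space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(x)   # (batch, time, hidden)
        return self.proj(h)  # predicted future frames, (batch, time, n_mels)

def apc_loss(model: APCEncoder, x: torch.Tensor, n: int = 3) -> torch.Tensor:
    """L1 loss between the prediction at time t and the true frame at t + n."""
    pred = model(x[:, :-n])  # predictions from the first T - n frames
    target = x[:, n:]        # frames shifted n steps into the future
    return nn.functional.l1_loss(pred, target)

# Usage: one training step on a random batch standing in for real features.
model = APCEncoder()
batch = torch.randn(8, 200, 80)  # 8 utterances, 200 frames, 80 Mel bins
loss = apc_loss(model, batch, n=3)
loss.backward()
```

The key design point is that the prediction target lies n steps in the future rather than one, which discourages the model from simply exploiting local smoothness in the spectrogram.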
Numerical Results and Analysis
The experiments demonstrate that APC-derived speech representations surpass conventional surface features such as log Mel spectrograms. In the ASR experiments on the Wall Street Journal corpus, the transformer-based APC model achieves a word error rate (WER) of 13.7%, a relative reduction of more than 25% over the baseline. In speech translation, APC consistently outperforms existing end-to-end models and approaches the benchmark set by cascaded systems. Speaker identification tests further confirm APC's strength: its one-shot accuracy is roughly double that of the baseline techniques.
Notably, the authors observe that for speech recognition, the best results come from keeping the pre-trained APC weights frozen rather than fine-tuning them. The transformer-based model also outperforms its RNN-based counterpart, suggesting it is better suited to capturing global structure in sequential data; a sketch of the frozen-feature setup appears below.
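The sketch below illustrates that frozen-weight setup, reusing the APCEncoder class from the earlier example: the pre-trained encoder's hidden states serve as input features, and only a small downstream head is trained on labeled data. The classifier and its dimensions are placeholder assumptions; the paper's actual downstream models differ per task.

```python
# Hypothetical frozen-feature setup; APCEncoder comes from the previous sketch.
import torch
import torch.nn as nn

def extract_features(apc: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Return the frozen encoder's hidden states as downstream features."""
    apc.eval()
    for p in apc.parameters():
        p.requires_grad = False   # pre-trained weights stay fixed, per the paper
    with torch.no_grad():
        h, _ = apc.rnn(x)         # hidden states replace log Mel inputs
    return h

# Usage with random tensors standing in for real utterances.
apc = APCEncoder()                # pre-trained weights would be loaded here
features = extract_features(apc, torch.randn(8, 200, 80))
head = nn.Linear(512, 40)         # illustrative 40-way classifier, trained alone
logits = head(features)           # (batch, time, classes)
```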
Implications and Future Work
The implications of this paper extend in both theoretical and application-oriented directions within speech processing. The successful demonstration of APC suggests that such self-supervised models can scale and transfer to other domains where data annotation is a bottleneck. Moreover, the work shows that APC representations can shrink downstream models and reduce the amount of labeled data they require, a significant advantage in resource-constrained scenarios, including low-resource languages.
Future work could refine how the representation models are adapted to specific downstream tasks, and could scale up the volume of unlabeled data to further capitalize on the generative pre-training capacity. Extending APC to more complex speech applications, such as synthesis or other telecommunication tasks, could also yield promising advances.
By offering a streamlined approach to extracting rich information from raw speech and enabling robust transferability, this paper marks an informative next step in speech technology research, aligned with the broader trajectory toward more generalized AI models.