Unsupervised Pretraining Transfers Well Across Languages: A Formal Overview
Introduction
This paper investigates how well unsupervised pretraining, in particular contrastive predictive coding (CPC), transfers learned speech features across languages for automatic speech recognition (ASR). Traditional approaches rely heavily on supervised cross-lingual and multilingual training, which requires annotated datasets that are often unavailable for low-resource languages. The paper's primary claim is that, given enough unlabeled data, unsupervised pretraining can match or surpass supervised approaches.
Methodology
The core methodology is CPC, an unsupervised learning approach that trains a convolutional encoder and an autoregressive context network with a contrastive loss: given the context up to time t, the model must distinguish the true future frames from negative samples drawn elsewhere in the signal. The authors propose several modifications to the original CPC model to improve stability and performance, including replacing batch normalization with channel-wise normalization to prevent training instability, adding a Transformer layer to the prediction network, and reducing the dimensionality of the convolutional layers for efficiency.
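To make the training objective concrete, below is a minimal PyTorch sketch of a CPC-style model with channel-wise normalization and per-step prediction heads. The layer sizes, the number of prediction steps, the use of linear heads in place of the Transformer predictor, and the negative-sampling scheme in the InfoNCE-style loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelNorm(nn.Module):
    """Normalize each channel over the time axis of a (batch, channels, time) tensor."""
    def forward(self, x):
        mean = x.mean(dim=2, keepdim=True)
        std = x.std(dim=2, keepdim=True)
        return (x - mean) / (std + 1e-5)

class CPCModel(nn.Module):
    def __init__(self, dim=256, n_predictions=12):
        super().__init__()
        # Convolutional encoder: raw waveform (B, 1, T) -> frame features (B, dim, T')
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), ChannelNorm(), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), ChannelNorm(), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), ChannelNorm(), nn.ReLU(),
        )
        # Autoregressive context network summarizing the past frames
        self.context = nn.LSTM(dim, dim, batch_first=True)
        # One prediction head per future step; the paper's variant adds a
        # Transformer layer here, a linear head keeps the sketch short
        self.predictors = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(n_predictions)
        )

    def forward(self, wav):
        z = self.encoder(wav).transpose(1, 2)  # (B, T', dim) encoded frames
        c, _ = self.context(z)                 # (B, T', dim) context states
        return z, c

def cpc_infonce_loss(z, c, predictors):
    """Contrast each predicted future frame against the other frames of the same sequence."""
    B, T, _ = z.shape
    total = 0.0
    for k, head in enumerate(predictors, start=1):
        pred = head(c[:, :T - k])              # predictions for step t + k
        target = z[:, k:]                      # true frames at t + k
        # Similarity of every prediction to every candidate frame in the sequence
        logits = torch.einsum("btd,bsd->bts", pred, target)
        labels = torch.arange(T - k, device=z.device).expand(B, -1)
        total = total + F.cross_entropy(logits.reshape(-1, T - k), labels.reshape(-1))
    return total / len(predictors)

# Toy usage on random audio standing in for unlabeled speech.
model = CPCModel()
wav = torch.randn(4, 1, 20480)
z, c = model(wav)
loss = cpc_infonce_loss(z, c, model.predictors)
loss.backward()
```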
Models are pretrained on the LibriSpeech dataset, and transfer is evaluated on several low-resource languages from the Common Voice collection. Phoneme discriminability of the frozen features is measured with the ABX error, while phoneme classification accuracy is evaluated by training linear classifiers on limited labeled data and reporting the Phone Error Rate (PER).
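As an illustration of the ABX metric, the sketch below scores (A, B, X) triples in which X and A share a phoneme category and B belongs to a different one: the ABX error is the fraction of triples for which X's features end up closer to B than to A under a DTW-aligned frame distance. This is a simplified stand-in for the official evaluation; the triple sampling and feature extraction are left abstract.

```python
import numpy as np

def frame_cosine_dist(u, v):
    """Pairwise cosine distance between frames of u (m, d) and v (n, d)."""
    u = u / (np.linalg.norm(u, axis=1, keepdims=True) + 1e-8)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    return 1.0 - u @ v.T

def dtw_distance(u, v):
    """Dynamic-time-warping alignment cost between two feature sequences."""
    d = frame_cosine_dist(u, v)
    m, n = d.shape
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[m, n] / (m + n)

def abx_error(triples):
    """triples: iterable of (a, b, x) feature arrays, where x and a share a
    phoneme category and b belongs to a different one."""
    errors, total = 0, 0
    for a, b, x in triples:
        if dtw_distance(x, a) >= dtw_distance(x, b):  # X mistaken as closer to B
            errors += 1
        total += 1
    return errors / max(total, 1)

# Toy usage with random "features" standing in for CPC frame representations.
rng = np.random.default_rng(0)
toy = [(rng.normal(size=(8, 16)), rng.normal(size=(9, 16)), rng.normal(size=(7, 16)))
       for _ in range(10)]
print("ABX error:", abx_error(toy))
```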
Results
The paper compares three main models: the original CPC, the modified CPC, and a supervised pretraining baseline.
- Within-Language Evaluation:
The modified CPC model shows markedly better phoneme discriminability on English, with a substantial reduction in ABX error compared to the original CPC.
- Cross-Language Transfer:
When transferring to other languages, the modified CPC model pretrained on 360 hours of unlabeled data comes close to matching, in some settings, supervised models pretrained on 100 hours of labeled data, indicating that unsupervised pretraining can generalize across diverse languages.
- Comparison with Bottleneck Features:
Across 11 different languages, unsupervised pretraining with the modified CPC features outperforms multilingual bottleneck features trained on a much larger multilingual corpus, showing that unsupervised methods can provide competitive cross-lingual generalization; a sketch of this frozen-feature transfer protocol follows below.
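The following sketch illustrates the frozen-feature transfer protocol referenced above: freeze the pretrained encoder, train only a linear phoneme classifier on a small labeled set in the target language, and report an error rate on held-out data. The reuse of the `CPCModel` from the earlier sketch, the frame-level phoneme labels, the data loaders, and the frame error rate used as a stand-in for PER are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def train_linear_probe(pretrained_model, train_loader, n_phonemes, dim=256, epochs=10):
    """Train only a linear phoneme classifier on top of frozen pretrained features."""
    pretrained_model.eval()                        # frozen feature extractor
    for p in pretrained_model.parameters():
        p.requires_grad_(False)

    probe = nn.Linear(dim, n_phonemes)             # only these weights are trained
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for wav, frame_labels in train_loader:     # frame_labels: (B, T') phoneme ids
            with torch.no_grad():
                _, c = pretrained_model(wav)       # (B, T', dim) context features
            logits = probe(c)                      # (B, T', n_phonemes)
            loss = ce(logits.reshape(-1, n_phonemes), frame_labels.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe

def frame_error_rate(pretrained_model, probe, eval_loader):
    """Fraction of frames whose predicted phoneme differs from the reference;
    a simple stand-in for the PER reported in the paper."""
    wrong, total = 0, 0
    with torch.no_grad():
        for wav, frame_labels in eval_loader:
            _, c = pretrained_model(wav)
            pred = probe(c).argmax(dim=-1)
            wrong += (pred != frame_labels).sum().item()
            total += frame_labels.numel()
    return wrong / max(total, 1)
```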
Implications
The findings substantiate the practicality of unsupervised pretraining for ASR systems, especially for languages underserved by linguistic resources. From a theoretical perspective, the work suggests that unsupervised and supervised pretraining can become equivalent under certain conditions, in particular when substantial amounts of unlabeled data are available to close the performance gap.
Future Directions
Potential future research could expand on:
- Scaling unsupervised pretraining using extensive unlabeled datasets to refine transferability further.
- Exploring alternative contrastive learning frameworks or pretext tasks to strengthen phoneme representations across diverse phonetic inventories.
- Integrating these findings into end-to-end ASR systems and evaluating holistic performance across diverse acoustic environments.
In conclusion, this paper marks a significant stride toward democratizing ASR technology for low-resource languages, suggesting a shift to unsupervised learning paradigms where large annotated datasets are unattainable.