UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset (2107.05233v1)

Published 12 Jul 2021 in eess.AS

Abstract: Recently, there has been vast interest in self-supervised learning (SSL), where a model is pre-trained on large-scale unlabeled data and then fine-tuned on a small labeled dataset. The common wisdom is that SSL helps resource-limited tasks in which only a limited amount of labeled data is available, and that the benefit of SSL diminishes as the amount of labeled training data increases. To the best of our knowledge, at most a few thousand hours of labeled data have been used in studies of SSL. In contrast, industry typically uses tens of thousands of hours of labeled data to build high-accuracy automatic speech recognition (ASR) systems for resource-rich languages. In this study, we take on the challenge of investigating whether and how SSL can improve the ASR accuracy of a state-of-the-art production-scale Transformer-Transducer model, which was built with 65 thousand hours of anonymized labeled EN-US data.

Authors (7)
  1. Chengyi Wang (32 papers)
  2. Yu Wu (196 papers)
  3. Shujie Liu (101 papers)
  4. Jinyu Li (164 papers)
  5. Yao Qian (37 papers)
  6. Kenichi Kumatani (15 papers)
  7. Furu Wei (291 papers)
Citations (12)