Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation (2006.05474v2)

Published 9 Jun 2020 in eess.AS, cs.CL, and cs.SD

Abstract: Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained encoder-decoder models, however, do not share the language modeling (decoder) for the same language, which is likely to be inefficient for distant target languages. We introduce speech-to-text translation (ST) as an auxiliary task to incorporate additional knowledge of the target language and enable transferring from that target language. Specifically, we first translate high-resource ASR transcripts into a target low-resource language, with which an ST model is trained. Both ST and target ASR share the same attention-based encoder-decoder architecture and vocabulary. The former task then provides a fully pre-trained model for the latter, bringing up to 24.6% word error rate (WER) reduction over the baseline (direct transfer from high-resource ASR). We show that training ST with human translations is not necessary. ST trained with machine translation (MT) pseudo-labels brings consistent gains. It can even outperform ST trained with human labels when transferred to target ASR by leveraging only 500K MT examples. Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to 8.9% WER reduction over direct transfer.
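The gains in the abstract are stated as WER reductions against a baseline. As a rough illustration of how such a figure can be computed, the sketch below implements word-level edit-distance WER and a relative reduction. It assumes the reported percentages are relative reductions, which the abstract does not state explicitly. The function names and the example WER values (50.0 and 37.7) are hypothetical and chosen only to make the arithmetic visible; they are not results from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(wer("the cat sat", "the dog sat"))
    # Hypothetical WERs: a drop from 50.0 to 37.7 is a 24.6% relative reduction.
    print(round(relative_wer_reduction(50.0, 37.7), 3))
```

Under this reading, a system that lowers a baseline WER of 50.0 to 37.7 would be reported as a 24.6% reduction, even though the absolute difference is 12.3 points.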


Authors (3)

Changhan Wang
Juan Pino
Jiatao Gu

Citations (28)

