
Joint Unsupervised and Supervised Training for Multilingual ASR (2111.08137v1)

Published 15 Nov 2021 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Self-supervised training has shown promising gains in pretraining models and facilitating the downstream finetuning for speech recognition, like multilingual ASR. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public dataset Multilingual LibriSpeech (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art methods, and beat the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER of all languages outperforms average monolingual baseline by 33.3%, and the state-of-the-art 2-stage XLSR by 32%. On low-resource languages like Polish, our WER is less than half of the monolingual baseline and even beats the supervised transfer learning method which uses external supervision.

Citations (54)

Summary

  • The paper presents JUST, a unified approach merging self-supervised and supervised losses to enhance multilingual ASR accuracy.
  • The methodology combines contrastive and masked language modeling (MLM) losses with the RNN-T loss, reducing average WER by 33.3% relative to monolingual baselines.
  • Experimental results show that JUST outperforms state-of-the-art models, notably improving recognition of low-resource languages such as Polish.

Joint Unsupervised and Supervised Training for Multilingual ASR

The paper presents an approach to training multilingual Automatic Speech Recognition (ASR) systems that integrates unsupervised and supervised learning in a single training procedure. The work improves on existing two-stage learning paradigms for multilingual ASR and directly addresses the challenges of imbalanced datasets and low-resource languages.

The proposed model, Joint Unsupervised and Supervised Training (JUST), integrates self-supervised learning with a supervised end-to-end (E2E) loss within a single training framework, in contrast to traditional two-stage systems that separate pretraining and finetuning. JUST combines the supervised Recurrent Neural Network Transducer (RNN-T) loss with self-supervised contrastive and masked language modeling (MLM) losses, and is evaluated on the Multilingual LibriSpeech (MLS) dataset. MLS covers eight languages with a heavily imbalanced corpus, a setting that is challenging for ASR because learning must remain balanced across language representations.
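As a rough illustration of the single-stage idea (and not the paper's exact formula), the sketch below folds pre-computed contrastive, MLM, and RNN-T loss terms into one objective so that every optimizer step updates the shared encoder with all three signals. The weight `w_unsup` and the function name are assumptions made for illustration.

```python
import torch

def just_objective(l_contrastive: torch.Tensor,
                   l_mlm: torch.Tensor,
                   l_rnnt: torch.Tensor,
                   w_unsup: float = 0.1) -> torch.Tensor:
    """Single-stage JUST-style objective (illustrative weighting only).

    All three terms are optimized in the same backward pass, unlike a
    2-stage recipe where the self-supervised losses are dropped after
    pretraining.
    """
    return l_rnnt + w_unsup * (l_contrastive + l_mlm)
```

Calling `.backward()` on this sum propagates gradients through the shared encoder from the supervised and self-supervised terms in a single step, which is what distinguishes joint training from pretrain-then-finetune.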

The experimental evaluation on MLS shows that JUST consistently surpasses state-of-the-art multilingual ASR models, including the widely used XLSR framework. Notably, JUST achieves an average relative Word Error Rate (WER) reduction of 33.3% over the monolingual baselines and 32% over XLSR, underscoring the benefit of optimizing the unsupervised and supervised losses together. For low-resource languages such as Polish, the proposed method's WER is less than half of the monolingual baseline and outperforms a transfer learning approach that relies on external supervision.
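For context on how such figures are read: the 33.3% and 32% numbers are relative reductions in average WER, i.e. the reduction expressed as a fraction of the baseline's WER. The values in the snippet below are made up for illustration and are not the paper's per-language results.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction, as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Illustrative numbers only (not from the paper): a drop from 12.0% to
# 8.0% WER is a 33.3% relative reduction.
print(round(relative_wer_reduction(12.0, 8.0), 1))  # 33.3
```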

Architecturally, JUST combines contrastive and transformer-based MLM representation learning with the supervised RNN-T objective. A feature encoder condenses raw audio features into latent speech representations, which are then passed to a contrastive-loss branch and an MLM branch; the two branches capture complementary aspects of the speech representation. Training these self-supervised branches jointly with the overarching supervised objective keeps the shared encoder robust across the imbalanced multilingual training data.
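To make the data flow concrete, here is a deliberately simplified, self-contained sketch of the two self-supervised branches. It is not the paper's implementation: the conformer encoder is replaced by a small GRU, the quantizer by a nearest-neighbour lookup into a randomly initialized codebook, and masking by zeroing random frames; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JustSelfSupervisedHeads(nn.Module):
    """Simplified sketch of the contrastive and MLM branches (illustrative)."""

    def __init__(self, feat_dim=80, latent_dim=256, codebook_size=320):
        super().__init__()
        # Feature encoder: raw acoustic features -> latent speech representations.
        self.feature_encoder = nn.Sequential(
            nn.Linear(feat_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Stand-in for the conformer context network.
        self.context_net = nn.GRU(latent_dim, latent_dim, batch_first=True)
        # Stand-in for the quantizer's codebook.
        self.codebook = nn.Parameter(torch.randn(codebook_size, latent_dim))
        self.mlm_head = nn.Linear(latent_dim, codebook_size)

    def forward(self, feats, mask_prob=0.3, temperature=0.1):
        latents = self.feature_encoder(feats)                    # (B, T, D)
        mask = torch.rand(latents.shape[:2], device=feats.device) < mask_prob

        # Quantize the unmasked latents to get discrete targets.
        sims = F.normalize(latents, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        target_ids = sims.argmax(dim=-1)                         # (B, T)
        targets = self.codebook[target_ids]                      # (B, T, D)

        # Contextualize the masked latents (masked frames zeroed out here).
        masked_latents = latents.masked_fill(mask.unsqueeze(-1), 0.0)
        context, _ = self.context_net(masked_latents)            # (B, T, D)

        # Contrastive branch: each masked frame must pick out its own
        # quantized target among the other frames of the same utterance.
        logits = F.normalize(context, dim=-1) @ F.normalize(targets, dim=-1).transpose(1, 2)
        logits = logits / temperature                            # (B, T, T)
        labels = torch.arange(feats.size(1), device=feats.device).expand(feats.size(0), -1)
        contrastive = F.cross_entropy(logits[mask], labels[mask])

        # MLM branch: predict the codebook id of each masked frame.
        mlm = F.cross_entropy(self.mlm_head(context)[mask], target_ids[mask])

        return contrastive, mlm, context
```

In a full system the returned `context` would also feed the RNN-T decoder branch, whose supervised loss is added to the two self-supervised terms as sketched earlier.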

The paper’s findings have notable implications. Practically, JUST can significantly enhance ASR capabilities in resource-constrained languages, promoting inclusivity and accessibility. Theoretically, it demonstrates the power of integrating diverse learning paradigms to overcome the prevalent challenges of catastrophic forgetting and fine-tuning discrepancies in multilingual ASR settings.

Looking forward, the research sets a precedent for integrating multilayered self-supervised objectives in ASR systems and speculates on extending the JUST framework to accommodate further languages or novel unsupervised objectives. Such developments could pave the way for universally robust and scalable multilingual ASR systems without the need for extensive supervised learning resources.
