- The paper presents JUST, a unified approach merging self-supervised and supervised losses to enhance multilingual ASR accuracy.
- The methodology combines contrastive and masked language modeling losses with an RNN-T loss, yielding an average relative word error rate reduction of roughly 33% over monolingual baselines.
- Experimental results show that JUST outperforms state-of-the-art multilingual models, with especially large gains on low-resource languages such as Polish.
Joint Unsupervised and Supervised Training for Multilingual ASR
The paper presents a novel approach to training multilingual Automatic Speech Recognition (ASR) systems in which unsupervised and supervised learning are combined in a single training stage. The work improves on existing two-stage (pretrain-then-finetune) paradigms for multilingual ASR and addresses the challenges posed by imbalanced datasets and low-resource languages.
The proposed model, Joint Unsupervised and Supervised Training (JUST), integrates self-supervised learning with a supervised end-to-end (E2E) loss within a single training framework, in contrast to traditional two-stage systems that separate pretraining and finetuning. JUST jointly optimizes the supervised Recurrent Neural Network Transducer (RNN-T) loss together with self-supervised contrastive and masked language modeling (MLM) losses, and is trained and evaluated on the Multilingual LibriSpeech (MLS) dataset. MLS is an imbalanced corpus spanning eight languages, which makes balanced learning across language representations a central challenge.
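Schematically, the training objective is a sum of the two self-supervised terms and the supervised RNN-T term. The single weighting coefficient $w$ on the supervised loss shown here is an illustrative assumption rather than the paper's exact formulation:

$$\mathcal{L}_{\text{JUST}} = \mathcal{L}_{\text{contrastive}} + \mathcal{L}_{\text{MLM}} + w \cdot \mathcal{L}_{\text{RNN-T}}$$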
The experimental evaluation on MLS shows that JUST consistently surpasses state-of-the-art multilingual ASR models, including the widely used XLSR framework. Notably, JUST achieves an average relative Word Error Rate (WER) reduction of 33.3% over monolingual baselines and 32% over XLSR, underscoring the benefit of optimizing unsupervised and supervised losses jointly. For the low-resource language Polish in particular, the proposed method's WER is less than half that of the monolingual baseline, and it outperforms transfer learning approaches that rely on external supervision.
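For reference, the relative WER reductions quoted above follow the standard definition (notation here is illustrative, not taken from the paper):

$$\Delta_{\text{WER}} = \frac{\text{WER}_{\text{baseline}} - \text{WER}_{\text{JUST}}}{\text{WER}_{\text{baseline}}} \times 100\%$$

so a WER "less than half" of the baseline corresponds to a relative reduction above 50%.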
Architecturally, JUST combines contrastive and Transformer-based MLM representations with the supervised RNN-T objective. A feature encoder condenses raw audio features into latent speech representations, which are fed to a contrastive module and an MLM module: the contrastive module learns to match context vectors to their quantized targets, while the MLM module predicts the quantized token IDs at masked positions, capturing contextual structure. This dual unsupervised learning, combined with the overarching supervised objective, yields robust optimization in multilingual training settings; a schematic sketch of the pipeline follows below.
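The following is a minimal PyTorch-style sketch of how the three losses could be combined in one training step. It is not the authors' implementation: the toy nearest-neighbor quantizer, module sizes, masking scheme, and the weight `w_sup` are assumptions introduced for illustration (the real model uses a w2v-BERT-style Conformer stack with Gumbel/product quantization).

```python
# Schematic sketch (not the authors' code) of a JUST-style joint training step.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio


class JUSTSketch(nn.Module):
    def __init__(self, n_mels=80, d=256, codes=320, vocab=64, blank=0):
        super().__init__()
        self.blank = blank
        # Feature encoder: condenses log-mel frames into latent representations.
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(n_mels, d, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        # Contrastive module and MLM module stacked on the shared latents.
        self.contrastive_net = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.mlm_net = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.codebook = nn.Parameter(torch.randn(codes, d))  # toy quantizer
        self.mlm_head = nn.Linear(d, codes)
        # Supervised RNN-T branch: prediction network + joint network.
        self.embed = nn.Embedding(vocab, vocab)
        self.pred_net = nn.LSTM(vocab, d, batch_first=True)
        self.joint = nn.Linear(d, vocab)

    def forward(self, mels, mel_lens, targets, target_lens,
                mask_prob=0.3, w_sup=1.0):
        # 1) Shared latent speech representations, (B, T, d).
        z = self.feature_encoder(mels.transpose(1, 2)).transpose(1, 2)
        B, T, d = z.shape

        # 2) Toy quantization: nearest codebook entry per frame gives target IDs.
        dist = torch.cdist(z.reshape(-1, d), self.codebook)      # (B*T, codes)
        code_ids = dist.argmin(dim=-1).reshape(B, T)

        # 3) Mask a subset of frames, shared by both self-supervised losses.
        mask = torch.rand(B, T, device=z.device) < mask_prob
        z_masked = z.masked_fill(mask.unsqueeze(-1), 0.0)

        # 4) Simplified contrastive term: each context vector must match its own
        #    codebook entry, with the other entries acting as distractors.
        c = self.contrastive_net(z_masked)                        # (B, T, d)
        logits_c = c @ self.codebook.t()                          # (B, T, codes)
        loss_contrastive = F.cross_entropy(logits_c[mask], code_ids[mask])

        # 5) MLM term: a second stack predicts code IDs at the masked positions.
        m = self.mlm_net(c)
        logits_m = self.mlm_head(m)
        loss_mlm = F.cross_entropy(logits_m[mask], code_ids[mask])

        # 6) Supervised RNN-T loss on the same encoder output.
        lat_lens = torch.div(mel_lens + 3, 4, rounding_mode="floor").clamp(max=T)
        blanks = torch.full((B, 1), self.blank, dtype=targets.dtype,
                            device=z.device)
        pred_in = self.embed(torch.cat([blanks, targets], dim=1))  # (B, U+1, vocab)
        p, _ = self.pred_net(pred_in)                              # (B, U+1, d)
        joint = torch.tanh(m.unsqueeze(2) + p.unsqueeze(1))        # (B, T, U+1, d)
        logits_rnnt = self.joint(joint)                            # (B, T, U+1, vocab)
        loss_rnnt = torchaudio.functional.rnnt_loss(
            logits_rnnt, targets.int(), lat_lens.int(), target_lens.int(),
            blank=self.blank)

        # 7) Joint objective: unsupervised terms plus weighted supervised term.
        return loss_contrastive + loss_mlm + w_sup * loss_rnnt
```

The sketch mirrors the stacking order described above: the MLM module consumes the contrastive module's output, and the supervised RNN-T branch reads the same shared encoder states, so a single backward pass updates all components jointly rather than in separate pretraining and finetuning phases.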
The paper’s findings have notable implications. Practically, JUST can significantly enhance ASR capabilities in resource-constrained languages, promoting inclusivity and accessibility. Theoretically, it demonstrates the power of integrating diverse learning paradigms to overcome the prevalent challenges of catastrophic forgetting and fine-tuning discrepancies in multilingual ASR settings.
Looking forward, the research sets a precedent for integrating multilayered self-supervised objectives in ASR systems and speculates on extending the JUST framework to accommodate further languages or novel unsupervised objectives. Such developments could pave the way for universally robust and scalable multilingual ASR systems without the need for extensive supervised learning resources.