Improving Massively Multilingual ASR With Auxiliary CTC Objectives
The paper addresses the enhancement of massively multilingual automatic speech recognition (ASR), focusing on improving performance across a wide spectrum of languages. It targets FLEURS, a 102-language ASR benchmark known for its typological diversity and low-resource nature. The research optimizes multilingual ASR models by conditioning them on language identity (LID) through auxiliary connectionist temporal classification (CTC) objectives.
Research Context and Techniques
In multilingual ASR, a central challenge is handling the variability in phonology, grammar, and script across languages. The solution proposed in this paper is to model LID explicitly in order to improve transcription accuracy. The authors build on a hybrid CTC/Attention architecture and augment it with self-conditioned CTC, in which intermediate CTC predictions computed at inner encoder layers are projected back into the feature stream and consumed by subsequent layers.
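To make the mechanism concrete, the following is a minimal PyTorch sketch of self-conditioned CTC. The layer counts, dimensions, and choice of conditioning layers are illustrative assumptions, not the authors' exact implementation (which sits inside a full hybrid CTC/Attention ASR toolkit).

```python
import torch.nn as nn

class SelfConditionedEncoder(nn.Module):
    """Transformer encoder with intermediate CTC self-conditioning (sketch)."""

    def __init__(self, d_model=256, n_heads=4, n_layers=6,
                 vocab_size=500, cond_after=(1, 3)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.ctc_head = nn.Linear(d_model, vocab_size)   # shared CTC projection
        self.reproject = nn.Linear(vocab_size, d_model)  # posteriors -> features
        self.cond_after = set(cond_after)                # layers that condition

    def forward(self, x):
        """x: (batch, time, d_model) features already projected to d_model."""
        inter_logits = []  # collected for auxiliary (intermediate) CTC losses
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in self.cond_after:
                logits = self.ctc_head(x)  # (batch, time, vocab)
                inter_logits.append(logits)
                # self-conditioning: feed CTC posteriors back into the stack
                x = x + self.reproject(logits.softmax(dim=-1))
        final_logits = self.ctc_head(x)    # final-layer CTC logits
        return final_logits, inter_logits
```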
The central innovation is a hierarchical conditioning approach: earlier encoder layers are trained to predict LID, and later layers condition transcription on that explicit language estimate. By settling the language question first, the model separates language recognition from transcription and avoids feeding the noise of early, unreliable transcription predictions back into the encoder.
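The training objective can be read as a hybrid CTC/attention loss whose CTC share is split between the final transcript objective and an early-layer LID objective. The sketch below shows one plausible way to combine the terms; the function name, tensor shapes, and weights are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def hybrid_loss(final_logits, lid_logits, dec_loss,
                transcript, transcript_lens, lid_targets, lid_lens,
                feat_lens, ctc_weight=0.3, inter_weight=0.5):
    """Hybrid CTC/attention loss with an auxiliary LID CTC term (sketch).

    final_logits: (B, T, V) CTC logits from the last encoder layer.
    lid_logits:   (B, T, V) intermediate logits from an early layer,
                  trained against the LID token(s) only.
    dec_loss:     attention-decoder cross-entropy, computed elsewhere.
    """
    # standard CTC over the full transcript at the final encoder layer
    ctc = F.ctc_loss(final_logits.log_softmax(-1).transpose(0, 1),
                     transcript, feat_lens, transcript_lens,
                     zero_infinity=True)
    # auxiliary CTC over the language-ID target sequence
    lid_ctc = F.ctc_loss(lid_logits.log_softmax(-1).transpose(0, 1),
                         lid_targets, feat_lens, lid_lens,
                         zero_infinity=True)
    # interpolate decoder and CTC terms; split the CTC share between
    # the transcript objective and the intermediate LID objective
    ctc_total = (1 - inter_weight) * ctc + inter_weight * lid_ctc
    return (1 - ctc_weight) * dec_loss + ctc_weight * ctc_total
```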
Experimental Configuration and Results
The experiments use the FLEURS dataset, whose 102 languages each come with limited training data (7-10 hours per language). The researchers feed the models self-supervised learning (SSL) features from XLS-R and WavLM, and evaluate both Transformer and Conformer encoder architectures.
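For readers who want to reproduce the front end, frozen SSL features can be extracted with Hugging Face Transformers roughly as below. The checkpoint names are public model IDs, but the paper's exact feature pipeline (layer weighting, downsampling) may differ; the simple layer average here is an assumption.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "microsoft/wavlm-large"  # or "facebook/wav2vec2-xls-r-300m"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
ssl = AutoModel.from_pretrained(model_id, output_hidden_states=True).eval()

waveform = torch.randn(16000 * 3)  # placeholder: 3 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = ssl(**inputs).hidden_states  # tuple: one tensor per layer
# a learned weighted sum over layers is common for downstream ASR;
# a plain average is used here for simplicity
features = torch.stack(hidden).mean(dim=0)  # (1, T, D)
```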
Key results show that conditioning on LID, especially at the token level (LID$_{\mathsf{tok}}$), significantly outperforms the alternatives, yielding a 28.4% relative reduction in Character Error Rate (CER) over the previous state of the art. Pairing the Conformer architecture with hierarchical LID conditioning improves performance further, with notable gains across language groups, particularly those underrepresented in ASR research.
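For reference, the relative CER reduction quoted above follows the standard definition:

$$\Delta_{\text{rel}} = \frac{\mathrm{CER}_{\text{baseline}} - \mathrm{CER}_{\text{new}}}{\mathrm{CER}_{\text{baseline}}} \times 100\,\%$$

so, for example, a 28.4% relative reduction would take a hypothetical baseline CER of 20.0 down to roughly 14.3 (illustrative numbers, not the paper's figures).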
Theoretical and Practical Implications
The research contributes to the growing body of work aimed at inclusive language technologies. By effectively using language identity as a foundational aspect of ASR model design, it offers a pathway toward more reliable speech recognition systems across less-resourced languages. This approach not only enhances transcription accuracy but also provides insights into the underlying multilingual decision processes, improving the explainability of ASR systems.
Future Directions
The findings suggest that similar frameworks could extend to larger datasets and more languages, potentially in collaboration with ongoing multilingual data-collection initiatives. The methodology also holds promise for integration into broader language technology tools, such as speech alignment and data cleaning pipelines, further democratizing access to speech technologies across linguistic divides.
In conclusion, this paper illustrates a sophisticated approach to multilingual ASR, emphasizing the role of language identity conditioning in optimizing model performance and reliability across diverse and low-resource language contexts.