Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model
The paper "Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model" contributes to the ongoing research on automatic speech recognition (ASR) systems, specifically those supporting multiple languages. The proposed system, which contrasts conventional monolingual systems, leverages the advantages of end-to-end (E2E) models to simplify and improve handling of multilingual corpuses, aiming to expand ASR support for diverse languages across the globe.
Key Findings and Contributions
The paper investigates an E2E model built on the Recurrent Neural Network Transducer (RNN-T) architecture, showcasing its potential for real-time applications while addressing the challenges posed by imbalanced multilingual data. Using nine Indic languages as a test bed, the work establishes a strong benchmark for multilingual speech recognition. The authors demonstrate that a single multilingual E2E system can compete with state-of-the-art monolingual systems in both accuracy and efficiency, achieving at least a 10% relative reduction in word error rate (WER) compared to monolingual conventional systems across all nine languages.
Major contributions of the work include:
- Streaming Real-Time Integration: The implementation of a streaming E2E model suitable for interactive applications, addressing latency constraints typically unmet by attention-based models.
- Data Imbalance Management: The paper explores effective techniques for handling the imbalanced data distributions inherent in multilingual training: conditioning on language vectors, adjusting per-language sampling ratios during training, and incorporating language-specific adapter modules. The combination of language vectors with adapter layers strikes the best balance, counteracting the bias toward languages with larger datasets (a schematic example of the sampling idea follows this list).
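To make the sampling idea concrete, the snippet below sketches one common way to flatten a skewed language distribution during training. The per-language utterance counts and the `alpha` exponent are illustrative assumptions, not figures or the exact scheme from the paper.

```python
# A minimal sketch of per-language data sampling, not the paper's exact recipe.
# The counts below are made-up placeholders; alpha < 1 flattens the natural
# distribution so low-resource languages are seen more often during training.
import random

utterance_counts = {          # hypothetical per-language utterance counts
    "hi": 500_000, "bn": 120_000, "ta": 90_000, "te": 80_000,
    "mr": 60_000, "gu": 50_000, "kn": 40_000, "ml": 35_000, "ur": 30_000,
}

def sampling_probs(counts, alpha=0.5):
    """Map raw counts to sampling probabilities; alpha < 1 upweights
    low-resource languages relative to their natural frequency."""
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

probs = sampling_probs(utterance_counts)
languages = list(probs)
batch_langs = random.choices(languages, weights=[probs[l] for l in languages], k=32)
print(batch_langs[:8])  # languages the next utterances would be drawn from
```

With `alpha = 1.0` the natural distribution is reproduced unchanged, while smaller values sample low-resource languages more often at the cost of repeating their data more frequently.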
Methodology and Experimental Setup
The experimental setup comprises anonymized, human-transcribed data from Google spanning nine Indian languages. The model architecture is the RNN-T, which combines an encoder, a prediction network, and a joint network; because it avoids the conditional independence assumptions of earlier E2E approaches and processes audio frame by frame, it lends itself naturally to a streaming implementation. Both the encoder and the prediction network are built from long short-term memory (LSTM) layers.
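As a rough illustration of this decomposition, the following PyTorch snippet sketches a toy RNN-T skeleton. The layer counts, dimensions, and the concatenation-based joint network are simplifying assumptions and do not reflect the paper's actual configuration.

```python
# A minimal, illustrative RNN-T skeleton; sizes are arbitrary placeholders.
import torch
import torch.nn as nn

class TinyRNNT(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=128, hidden=256, joint=256):
        super().__init__()
        # Encoder: unidirectional LSTMs keep the model streamable.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # Prediction network: conditions only on previously emitted labels.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        # Joint network scores every (time frame, label position) pair.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, joint), nn.Tanh(), nn.Linear(joint, vocab_size)
        )

    def forward(self, feats, labels):
        enc, _ = self.encoder(feats)                   # (B, T, H)
        pred, _ = self.predictor(self.embed(labels))   # (B, U, H)
        # Broadcast both streams to (B, T, U, H) and combine them.
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))   # (B, T, U, vocab)

feats = torch.randn(2, 50, 80)                 # 2 utterances, 50 frames
labels = torch.randint(0, 128, (2, 10))        # 10 previous label tokens
logits = TinyRNNT()(feats, labels)
print(logits.shape)                            # torch.Size([2, 50, 10, 128])
```

The unidirectional encoder is what makes frame-by-frame (streaming) operation possible, since no future context is required before emitting outputs.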
Techniques Addressing Data Imbalance:
- Data Sampling: Several sampling strategies are assessed; the paper observes that boosting the sampling ratio of smaller datasets brings only marginal gains once language vectors are employed, suggesting that model capacity is already allocated adequately across languages.
- Language Vector Conditioning: Feeding the model a vector encoding the utterance's language gives promising results, disambiguating the output language and aiding robust modeling of shared vocabularies across Indic languages with overlapping phonemes and scripts.
- Adapter Modules: Small, language-specific adapter layers are added to the shared network and refined per language with minimal parameter overhead, proving effective at specializing outputs for languages with distinct acoustic characteristics (a combined sketch of language vectors and adapters follows this list).
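The sketch below illustrates, under simplifying assumptions, how a one-hot language vector and per-language adapter layers could be attached to a single shared encoder layer. In the paper the adapters are refined per language on top of an already-trained shared model; here everything appears in one module purely for illustration, with made-up dimensions.

```python
# Schematic combination of language-vector conditioning and per-language
# adapters; placement and sizes are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

NUM_LANGS, FEAT_DIM, HIDDEN, BOTTLENECK = 9, 80, 256, 64

class Adapter(nn.Module):
    """Small residual bottleneck specialised to one language."""
    def __init__(self, dim=HIDDEN, bottleneck=BOTTLENECK):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))

class LangConditionedLayer(nn.Module):
    """One shared LSTM layer followed by a per-language adapter."""
    def __init__(self):
        super().__init__()
        # The language one-hot is appended to every input frame.
        self.lstm = nn.LSTM(FEAT_DIM + NUM_LANGS, HIDDEN, batch_first=True)
        self.adapters = nn.ModuleList([Adapter() for _ in range(NUM_LANGS)])

    def forward(self, feats, lang_id):
        one_hot = torch.zeros(feats.size(0), feats.size(1), NUM_LANGS)
        one_hot[:, :, lang_id] = 1.0
        shared, _ = self.lstm(torch.cat([feats, one_hot], dim=-1))
        return self.adapters[lang_id](shared)   # language-specific refinement

layer = LangConditionedLayer()
out = layer(torch.randn(2, 50, FEAT_DIM), lang_id=3)
print(out.shape)   # torch.Size([2, 50, 256])
```

The bottleneck keeps the per-language parameter count small relative to the shared LSTM, which is what allows nine languages to be served without nine full models.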
Implications and Future Directions
The implications of the presented work are substantial, both practically and theoretically. Replacing multiple isolated monolingual models with a single, integrated multilingual model offers advantages in computational efficiency, resource allocation, and reduced maintenance complexity, potentially simplifying deployment across diverse linguistic settings. The approach also opens the door to other multilingual contexts and to further optimizations such as dynamic data sampling and deeper language-specific adaptation.
Future research could focus on expanding the number of supported languages and investigating alternative neural architectures for further reductions in latency and error rates. Integrating complementary methodologies, such as cross-lingual transfer learning and unsupervised adaptation techniques, could extend the model's capabilities. The findings encourage broader exploration of applications, envisioning ASR models that are inclusive across linguistic diversity.