Multilingual Speech Recognition with a Single End-to-End Model
The paper "Multilingual Speech Recognition with a Single End-to-End Model" explores sequence-to-sequence models for multilingual automatic speech recognition (ASR) using end-to-end learning. The work addresses the complexity that language-specific lexicons, subword units, and word inventories introduce into conventional ASR systems. It proposes a unified model that folds the acoustic, pronunciation, and language models into a single network, applied to speech recognition across nine Indian languages.
Model Architecture and Training
The authors utilize the Listen, Attend and Spell (LAS) model—a sequence-to-sequence framework consisting of encoder-decoder-attention modules.
- Encoder: A stacked bidirectional LSTM architecture that processes 80-dimensional log-mel acoustic features, where frames are stacked and stride applied for downsampling. The encoder functions analogously to an acoustic model.
- Decoder: A unidirectional RNN that predicts character sequences, acting much like a language model, conditioned on encoder state vectors aggregated through an attention mechanism.
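The frame stacking and striding mentioned for the encoder can be sketched in a few lines. The specific values below (stacking 3 frames with stride 3) are illustrative assumptions, not necessarily the paper's exact configuration:

```python
def stack_and_stride(frames, stack=3, stride=3):
    """Downsample a sequence of acoustic feature frames by concatenating
    `stack` consecutive frames into one wider frame and advancing by
    `stride`, shortening the sequence the encoder must process."""
    out = []
    for i in range(0, len(frames) - stack + 1, stride):
        # Concatenate `stack` consecutive frames into a single vector.
        stacked = [x for frame in frames[i:i + stack] for x in frame]
        out.append(stacked)
    return out

# 12 frames of 80-dimensional log-mel features (zeros as placeholders).
feats = [[0.0] * 80 for _ in range(12)]
reduced = stack_and_stride(feats)
print(len(reduced), len(reduced[0]))  # 4 240
```

Downsampling like this shortens the encoder input roughly threefold, which makes the bidirectional LSTM stack cheaper to run and the attention alignment easier to learn.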
Trained initially without any language identifier, the model nonetheless learns to distinguish between the languages reliably, despite significant differences in their character sets. Providing a language identifier as an additional input further improves performance and reduces errors caused by language confusion.
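One simple way to condition a sequence-to-sequence model on a language identifier is to prepend a language tag token to the target sequence. This is a sketch under that assumption; the tag names and tokenization below are hypothetical, and the paper also considers feeding the identifier as a dense input vector:

```python
# Hypothetical language tag tokens; the paper's exact scheme may differ.
LANG_TAGS = {"hindi": "<hi>", "tamil": "<ta>", "bengali": "<bn>"}

def add_language_tag(transcript, language):
    """Prepend a language-identifier token to the character sequence,
    so the decoder sees the language before emitting any characters."""
    return [LANG_TAGS[language]] + list(transcript)

tagged = add_language_tag("नमस्ते", "hindi")
print(tagged[0])  # <hi>
```

With the tag in place, the decoder is explicitly told which script and character inventory to emit, rather than having to infer the language from the audio alone.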
Results and Analysis
Performance benchmarks reveal:
- Joint Model Performance: A unified LAS model trained on data from nine Indian languages achieves a 21% relative reduction in word error rate (WER) over monolingual models, discriminating between languages implicitly without explicit identifiers.
- Language Conditioned Variants: Additional conditioning on language identifiers further reduced WER by up to 7% relative, notably improving accuracy and maintaining a low confusion rate between languages.
- Insights on Code-Switching: While the model differentiates languages well, it struggles to switch languages mid-utterance (code-switching), suggesting that the implicit language model learned by the decoder can dominate the acoustic evidence in end-to-end configurations.
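The relative improvements reported above follow a standard formula: the fraction of the baseline error that the new model eliminates. The WER values in the example are hypothetical, chosen only to illustrate a 21% relative reduction:

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER reduction: fraction of baseline error eliminated."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a baseline WER of 30.0 dropping to 23.7
# corresponds to a 21% relative improvement.
print(round(relative_wer_improvement(30.0, 23.7), 2))  # 0.21
```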
Practical Implications and Future Directions
The paper underscores advancements in ASR technology, highlighting significant implications for resource-constrained multilingual settings. The demonstrated capacity for end-to-end models to encapsulate complexity in multilingual datasets could simplify deployment and enhance performance in multi-language applications, particularly in diverse linguistic regions.
Future work could explore integrating these models with separate language-specific components or expanding the datasets to encompass a broader spectrum of languages. Attention could also focus on refining the handling of code-switching phenomena, thus widening the applicability of end-to-end ASR frameworks in realistic scenarios where mixed-language usage is prevalent.
Conclusion
This research shows how to build a compact multilingual speech model, with or without explicit language identifiers, within a unified ASR framework. It is a noteworthy contribution to handling multilingual datasets and diverse linguistic structures in a single model, and it motivates further work on optimizing and extending multilingual capabilities in ASR systems.