- The paper presents a configurable multilingual ASR model that enables dynamic language selection without requiring multiple per-language models.
- It leverages a Transformer Transducer architecture with a universal encoder and lightweight language-specific modules to balance scalability and accuracy.
- Empirical results show up to a 26.0% relative WER reduction and improved code-switching performance, highlighting significant practical deployment benefits.
Configurable Multilingual Modeling for Universal Language Recognition
The paper presents a novel approach to multilingual automatic speech recognition (ASR) by introducing the Configurable Multilingual Model (CMM). The CMM is designed to enable a single streaming end-to-end (E2E) ASR system to recognize speech in any user-selected subset of supported languages, contrasting strongly with both universal multilingual and per-language (or per-language-combination) models prevalent in industrial ASR deployments.
Key Contributions and Methodology
The authors address the limitations of both universally trained multilingual ASR models and models that require explicit language selection through a one-hot language-ID (LID) vector. The CMM is formulated to allow users to select an arbitrary combination of supported languages, configuring the model at inference to recognize any of the selected languages without retraining or running multiple per-language models in parallel.
The core of the CMM architecture builds on the Transformer Transducer (T-T), a streaming model with high industry suitability. The innovation in CMM involves partitioning the overall model into two principal components:
- Universal Module: A shared Transformer-based encoder modeling language-agnostic acoustic properties.
- Language-Specific Modules: Lightweight, language-specific linear layers modeling the "residual" aspects unique to individual languages.
Input representation includes a multi-hot language selection vector concatenated to the acoustic features, supporting arbitrary combinations of languages. The output at each encoder layer is a weighted sum of the universal module and the active language-specific modules, with the user-provided selection vector dictating the weights.
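To make this combination concrete, here is a minimal PyTorch sketch of how one such encoder layer might be structured. The class and names (`CMMEncoderLayer`, `lang_linears`, `selection`) are illustrative assumptions, not the paper's code, and the input-side concatenation of the selection vector to the acoustic features is omitted for brevity:

```python
import torch
import torch.nn as nn

class CMMEncoderLayer(nn.Module):
    """Sketch of one CMM encoder layer: a shared universal module plus
    lightweight per-language linear "residual" modules, combined according
    to a user-provided multi-hot language selection vector."""

    def __init__(self, d_model: int, num_langs: int, nhead: int = 8):
        super().__init__()
        # Universal module (stand-in for the shared Transformer block).
        self.universal = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # One lightweight linear layer per supported language.
        self.lang_linears = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_langs)]
        )

    def forward(self, x: torch.Tensor, selection: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); selection: (num_langs,) multi-hot vector.
        out = self.universal(x)
        # Normalize so the enabled languages share the residual weight.
        weights = selection / selection.sum().clamp(min=1.0)
        for i, linear in enumerate(self.lang_linears):
            if weights[i] > 0:
                out = out + weights[i] * linear(x)
        return out
```

A caller enabling, say, the first and third of four supported languages would pass `selection = torch.tensor([1., 0., 1., 0.])`.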
Vocabulary configuration is further refined by dynamically merging language-specific vocabularies at inference according to user selection, constraining decoding to expected output distributions.
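One plausible way to realize this constraint at decode time is to mask the output logits to the union of the selected languages' vocabularies. The sketch below is an assumption about the mechanism, not the paper's implementation; it presumes per-language boolean vocabulary masks and that the transducer blank token is present in every language's mask:

```python
import torch

def masked_log_probs(logits: torch.Tensor,
                     lang_vocab_masks: torch.Tensor,
                     selection: torch.Tensor) -> torch.Tensor:
    """Restrict decoding to the merged vocabulary of the selected languages.

    logits:           (..., vocab_size) raw output-layer scores.
    lang_vocab_masks: (num_langs, vocab_size) booleans, True where a token
                      belongs to that language's vocabulary (the transducer
                      blank is assumed present in every language's mask).
    selection:        (num_langs,) multi-hot language selection vector.
    """
    # Union of the vocabularies of all selected languages.
    merged = (selection.bool().unsqueeze(-1) & lang_vocab_masks).any(dim=0)
    # Tokens outside the merged vocabulary get -inf and drop out of decoding.
    return torch.log_softmax(logits.masked_fill(~merged, float("-inf")), dim=-1)
```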
Training Strategies and Implementation
Two training paradigms are offered:
- Training from Scratch: Both universal and language-specific modules (including embeddings, linear layers for the encoder and prediction networks, and vocabularies) are trained jointly, with random multi-hot language selection vectors simulated per batch (see the sketch after this list) to expose the model to all relevant language combinations.
- Fine-tuning: The universal module is initially trained as a general multilingual model without language-specific modules. Language-specific modules are then introduced and fine-tuned using the multi-hot selection strategy.
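A minimal sketch of such per-batch simulation follows; the function and argument names (`sample_selection_vector`, `max_extra`) are hypothetical, and the paper's exact sampling distribution is not reproduced here:

```python
import torch

def sample_selection_vector(num_langs: int, true_lang: int,
                            max_extra: int = 2) -> torch.Tensor:
    """Per-batch simulation of a multi-hot selection vector: the utterance's
    true language is always enabled, plus a random set of distractors."""
    selection = torch.zeros(num_langs)
    selection[true_lang] = 1.0
    num_extra = int(torch.randint(0, max_extra + 1, (1,)))
    # Sample distractor languages uniformly from the remaining ones.
    others = [i for i in range(num_langs) if i != true_lang]
    for j in torch.randperm(len(others))[:num_extra].tolist():
        selection[others[j]] = 1.0
    return selection
```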
The language-specific linear layers are included only in the top and bottom encoder layers to contain parameter growth, resulting in a 13% increase in total model size over a universal multilingual baseline, substantially less than naive mixture-of-experts approaches.
The implementation uses PyTorch, with large-scale training on 32 V100 GPUs over 75k hours of anonymized, transcribed Microsoft data covering 10 languages. Imbalance across languages is mitigated with data-sampling strategies.
Empirical Results
The CMM demonstrates substantial gains in word error rate (WER) over both universal multilingual and monolingual models. When users select one, two, or three languages, CMM delivers 26.0%, 16.9%, and 10.4% relative WER reductions, respectively, compared to the universal baseline. Notably, the performance degrades gracefully as more languages are enabled at inference, confirming that user selection effectively narrows the recognition scope and improves accuracy.
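For reference, relative WER reduction compares the error-rate delta to the baseline. The numbers below are hypothetical, chosen only to illustrate how the 26.0% figure is computed, not the paper's actual WERs:

```python
def relative_wer_reduction(wer_baseline: float, wer_model: float) -> float:
    """Relative WER reduction in percent: (baseline - model) / baseline * 100."""
    return (wer_baseline - wer_model) / wer_baseline * 100.0

# Hypothetical illustration: a baseline WER of 10.0% reduced to 7.4%
# corresponds to a 26.0% relative reduction.
print(relative_wer_reduction(10.0, 7.4))  # -> 26.0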
A further key finding concerns code-switching scenarios. Bilingual CMM configurations yield 4.8% and 16.3% relative WER reductions over the universal model on German-English and Spanish-English code-switching test sets, respectively, despite no explicit code-switching mechanisms being built into the model.
Ablation studies underline the critical role of the language-specific linear layers, particularly in the prediction network, and confirm that language-specific embeddings and vocabularies contribute to performance and user experience. Fine-tuning from a universal baseline gives a marginal but consistent WER improvement over full joint training.
Implications and Future Directions
Practically, the CMM offers dramatic efficiency improvements for ASR deployments:
- Scalability: A single model supports all user-selected language combinations, avoiding the exponential proliferation of per-combination models (up to 2^N - 1 for N supported languages).
- Resource Use: Only a minor increase in parameter count is required for full support, and inference efficiency is improved by vocabulary pruning.
- Deployment Flexibility: The model can be dynamically configured per user or device, making it ideal for applications serving multilingual populations or global platforms.
Theoretically, the approach demonstrates that language discrimination in speech can be handled via compact, modular residual adaptation over universal representations, aligning with contemporary insights from multi-task and modular learning.
Potential future developments include:
- Scaling to Dozens or Hundreds of Languages: Methods to improve training coverage for rare language combinations and balance across user populations.
- Dynamic Expert Routing: Learning data-driven gating functions rather than static user selection.
- Generalization to Other Multilingual Tasks: Applying configurable modularity to non-ASR tasks such as multilingual TTS or NLU.
- Model Compression and On-Device Deployment: Further reducing the overhead of the language-specific modules for resource-constrained environments.
Conclusion
The configurable multilingual modeling paradigm introduced here provides a compelling architecture for practical multilingual speech recognition systems, combining accuracy, scalability, and flexibility without the costs of model duplication or per-utterance language prediction. This configuration-based approach opens avenues for efficient deployment in heterogeneous, multilingual user bases and provides a framework extendable to a broad array of multilingual AI applications.