- The paper presents a configurable multilingual ASR model that enables dynamic language selection without requiring multiple per-language models.
- It leverages a Transformer Transducer architecture with a universal encoder and lightweight language-specific modules to balance scalability and accuracy.
- Empirical results show up to a 26.0% relative WER reduction and improved code-switching performance, highlighting significant practical deployment benefits.
Configurable Multilingual Modeling for Universal Language Recognition
The paper presents a novel approach to multilingual automatic speech recognition (ASR) by introducing the Configurable Multilingual Model (CMM). The CMM is designed to enable a single streaming end-to-end (E2E) ASR system to recognize speech in any user-selected subset of supported languages, contrasting strongly with both universal multilingual and per-language (or per-language-combination) models prevalent in industrial ASR deployments.
Key Contributions and Methodology
The authors address the limitations of both universally trained multilingual ASR models and models that require explicit language selection through a one-hot language-ID (LID) vector. The CMM is formulated to allow users to select an arbitrary combination of supported languages, configuring the model at inference to recognize any of the selected languages without retraining or running multiple per-language models in parallel.
The core of the CMM architecture builds on the Transformer Transducer (T-T), a streaming model with high industry suitability. The innovation in CMM involves partitioning the overall model into two principal components:
- Universal Module: A shared Transformer-based encoder modeling language-agnostic acoustic properties.
- Language-Specific Modules: Lightweight, language-specific linear layers modeling the "residual" aspects unique to individual languages.
Input representation includes a multi-hot language selection vector concatenated to the acoustic features, supporting arbitrary combinations of languages. The output at each encoder layer is a weighted sum of the universal module and the active language-specific modules, with the user-provided selection vector dictating the weights.
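To make this combination concrete, here is a minimal PyTorch sketch of how one such encoder layer might be structured. The class and names (`CMMEncoderLayer`, `lang_linears`, `selection`) are illustrative assumptions, not the paper's code, and the input-side concatenation of the selection vector to the acoustic features is omitted for brevity:

```python
import torch
import torch.nn as nn

class CMMEncoderLayer(nn.Module):
    """Sketch of one CMM encoder layer: a shared universal module plus
    lightweight per-language linear "residual" modules, combined according
    to a user-provided multi-hot language selection vector."""

    def __init__(self, d_model: int, num_langs: int, nhead: int = 8):
        super().__init__()
        # Universal module (stand-in for the shared Transformer block).
        self.universal = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # One lightweight linear layer per supported language.
        self.lang_linears = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_langs)]
        )

    def forward(self, x: torch.Tensor, selection: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); selection: (num_langs,) multi-hot vector.
        out = self.universal(x)
        # Normalize so the enabled languages share the residual weight.
        weights = selection / selection.sum().clamp(min=1.0)
        for i, linear in enumerate(self.lang_linears):
            if weights[i] > 0:
                out = out + weights[i] * linear(x)
        return out
```

A caller enabling, say, the first and third of four supported languages would pass `selection = torch.tensor([1., 0., 1., 0.])`.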
Vocabulary configuration is further refined by dynamically merging language-specific vocabularies at inference according to user selection, constraining decoding to expected output distributions.
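One plausible way to realize this constraint at decode time is to mask the output logits to the union of the selected languages' vocabularies. The sketch below is an assumption about the mechanism, not the paper's implementation; it presumes per-language boolean vocabulary masks and that the transducer blank token is present in every language's mask:

```python
import torch

def masked_log_probs(logits: torch.Tensor,
                     lang_vocab_masks: torch.Tensor,
                     selection: torch.Tensor) -> torch.Tensor:
    """Restrict decoding to the merged vocabulary of the selected languages.

    logits:           (..., vocab_size) raw output-layer scores.
    lang_vocab_masks: (num_langs, vocab_size) booleans, True where a token
                      belongs to that language's vocabulary (the transducer
                      blank is assumed present in every language's mask).
    selection:        (num_langs,) multi-hot language selection vector.
    """
    # Union of the vocabularies of all selected languages.
    merged = (selection.bool().unsqueeze(-1) & lang_vocab_masks).any(dim=0)
    # Tokens outside the merged vocabulary get -inf and drop out of decoding.
    return torch.log_softmax(logits.masked_fill(~merged, float("-inf")), dim=-1)
```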
Training Strategies and Implementation
Two training paradigms are offered:
- Training from Scratch: Both universal and language-specific modules (including embeddings, linear layers for the encoder and prediction networks, and vocabularies) are trained jointly, with random multi-hot language selection vectors simulated per batch (see the sketch after this list) to expose the model to all relevant language combinations.
- Fine-tuning: The universal module is initially trained as a general multilingual model without language-specific modules. Language-specific modules are then introduced and fine-tuned using the multi-hot selection strategy.
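A minimal sketch of such per-batch simulation follows; the function and argument names (`sample_selection_vector`, `max_extra`) are hypothetical, and the paper's exact sampling distribution is not reproduced here:

```python
import torch

def sample_selection_vector(num_langs: int, true_lang: int,
                            max_extra: int = 2) -> torch.Tensor:
    """Per-batch simulation of a multi-hot selection vector: the utterance's
    true language is always enabled, plus a random set of distractors."""
    selection = torch.zeros(num_langs)
    selection[true_lang] = 1.0
    num_extra = int(torch.randint(0, max_extra + 1, (1,)))
    # Sample distractor languages uniformly from the remaining ones.
    others = [i for i in range(num_langs) if i != true_lang]
    for j in torch.randperm(len(others))[:num_extra].tolist():
        selection[others[j]] = 1.0
    return selection
```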
The language-specific linear layers are included only in the top and bottom encoder layers to contain parameter growth, resulting in a 13% increase in total model size over a universal multilingual baseline, substantially less than naive mixture-of-experts approaches.
The implementation uses PyTorch, with large-scale training on 32 V100 GPUs over 75k hours of anonymized, transcribed Microsoft data covering 10 languages. Imbalance across languages is mitigated with data-sampling strategies.
Empirical Results
The CMM demonstrates substantial gains in word error rate (WER) over both universal multilingual and monolingual models. When users select one, two, or three languages, CMM delivers 26.0%, 16.9%, and 10.4% relative WER reductions, respectively, compared to the universal baseline. Notably, the performance degrades gracefully as more languages are enabled at inference, confirming that user selection effectively narrows the recognition scope and improves accuracy.
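For reference, relative WER reduction compares the error-rate delta to the baseline. The numbers below are hypothetical, chosen only to illustrate how the 26.0% figure is computed, not the paper's actual WERs:

```python
def relative_wer_reduction(wer_baseline: float, wer_model: float) -> float:
    """Relative WER reduction in percent: (baseline - model) / baseline * 100."""
    return (wer_baseline - wer_model) / wer_baseline * 100.0

# Hypothetical illustration: a baseline WER of 10.0% reduced to 7.4%
# corresponds to a 26.0% relative reduction.
print(relative_wer_reduction(10.0, 7.4))  # -> 26.0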
A further key finding concerns code-switching scenarios. Bilingual CMM configurations yield 4.8% and 16.3% relative WER reductions over the universal model on German-English and Spanish-English code-switching test sets, respectively, despite no explicit code-switching mechanisms being built into the model.
Ablation studies underline the critical role of the language-specific linear layers, particularly in the prediction network, and confirm that language-specific embeddings and vocabularies contribute to performance and user experience. Fine-tuning from a universal baseline gives a marginal but consistent WER improvement over full joint training.
Implications and Future Directions
Practically, the CMM offers dramatic efficiency improvements for ASR deployments:
- Scalability: A single model supports all user-selected language combinations, avoiding the exponential proliferation of per-combination models (up to 2^N - 1 for N supported languages).
- Resource Use: Only a minor increase in parameter count is required for full support, and inference efficiency is improved by vocabulary pruning.
- Deployment Flexibility: The model can be dynamically configured per user or device, making it ideal for applications serving multilingual populations or global platforms.
Theoretically, the approach demonstrates that language discrimination in speech can be handled via compact, modular residual adaptation over universal representations, aligning with contemporary insights from multi-task and modular learning.
Potential future developments include:
- Scaling to Dozens or Hundreds of Languages: Methods to improve training coverage for rare language combinations and balance across user populations.
- Dynamic Expert Routing: Learning data-driven gating functions rather than static user selection.
- Generalization to Other Multilingual Tasks: Applying configurable modularity to non-ASR tasks such as multilingual TTS or NLU.
- Model Compression and On-Device Deployment: Further reducing the overhead of the language-specific modules for resource-constrained environments.
Conclusion
The configurable multilingual modeling paradigm introduced here provides a compelling architecture for practical multilingual speech recognition systems, combining accuracy, scalability, and flexibility without the costs of model duplication or per-utterance language prediction. This configuration-based approach opens avenues for efficient deployment in heterogeneous, multilingual user bases and provides a framework extendable to a broad array of multilingual AI applications.