Multilingual ASR Using Discrete Tokens
- Multilingual ASR using discrete tokens is a paradigm that converts continuous speech signals into sequences of discrete units via clustering or vector quantization, enabling scalable and accent-robust recognition.
- The approach leverages self-supervised models like HuBERT and XLSR to transform acoustic features into discrete indices, significantly reducing storage and computation while improving cross-lingual adaptation.
- Empirical results demonstrate notable accuracy and efficiency gains, such as up to 41% relative WER reduction and substantial training time savings, making token-based systems competitive with traditional feature-based methods.
Multilingual speech recognition using discrete tokens is an emerging paradigm that leverages quantized representations derived from self-supervised learning models or well-defined acoustic transformations to enable efficient, scalable, and robust automatic speech recognition (ASR) across diverse languages. In this framework, continuous speech signals are transformed, via clustering, compression, or direct discretization of acoustic features, into sequences of discrete units that replace traditional continuous-valued inputs such as mel-filterbanks. This approach yields major reductions in storage and computation, improves the compatibility of speech features with text-based language modeling, and opens new avenues for cross-lingual adaptation, accent robustness, and modular integration with generative models.
1. Principles of Discrete Tokenization for ASR
Discrete tokenization in ASR refers to converting high-dimensional continuous features (e.g., SSL model outputs or mel-spectrogram vectors) into sequences of quantized indices via clustering or quantization. Seminal approaches utilize a pre-trained self-supervised encoder (e.g., WavLM, HuBERT, XLSR) to map speech utterances into hidden embeddings. These embeddings are then quantized using methods such as k-means clustering or residual vector quantization (RVQ), yielding frame-level tokens:
- For k-means: $q_t = \arg\min_{k} \lVert \mathbf{h}_t - \mathbf{c}_k \rVert_2$, where $\mathbf{h}_t$ is the SSL embedding at frame $t$ and $\{\mathbf{c}_k\}$ are the cluster centroids (Chang et al., 2023).
- For RVQ: at each stage $i$, $q_t^{(i)} = \arg\min_{k} \lVert \mathbf{r}_t^{(i-1)} - \mathbf{c}_k^{(i)} \rVert_2$ and $\mathbf{r}_t^{(i)} = \mathbf{r}_t^{(i-1)} - \mathbf{c}_{q_t^{(i)}}^{(i)}$, with $\mathbf{r}_t^{(0)} = \mathbf{h}_t$, so the quantized codes accumulated over multiple codebooks summarize each frame (Shechtman et al., 10 Oct 2024).
Discrete tokens can be further post-processed through deduplication (merging sequential identical tokens) and subword modeling (applying algorithms like SentencePiece or BPE) to decrease sequence length and increase linguistic consistency (Chang et al., 2023, Shon et al., 13 Jun 2024, Bai et al., 22 Jul 2024).
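A minimal sketch of this tokenization pipeline is shown below, assuming random vectors stand in for SSL encoder outputs and scikit-learn's k-means provides the codebook; function names such as `tokenize_utterance` are illustrative rather than drawn from the cited papers, and subword modeling (e.g., SentencePiece over the deduplicated token string) is omitted.

```python
# Minimal sketch of frame-level discrete tokenization: k-means over SSL
# embeddings, followed by deduplication of consecutive identical tokens.
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(embeddings: np.ndarray, n_clusters: int = 100) -> KMeans:
    """Fit a k-means codebook on pooled SSL frame embeddings (T_total x D)."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

def tokenize_utterance(codebook: KMeans, frames: np.ndarray) -> list[int]:
    """Assign each frame (T x D) of an utterance to its nearest centroid index."""
    return codebook.predict(frames).tolist()

def deduplicate(tokens: list[int]) -> list[int]:
    """Merge runs of identical consecutive tokens to shorten the sequence."""
    return [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for hidden states from a pre-trained SSL encoder (e.g., HuBERT, XLS-R).
    train_frames = rng.normal(size=(5000, 768)).astype(np.float32)
    utterance_frames = rng.normal(size=(300, 768)).astype(np.float32)

    codebook = fit_codebook(train_frames)
    tokens = tokenize_utterance(codebook, utterance_frames)
    print(len(tokens), len(deduplicate(tokens)))  # sequence length before / after dedup
```

Deduplication alone can shorten sequences noticeably, since SSL frame rates (e.g., 50 Hz) produce many repeated consecutive tokens within steady phonetic segments.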
2. Architectures and Representational Choices
Several prominent architectures utilize discrete tokens in multilingual ASR:
- Transformer-based Encoder–Decoder Systems: These ingest sequences of discrete tokens (potentially from multiple modalities: acoustic, semantic, transcript) enabling flexible information fusion via attention mechanisms and cross-attention layers (Erdogan et al., 2023, Shon et al., 13 Jun 2024).
- CTC/Attention Hybrid Models: These treat discrete token sequences as direct input for joint CTC and sequence modeling, combining alignment robustness with expressive decoding (Chang et al., 11 Jun 2024, Xue et al., 24 Jul 2025).
- Weighted-Sum Layer Selection: Multilingual SSL models (e.g., XLS-R) encode varied linguistic content in different layers. Weighted sums of layer outputs are optimized so representation selection is adaptive to language, narrowing the performance gap between discrete and continuous features (Li et al., 2 Sep 2025).
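As an illustration of the weighted-sum layer selection just described, the following hedged PyTorch sketch learns softmax-normalized weights over the stacked hidden layers of a frozen SSL encoder; the module name, layer count, and tensor shapes are assumptions for exposition, not the exact configuration of the cited systems.

```python
# Sketch: learnable weighted sum over SSL encoder layers (e.g., XLS-R hidden states).
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per encoder layer, softmax-normalized at use time.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, batch, time, dim) stacked hidden states.
        weights = torch.softmax(self.layer_logits, dim=0)           # (num_layers,)
        return torch.einsum("l,lbtd->btd", weights, layer_outputs)  # (batch, time, dim)

# Example with assumed sizes: 25 hidden layers, 2 utterances, 100 frames, 1024 dims.
mixer = LayerWeightedSum(num_layers=25)
hidden = torch.randn(25, 2, 100, 1024)
features = mixer(hidden)  # fused representation fed to quantization or the ASR frontend
```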
Discrete tokenization can be generated from:
- Semantic SSL models (HuBERT, WavLM, XLSR): Emphasize high-level linguistic information and cross-lingual generalization.
- Acoustic Compression models (EnCodec, RVQGAN): Focus on reconstructive fidelity and bandwidth efficiency; the low-pass frequency characteristics of RVQGAN facilitate robustness on narrowband speech (Puvvada et al., 2023, Shechtman et al., 10 Oct 2024).
3. Multilingual Adaptation and Language Generalization
The portability of discrete tokens across languages reflects both the universality of SSL models and the language agnosticism of token representations:
- Multilingual SSL models (e.g., XLS-R pretrained on 128 languages) generate discrete tokens that are effective across diverse linguistic domains (Li et al., 2 Sep 2025, Cui et al., 13 Sep 2024).
- Layer-wise weighted-sum representation enables adaptation of tokenization to language-specific content, addressing heterogeneity in multilingual systems (Li et al., 2 Sep 2025).
- Studies demonstrate that discrete tokens can achieve improved or comparable word/character error rates relative to traditional Fbank features in multilingual ASR, with substantial efficiency gains (e.g., training time reductions of 65%) (Cui et al., 13 Sep 2024).
- Token-based systems have potential to support universal modeling by using shared codebooks or tailored clustering for each language, although monolingual-specific tokenization may yield optimal results for some cases (Cui et al., 13 Sep 2024).
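The contrast between shared and language-specific tokenization can be made concrete with the short sketch below, which fits either one k-means codebook on pooled multilingual SSL embeddings or a separate codebook per language; the helper names and data layout are illustrative assumptions.

```python
# Sketch: shared vs. per-language k-means codebooks for discrete tokenization.
import numpy as np
from sklearn.cluster import KMeans

def fit_shared_codebook(embeddings_by_lang: dict[str, np.ndarray], k: int = 500) -> KMeans:
    """One codebook over pooled embeddings from all languages (universal tokens)."""
    pooled = np.concatenate(list(embeddings_by_lang.values()), axis=0)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)

def fit_language_codebooks(embeddings_by_lang: dict[str, np.ndarray], k: int = 500) -> dict[str, KMeans]:
    """One codebook per language (tailored clustering)."""
    return {lang: KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
            for lang, emb in embeddings_by_lang.items()}
```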
4. Performance, Efficiency, and Comparative Analysis
Performance metrics reported across multiple benchmarks show that:
- Discrete token systems achieve competitive WER/CER: e.g., XLSR-53 tokens provide up to 41% relative WER reduction over Fbank baselines for Polish test sets and consistent improvements across seven European languages (Cui et al., 13 Sep 2024).
- Storage and computational efficiency: Discrete tokens dramatically reduce data size (e.g., 1,000 hours of speech compressed from ∼100GB to <1GB), with per-epoch training times halved due to sequence length reduction (Chang et al., 2023, Chang et al., 11 Jun 2024).
- Bitrate efficiency is a cornerstone for practical deployment, quantified as $\text{bitrate} = f \cdot N \cdot \lceil \log_2 V \rceil$ bits per second, where $f$ is the token frame rate, $N$ the number of codebooks, and $V$ the codebook size (Chang et al., 11 Jun 2024); a worked example follows this list.
- In robust spoken language tasks (e.g., noisy ASR, speech separation), discrete tokens can fall slightly short of continuous feature pipelines, but their efficiency advantages are substantial (Wang et al., 25 Aug 2025).
- Hybrid or adaptive systems—combining both discrete and continuous features—could capture both semantic and phonetic nuances for more resilient and informative multilingual recognition (Wang et al., 25 Aug 2025).
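As a worked example of the bitrate expression above (with assumed values: 50 tokens per second, one codebook of 2,000 entries, and 16 kHz 16-bit mono PCM as the uncompressed reference):

```python
# Worked example: bitrate of a discrete-token stream vs. raw PCM audio.
import math

frame_rate = 50          # tokens per second (assumed SSL frame rate)
codebooks = 1            # number of parallel codebooks (1 for plain k-means tokens)
vocab_size = 2000        # codebook entries (assumed)

token_bps = frame_rate * codebooks * math.ceil(math.log2(vocab_size))  # 50 * 1 * 11 = 550 bps
pcm_bps = 16_000 * 16                                                   # 256,000 bps for 16 kHz, 16-bit mono

print(token_bps, pcm_bps, pcm_bps / token_bps)  # roughly a 465x reduction in this hypothetical setting
```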
5. Special Topics: Accent Robustness and Normalization
Discrete tokens also enable accent-robust and normalization systems:
- By training the clustering used for tokenization on native speech of a given language, discrete tokens encode perceptual biases akin to the human interlanguage speech intelligibility benefit (ISIB): just as listeners who share a speaker's native language understand that speaker's accented speech better, ASR accuracy on foreign-accented speech improves when tokenization is matched to the speaker's native language (Onda et al., 22 May 2025).
- Non-parallel pipelines using self-supervised tokens allow accent normalization, converting accented speech to native-like form while preserving speaker identity. Objective and subjective metrics (e.g., MUSHRA, KL divergence of token-phoneme distributions) confirm effective accent reduction and phonetic fidelity (Bai et al., 23 Jul 2025).
- Duration control via simple scaling and flow-matching predictors ensures accurate temporal alignment in downstream tasks such as dubbing and TTS (Bai et al., 23 Jul 2025).
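A minimal sketch of duration control by global scaling, the simpler mechanism mentioned above, appears below; it assumes per-token durations are given in frames and rescales them to a target utterance length, leaving flow-matching prediction aside as a separate modeling component.

```python
# Sketch: global duration scaling so a token sequence matches a target length.
def scale_durations(durations: list[int], target_frames: int) -> list[int]:
    """Rescale per-token durations (in frames) by a single ratio, keeping each >= 1."""
    ratio = target_frames / max(sum(durations), 1)
    scaled = [max(1, round(d * ratio)) for d in durations]
    # Push any rounding error onto the longest token so the total matches exactly.
    scaled[scaled.index(max(scaled))] += target_frames - sum(scaled)
    return scaled

print(scale_durations([3, 5, 2, 8], target_frames=24))  # e.g., [4, 7, 3, 10]
```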
6. Practical Applications and Future Directions
Applications of discrete token-based multilingual ASR include:
- Compact, bandwidth-efficient ASR for resource-constrained devices and low-bit-rate streaming (Puvvada et al., 2023, Shechtman et al., 10 Oct 2024).
- Cross-lingual and zero-shot speech recognition, leveraging universal or robust multilingual token representations (Cui et al., 13 Sep 2024, Li et al., 2 Sep 2025).
- Integration with LLMs: Instruction-following SpeechLLMs (e.g., DiscreteSLU) perform strongly when discrete speech tokens are mapped into the LLM token space, yielding robust instruction following and cross-language capabilities (Shon et al., 13 Jun 2024); a sketch of this mapping follows this list.
- Accent adaptation and normalization using solely native speech data, increasing accessibility for low-resource accents (Onda et al., 22 May 2025, Bai et al., 23 Jul 2025).
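To illustrate the general pattern of mapping discrete speech tokens into an LLM token space, the hedged sketch below offsets speech token IDs past the text vocabulary and extends the embedding table; this is a generic scheme for exposition, not the specific DiscreteSLU implementation, and all sizes are assumed.

```python
# Sketch: folding discrete speech tokens into an LLM vocabulary by ID offsetting.
import torch
import torch.nn as nn

text_vocab_size = 32_000    # assumed LLM text vocabulary size
speech_vocab_size = 1_000   # assumed number of discrete speech tokens
embed_dim = 512             # small width for the sketch; a real LLM is much larger

# Extended embedding table: text tokens keep their IDs, speech tokens are appended.
embedding = nn.Embedding(text_vocab_size + speech_vocab_size, embed_dim)

def speech_to_llm_ids(speech_tokens: list[int]) -> torch.Tensor:
    """Shift speech token IDs so they index the appended rows of the table."""
    return torch.tensor([text_vocab_size + t for t in speech_tokens])

mixed = torch.cat([torch.tensor([1, 42, 7]), speech_to_llm_ids([5, 5, 17])])
print(embedding(mixed).shape)  # (6, 512): one joint sequence of text + speech embeddings
```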
Ongoing research is addressing:
- Optimal tokenization strategies: Language-specific or cross-lingual clustering, fusion methods, hierarchical/semantic-acoustic hybrid token design.
- Closing the accuracy gap: Advanced data augmentation, learnable quantization, and multi-stage training to further approach or surpass continuous feature-based recognition (Li et al., 2 Sep 2025, Yang et al., 2023).
- Scalable joint modeling: Unified transformer architectures (e.g., dMel, RichASR) allow simultaneous ASR and TTS from a shared discrete token backbone, streamlining speech–text generation (Bai et al., 22 Jul 2024).
7. Limitations and Open Challenges
Despite numerous advantages, notable challenges persist:
- Cross-domain and noisy environment robustness: Discrete token models may exhibit increased error rates under adverse conditions compared to continuous feature pipelines; further work in robust quantization or adaptive training is required (Wang et al., 25 Aug 2025).
- Language adaptation and codebook design: While universal SSL models provide broad generalization, specific languages with distinct phonetic inventories may require tailored tokenization for optimal recognition (Cui et al., 13 Sep 2024, Li et al., 2 Sep 2025).
- Token under-training and efficiency: Sparse utilization of large codebooks (long-tailed token frequency) can indicate capacity inefficiency and warrants investigation into more compact token designs or training paradigms (Wang et al., 25 Aug 2025); a simple utilization check is sketched after this list.
- Layer selection and representation fusion: Systematic layer analysis is essential for maximizing token discriminability across languages with different acoustic and semantic characteristics (Li et al., 2 Sep 2025).
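A simple, hedged check for the codebook under-utilization issue noted above counts how often each token ID occurs in a tokenized corpus and reports coverage and normalized usage entropy; the function name and metrics are illustrative.

```python
# Sketch: measuring codebook utilization from a tokenized corpus.
import math
from collections import Counter

def codebook_utilization(token_stream: list[int], vocab_size: int) -> dict[str, float]:
    counts = Counter(token_stream)
    probs = [c / len(token_stream) for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return {
        "coverage": len(counts) / vocab_size,                    # fraction of entries ever used
        "normalized_entropy": entropy / math.log2(vocab_size),   # 1.0 = perfectly uniform usage
    }

print(codebook_utilization([0, 0, 1, 2, 2, 2, 5], vocab_size=8))
```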
In summary, discrete token-based multilingual speech recognition unifies advances in self-supervised speech modeling, quantized representation, and efficient NLP-style encoding. Extensive experimental evidence demonstrates that with appropriate model and tokenization choices, discrete tokens enable competitive or superior performance in ASR, especially as systems scale to large, diverse, and multilingual corpora, while offering significant practical advantages in storage, computation, and modular integration for future speech applications (Chang et al., 2023, Chang et al., 11 Jun 2024, Cui et al., 13 Sep 2024, Li et al., 2 Sep 2025).