Two-Stage Phoneme-Centric Model (TSPC)

Updated 14 September 2025
  • Two-Stage Phoneme-Centric Model (TSPC) is an architecture that separates speech modeling into acoustic-to-phoneme and phoneme-to-text stages.
  • It enhances linguistic transparency and adaptability, enabling robust handling of code-switching and cross-lingual scenarios.
  • Empirical results show TSPC significantly reduces error rates in low-resource and mixed-language settings compared to end-to-end systems.

A Two-Stage Phoneme-Centric Model (TSPC) is an architecture that decomposes speech or sequence modeling tasks into two distinct stages, where phoneme-level representations serve as an explicit or latent intermediate. This paradigm is characterized by the separation of acoustic-to-phoneme (or visual-to-phoneme) transformation from phoneme-to-text (or further semantic) reconstruction, enabling modularity, linguistic transparency, and enhanced performance for tasks involving code-switching, cross-linguistic transfer, multimodal recognition, and low-resource domains.

1. Architectural Principles and Motivation

The TSPC framework consists of two principal modules: Stage 1 converts raw input (audio, visual, or other sensor streams) into a phoneme sequence using a dedicated sequence-to-sequence (Seq2Seq) or discriminative model. Stage 2 translates the phoneme sequence into target representations, such as text, words, or semantic content, often through another Seq2Seq architecture—frequently using advanced models like T5, NLLB, or Transformer-based text decoders.
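The two-stage composition can be sketched in a few lines of Python. The stage models below are toy stand-ins (a lookup table and a string join, both hypothetical), not the paper's trained Seq2Seq and T5 modules; the point is only the pipeline shape, with the phoneme sequence as an inspectable intermediate:

```python
# Minimal sketch of the TSPC pipeline shape. The stage models are toy
# stand-ins for illustration only; a real system would use a trained
# Seq2Seq acoustic model and a T5-style phoneme-to-text translator.

def stage1_speech_to_phonemes(audio_features):
    """Stage 1 (S2P): map acoustic features to a phoneme sequence.
    A toy lookup table stands in for a trained Seq2Seq model."""
    toy_acoustic_model = {0.1: "h", 0.2: "e", 0.3: "l", 0.4: "o"}
    return [toy_acoustic_model[round(f, 1)] for f in audio_features]

def stage2_phonemes_to_text(phonemes):
    """Stage 2 (P2T): translate the phoneme sequence into text.
    A toy join stands in for a T5-style text decoder."""
    return "".join(phonemes)

def tspc_pipeline(audio_features):
    """Compose the two stages. The intermediate phoneme sequence is
    explicit, so it can be inspected, adapted, or remapped."""
    phonemes = stage1_speech_to_phonemes(audio_features)
    return phonemes, stage2_phonemes_to_text(phonemes)

phonemes, text = tspc_pipeline([0.1, 0.2, 0.3, 0.3, 0.4])
```

Because the intermediate is explicit, a phoneme-adaptation step (Section 5) can be inserted between the two stages without retraining either one.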

This separation is motivated by several distinct advantages:

  • It enables explicit modeling of phonological structures, which are critical in diverse linguistic contexts where end-to-end mapping from signal to text is insufficient, such as code-switching (Vietnamese-English or other language pairs with intersecting phonologies).
  • Through the use of phoneme-centric intermediates, TSPC architectures can more robustly model subtle phonological shifts, reduce ambiguity, and facilitate language conversion, particularly in mixed-lingual scenarios (Nguyen et al., 7 Sep 2025).
  • The modular design allows for phoneme adaptation and better transfer learning (by leveraging unified or merged phoneme sets across languages), increasing generalizability in low-resource settings (Lee et al., 2023).

2. Phoneme-Centric Intermediate Representation

A defining feature of TSPC is the use of a language-specific or cross-lingual phoneme set as an intermediate layer. Typical designs include:

  • The adoption of an extended or customized phoneme inventory (e.g., modified Vietnamese syllable-based set, mapped to cover both Vietnamese and English IPA forms).
  • Conversion pipelines that map input words or speech to phoneme sequences, resolving ambiguous pronunciation (such as English “a” mapped to Vietnamese “Ấy”).
  • For code-switching scenarios, phoneme-level mapping facilitates the bridging of phonological inventories, reducing recognition errors caused by phonetically similar sounds in different languages (Nguyen et al., 7 Sep 2025).

This representation not only aids in cross-lingual modeling (by standardizing phonological forms) but also enhances the system’s ability to handle phoneme adaptation (via context-based conversion routines) and language conversion (by leveraging a phoneme-to-text translation step that incorporates both acoustic and linguistic priors).
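Such a conversion pipeline reduces, at its core, to a context-aware mapping from language-specific tokens into one shared phoneme inventory. The sketch below is illustrative only: the mapping entries (apart from the "a" → "Ấy" example quoted above) are hypothetical, and the real Vietnamese-English inventory is far larger:

```python
# Illustrative cross-lingual phoneme conversion. Entries are toy examples
# (only "a" -> "Ấy" comes from the text); a real inventory would cover
# the full merged Vietnamese-English phoneme set.

ENGLISH_TO_SHARED = {
    "a": ["Ấy"],                    # example from the text above
    "hello": ["h", "e", "l", "o"],  # hypothetical entry
}
VIETNAMESE_TO_SHARED = {
    "xin": ["x", "i", "n"],         # hypothetical entry
}

def to_shared_phonemes(token, language):
    """Map a token into the shared phoneme inventory; fall back to
    character-level phonemes for out-of-vocabulary tokens."""
    table = ENGLISH_TO_SHARED if language == "en" else VIETNAMESE_TO_SHARED
    return table.get(token, list(token))

# A code-switched utterance maps into one unified phoneme stream.
utterance = [("xin", "vi"), ("hello", "en"), ("a", "en")]
stream = [p for tok, lang in utterance for p in to_shared_phonemes(tok, lang)]
```

The single output stream is what makes code-switched input look like one language to the downstream P2T stage.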

3. Model Components and Training Strategies

TSPC architectures are instantiated using pre-trained and fine-tuned modules for each stage:

  • Stage 1 (Speech-to-Phone; S2P): Typically employs a Seq2Seq model (e.g., Transformer or PhoWhisper-based encoder-decoder stack) trained to output phoneme sequences from acoustic features. In complex cases, beam search is used to optimize phoneme output distributions with a trade-off between diversity and accuracy.
  • Stage 2 (Phone-to-Text; P2T): Utilizes a powerful text generation module such as a T5-based Transformer. Here, the encoder embeds phoneme sequences, and cross-attention mechanisms integrate acoustic knowledge from the S2P stage.

A technical detail underpinning performance is the use of cross-attention, wherein the S2P decoder’s hidden states directly inform the P2T encoder’s intermediate queries:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V

where Q is the query from the P2T encoder’s self-attention, K and V are derived from S2P decoder outputs, and d is the model dimension (Nguyen et al., 7 Sep 2025).
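The formula can be implemented directly in NumPy. This is a minimal single-head sketch (real Transformer stacks use learned projections and multiple heads); Q plays the role of the P2T encoder queries, K and V the S2P decoder hidden states:

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention as in the formula above.
    Q: (n_q, d) queries from the P2T encoder.
    K, V: (n_k, d) keys/values from S2P decoder hidden states."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n_q, n_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (n_q, d)

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 P2T queries, model dim d = 4
K = rng.standard_normal((5, 4))   # 5 S2P decoder states
V = rng.standard_normal((5, 4))
out = cross_attention(Q, K, V)
```

Each output row is a convex combination of S2P decoder states, which is how acoustic evidence flows into the text-generation stage.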

Training involves staged or joint fine-tuning. In some setups, the S2P and P2T modules are frozen during joint training to preserve linguistic priors and reduce noise propagation.
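The freeze/unfreeze/zero variants reported in the results below can be summarized as trainability flags per module. The exact module granularity is an assumption here (in particular, treating the cross-attention bridge as the only trainable part under "freeze" is an interpretation, not stated verbatim in the source):

```python
# Sketch of the three joint-training configurations, expressed as
# trainability flags. Module names and the "freeze" interpretation
# (only the cross-attention bridge adapts) are assumptions.

def training_config(mode):
    """Return which modules receive gradient updates during joint tuning."""
    if mode == "zero":        # no joint tuning: both stages used as trained
        return {"s2p": False, "p2t": False, "cross_attention": False}
    if mode == "freeze":      # stages frozen to preserve linguistic priors
        return {"s2p": False, "p2t": False, "cross_attention": True}
    if mode == "unfreeze":    # full joint fine-tuning of both stages
        return {"s2p": True, "p2t": True, "cross_attention": True}
    raise ValueError(f"unknown mode: {mode}")
```

In a framework like PyTorch these flags would translate into setting `requires_grad` on each module's parameters before joint training.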

4. Performance in Code-Switching and Low-Resource Settings

Empirical results demonstrate that TSPC architectures consistently outperform direct end-to-end ASR systems in code-switched and resource-limited contexts. For mixed Vietnamese-English ASR:

  • Whisper-base yields 59.45% WER on code-switched data.
  • Wav2vec2VN achieves 38.06% WER.
  • PhoWhisper-base improves to 27.9% WER.
  • TSPC variants further reduce error to 25.35% (Zero, no joint tuning), 21.34% (unfreeze), and 20.8% (freeze)—even when trained with fewer speech corpus hours (Nguyen et al., 7 Sep 2025).

For Vietnamese-only speech, where Whisper-base averages 74.83% WER, TSPC achieves 17.93%—highlighting its strength when adapted to single-language contexts with restricted resources.
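The WER figures above follow the standard definition: Levenshtein edit distance over word tokens, divided by reference length. A minimal reference implementation (the example sentence is invented for illustration):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) divided
    by reference length, via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1] / len(ref)

# One substitution ("chao" -> "chau") and one deletion ("world"): WER = 2/4.
score = wer("xin chao hello world", "xin chau hello")
```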

This suggests that the phoneme-centric intermediate acts as an information bottleneck, preserving linguistically relevant distinctions and facilitating error correction downstream.

5. Phoneme Adaptation and Language Conversion

The two-stage nature of TSPC enables mechanisms for:

  • Phoneme Adaptation: Mapping acoustically similar but lexically distinct pronunciations into a shared phoneme inventory—serving as a bridge for mixed-lingual modeling.
  • Language Conversion: The explicit P2T stage translates phoneme sequences into text, leveraging both statistical and linguistic constraints to resolve ambiguities—especially valuable when words from one language are pronounced according to another language’s phonological rules.

Context-sensitive adaptation (such as mapping English syllables to Vietnamese phoneme forms) and carefully designed conversion pipelines allow robust handling of cross-language accents, tones, and phonetic forms.

6. Technical Considerations and Limitations

TSPC implementations depend strongly on:

  • The quality and coverage of the phoneme set (for mixed or cross-language application, careful extension and mapping are required).
  • The efficacy of the cross-attention mechanisms connecting S2P and P2T modules.
  • Training strategies, including whether modules are frozen or fine-tuned jointly, and choices about beam search width.

A challenge arises in balancing beam size for phoneme output: wider beams can improve accuracy when the stages are trained separately, but may counterintuitively increase WER under joint fine-tuning, since the jointly tuned P2T stage adapts to the greedy (beam size 1) phoneme outputs it sees during training.
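To make the beam-width knob concrete, here is a toy beam search over per-step phoneme log-probabilities. It simplifies heavily: real S2P decoding conditions each step's distribution on the prefix, whereas this sketch assumes independent steps so the mechanics stay visible:

```python
import math

def beam_search(step_log_probs, beam_size):
    """Toy beam search. step_log_probs[t] maps each candidate phoneme to
    its log-probability at step t. Steps are assumed independent here;
    a real decoder conditions each step on the hypothesis prefix."""
    beams = [([], 0.0)]  # (phoneme sequence, cumulative log-prob)
    for dist in step_log_probs:
        expanded = [(seq + [ph], score + lp)
                    for seq, score in beams
                    for ph, lp in dist.items()]
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_size]  # keep only the top-k hypotheses
    return beams

steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"c": math.log(0.5), "d": math.log(0.5)},
]
greedy = beam_search(steps, beam_size=1)   # beam size 1 == greedy decoding
wide = beam_search(steps, beam_size=4)     # wider beam keeps alternatives
```

With beam size 1 only the locally best phoneme survives each step; a wider beam retains alternatives the P2T stage could exploit, but only if it was trained on such diverse inputs.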

A plausible implication is that model integration and parameter optimization require task-specific tuning, especially in complex code-switching environments.

7. Implications and Future Directions

The TSPC approach sets a precedent for modular, phoneme-centric modeling in ASR. Its success in code-switched Vietnamese-English ASR (Nguyen et al., 7 Sep 2025) and robust transfer in low-resource settings (Lee et al., 2023) suggests extensibility to other language pairs, especially those with overlapping or complex phonetic inventories.

Future work may explore dynamic phoneme conversion, adaptive beam size strategies, more generalized cross-attention integration, and broader applicability to dialectal, accented, or multilingual speech. By leveraging explicit intermediate linguistic representations and modular architecture, TSPC forms a foundation for next-generation ASR models capable of handling the intricate realities of global spoken language.
