
Taiwanese Mandarin Spoken Language Model

Updated 24 April 2026
  • Taiwanese Mandarin SLM is a neural model that integrates modular ASR, tokenization, and diffusion-based synthesis for real-time, high-quality speech processing.
  • It employs advanced techniques like streaming unit segmentation, timestamp interleaving, and phonetic injection to effectively disambiguate Mandarin polyphones.
  • The model leverages large-scale hybrid datasets and multitask training to enhance conversational fluency, robustness, and end-to-end spoken dialogue control.

A Taiwanese Mandarin spoken language model (SLM) is a neural architecture designed to process, understand, and generate Taiwanese Mandarin speech, typically with explicit mechanisms for the language's phonotactic, lexical, and conversational properties. Recent end-to-end models support real-time, multi-turn, speech-to-speech interaction, robust phonetic control, and effective disambiguation of Mandarin polyphones. Such systems combine LLM backbones, specialized tokenization, and advanced acoustic modeling to achieve conversational fluency and high-fidelity synthesis in Taiwanese Mandarin contexts (Yang et al., 2024, Hsu et al., 29 Jan 2025).

1. System Architectures

Contemporary Taiwanese Mandarin SLMs use highly modular, streaming-friendly system designs. The "spml-omni" spoken-LLM pipeline (Yang et al., 2024) consists of:

  • Streaming ASR: Segments audio every 0.3 s, producing confirmed text tokens via distilled Whisper with VAD and hallucination filtering.
  • Streaming Speech Units: Discrete acoustic features (HuBERT units, 0.1 s chunks), providing rich subword information.
  • Interleaver: Aligns word-level ASR outputs and speech units by timestamps, generating a unified, interleaved token sequence.
  • Decoder-only Transformer: Initialized from LLaMA-3.1 8B, using prompt engineering for conversational and duplex control. No architectural modifications are applied; turn-taking uses a special ⟨EOT⟩ token.
  • Denoising Diffusion Decoder: Converts predicted discrete units or text to mel-spectrograms.
  • HiFi-GAN Vocoder: Final waveform synthesis for speech output.

A high-level block diagram is:

┌───────────────┐      ┌───────────────┐
│ Streaming ASR │      │  Streaming    │
│   (0.3 s)     │      │ Speech Units  │
└───────────────┘      └───────────────┘
         │                      │
         └──────────────┬───────┘
                        │
                ┌──────────────┐
                │ Interleaver  │
                └──────────────┘
                        │
                ┌──────────────┐
                │ Transformer  │
                └──────────────┘
                /                 \
      ┌─────────────┐      ┌─────────────┐
      │ Extract     │      │ Extract     │
      │ speech unit │      │ text token  │
      └─────────────┘      └─────────────┘
            │                    │
  ┌─────────────────┐    ┌─────────────┐
  │ Mel-spectrogram │    │ Text/Screen │
  │  Diffusion      │    │ Display     │
  └─────────────────┘    └─────────────┘
            │
     ┌──────────────┐
     │  HiFi-GAN    │
     └──────────────┘
            │
        waveform
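
The Interleaver step can be illustrated with a minimal sketch. It assumes that each ASR word and each speech unit carries a start timestamp and that the merged sequence is simply ordered by start time. The `Word` and `Unit` types, the `<unit_N>` token spelling, and the tie-breaking rule (text before unit) are hypothetical choices for illustration, not details given by the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # start time in seconds, from the streaming ASR


@dataclass
class Unit:
    unit_id: int
    start: float  # start time in seconds; one unit chunk per 0.1 s assumed


def interleave(words: List[Word], units: List[Unit]) -> List[str]:
    """Merge words and units into one token sequence ordered by start time.

    On equal timestamps the text token is placed before the unit token
    (an arbitrary, assumed tie-breaking rule).
    """
    events = [(w.start, 0, w.text) for w in words]
    events += [(u.start, 1, f"<unit_{u.unit_id}>") for u in units]
    events.sort(key=lambda e: (e[0], e[1]))
    return [token for _, _, token in events]


words = [Word("你好", 0.0), Word("嗎", 0.2)]
units = [Unit(17, 0.0), Unit(42, 0.1), Unit(8, 0.2)]
tokens = interleave(words, units)
print(tokens)
# ['你好', '<unit_17>', '<unit_42>', '嗎', '<unit_8>']
```

Sorting by timestamp keeps the sketch stateless; a streaming implementation would more likely merge two already-ordered queues incrementally.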

BreezyVoice (Hsu et al., 29 Jan 2025), a neural TTS system, advances tokenization and synthesis:

  • Supervised Semantic Speech (S³) Tokenizer: Converts raw speech to discrete semantic units by combining convolutional/transformer encoders with VQ, supervised by ASR loss.
  • LLM for Text-to-Unit Generation: Predicts unit sequences from phonetic-augmented text and speaker embeddings.
  • Optimal-Transport Flow-Matching Model (OT-CFM): Maps discrete unit sequences to mel-spectrograms, conditioned on speaker reference and context.
  • Grapheme-to-Phoneme Predictor (g2pW): State-of-the-art contextual disambiguation for Mandarin polyphones.

2. Data Preparation and Augmentation

End-to-end Taiwanese Mandarin SLMs leverage hybrid datasets:

  • Real Dialogues: ~40,000 h of segmented and transcribed multi-turn conversation, processed via diarization, source separation, and ASR correction.
  • Synthetic Dialogues: ~100,000 h generated using GPT-4o (scenario scripting, role-playing, interruptions), then synthesized via fine-tuned CosyVoice-300M TTS (Yang et al., 2024).
  • Token Sequence Construction: Hybrid input mixing text (ASR), discrete units (HuBERT/S³), and explicit phonetic information (g2pW). Alignment is achieved through timestamp-based interleaving.

In BreezyVoice (Hsu et al., 29 Jan 2025), training corpora comprise 300+ h of speech (including manually and LLM-derived labels), balanced for age, gender, region, and pitch. Text datasets include both general monologue (TCMD) and code-switching contexts (TCCSD).

3. Training Objectives and Modalities

SLMs employ multitask, multimodal training strategies:

  • Pre-training: Next-token cross-entropy with ASR (unit→text, unit→hybrid), TTS (text→unit, text→hybrid), and text-only objectives. For "spml-omni," hybrid modality provides sufficient cross-modal alignment; no explicit duration losses are needed (Yang et al., 2024).
  • Supervised Fine-tuning (SFT): Expands hybrid input/output mapping, adding spoken dialogue, unit-hybrid, and hybrid-hybrid tasks (~100k h).
  • Phonetic Augmentation: BreezyVoice up-weights polyphone loss terms during LLM and g2pW training, injecting disambiguating phonemes at the character level with masking to maintain generalization (Hsu et al., 29 Jan 2025).
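
One way to write the polyphone up-weighting is as a per-token weight on the next-token cross-entropy. The form below is a sketch under stated assumptions, not the loss given in the source: the weight symbol $w_t$, the index set $\mathcal{P}$ of polyphone positions, and the normalization by total weight are all introduced here for illustration. Only the idea of up-weighting polyphone terms by a factor $\lambda_{\mathrm{poly}} > 1$ comes from the text.

$$
\mathcal{L}(\theta) = -\frac{1}{\sum_{t=1}^{T} w_t} \sum_{t=1}^{T} w_t \,\log p_\theta(x_t \mid x_{<t}),
\qquad
w_t =
\begin{cases}
\lambda_{\mathrm{poly}} & \text{if } t \in \mathcal{P}, \\
1 & \text{otherwise.}
\end{cases}
$$

With $\lambda_{\mathrm{poly}} = 1$ this reduces to the ordinary mean next-token cross-entropy.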

Table: Example Pre-training Task Mix (spml-omni, 44k h)

| Task | Input | Output | Hours |
|------|-------|--------|-------|
| ASR  | unit  | text   | 6k    |
| TTS  | text  | unit   | 6k    |
| ...  | ...   | ...    | ...   |

Hybrid training enables unified handling of both spoken and textual modalities and supports prompt-based turn-taking, full-duplex control, and robust alignment of speech units and text tokens.
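
The multitask setup can be pictured as building different input/output pairs from the same aligned utterance. The sketch below is illustrative only: the task tags, the `<sep>` and `<eos>` markers, and the choice to mask prompt tokens out of the loss are assumptions, since the source does not specify the sequence format.

```python
from typing import Dict, List, Tuple

# Hypothetical task table: task name -> (input view, output view).
TASKS: Dict[str, Tuple[str, str]] = {
    "asr": ("unit", "text"),
    "asr_hybrid": ("unit", "hybrid"),
    "tts": ("text", "unit"),
    "tts_hybrid": ("text", "hybrid"),
}


def build_example(task: str, views: Dict[str, List[str]]) -> Dict[str, list]:
    """Concatenate a prompt and a target into one next-token training sequence.

    loss_mask is 1 on target tokens and 0 on prompt tokens (an assumed choice;
    the loss could equally be applied to every position).
    """
    src, tgt = TASKS[task]
    prompt = [f"<task:{task}>"] + views[src] + ["<sep>"]
    target = views[tgt] + ["<eos>"]
    return {
        "tokens": prompt + target,
        "loss_mask": [0] * len(prompt) + [1] * len(target),
    }


# Three views of the same utterance: text only, units only, and interleaved.
views = {
    "text": ["你好"],
    "unit": ["<unit_17>", "<unit_42>"],
    "hybrid": ["你好", "<unit_17>", "<unit_42>"],
}
example = build_example("asr", views)
print(example["tokens"])
# ['<task:asr>', '<unit_17>', '<unit_42>', '<sep>', '你好', '<eos>']
```

Because every task shares one sequence format, a single decoder-only model can be trained on all of them without task-specific heads.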

4. Phonetic Disambiguation and Polyphone Handling

Polyphone disambiguation is critical for Taiwanese Mandarin due to high ambiguity in character-to-phoneme mapping. BreezyVoice implements:

  • BERT-based g2pW Predictor: Produces context-sensitive phoneme predictions for polyphonic characters using conditional weighted softmax and POS/context features.
  • Phonetic Symbol Injection: Augments input sequences as $[y_1, p_1, y_2, p_2, \dots]$, allowing the LLM strong control over unit prediction.
  • Loss Margins for Polyphones: Training up-weights polyphonic tokens by a factor $\lambda_{\mathrm{poly}} = 2.0$ to emphasize correct disambiguation.
  • Empirical Results: On 23 hard polyphone instances, g2pW augmentation reduced error from 8/23 to 1/23 (Hsu et al., 29 Jan 2025).
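
The injection pattern above, where each character $y_i$ is followed by its phonetic symbol $p_i$, can be sketched as follows. This is a minimal illustration, not BreezyVoice's implementation: the function name, the bracketed token format, and the `keep_prob` parameter (a simple stand-in for the masking mentioned in the source) are assumptions, and the phoneme labels are supplied by hand here rather than predicted by g2pW.

```python
import random
from typing import List, Optional


def inject_phonetics(
    chars: List[str],
    phones: List[str],
    keep_prob: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Interleave characters with phonetic symbols: [y1, p1, y2, p2, ...].

    Each phonetic symbol is kept with probability keep_prob, so the model
    also sees inputs where some or all symbols are missing.
    """
    if len(chars) != len(phones):
        raise ValueError("chars and phones must have the same length")
    rng = rng or random.Random()
    out: List[str] = []
    for y, p in zip(chars, phones):
        out.append(y)
        if rng.random() < keep_prob:  # random() is in [0, 1)
            out.append(f"[{p}]")
    return out


# 行 is a polyphone (ㄏㄤˊ in 銀行, ㄒㄧㄥˊ in 行走); Zhuyin is used for illustration.
full = inject_phonetics(["銀", "行"], ["ㄧㄣˊ", "ㄏㄤˊ"])
plain = inject_phonetics(["銀", "行"], ["ㄧㄣˊ", "ㄏㄤˊ"], keep_prob=0.0)
print(full)   # ['銀', '[ㄧㄣˊ]', '行', '[ㄏㄤˊ]']
print(plain)  # ['銀', '行']
```

Training with `keep_prob` below 1 is one plausible way to keep the model usable when no phonetic hints are provided at inference time.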

A plausible implication is that explicit contextual-phonetic injection substantially increases model fidelity for rare and ambiguous character forms.

5. Real-time Interaction and Latency

Streaming and full-duplex capability are central:

  • ASR + Unit Segmentation: Streaming with windowed segmentation (ASR: 0.3 s; Units: 0.1 s).
  • Dynamic Chunking in Decoder: Approximates streaming by scheduling chunked synthesis to precede audio completion by a safety margin $\epsilon$ (50 ms typical).
  • Latency Measurements: "spml-omni" reports worst-case end-to-end delay of ≲2.5 s (optimized ≲1.9 s) for speech-to-speech interaction (Yang et al., 2024). Human conversational turn-taking averages ≲500 ms, indicating substantial room for further improvement via neural streaming decoders.
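
The dynamic chunking idea can be made concrete with a small scheduling sketch. It reads the safety margin as "each chunk must finish synthesizing at least ε before its playback is due"; that interpretation, the function name, and the per-chunk synthesis times are assumptions for illustration and are not taken from the source.

```python
from typing import List


def latest_synthesis_starts(
    durations: List[float],
    synth_times: List[float],
    eps: float = 0.05,
) -> List[float]:
    """Latest time (seconds) at which synthesis of each chunk may begin.

    Time 0 is the moment the first chunk starts playing. Chunk k is due when
    all earlier chunks have finished playing, and must be ready eps before
    that. Negative values mean synthesis must start before playback begins.
    """
    if len(durations) != len(synth_times):
        raise ValueError("durations and synth_times must have the same length")
    starts: List[float] = []
    due = 0.0  # playback start time of the current chunk
    for duration, synth in zip(durations, synth_times):
        starts.append(due - eps - synth)
        due += duration
    return starts


# Three 1.0 s audio chunks, each taking 0.25 s to synthesize, 50 ms margin.
plan = latest_synthesis_starts([1.0, 1.0, 1.0], [0.25, 0.25, 0.25])
print([round(t, 3) for t in plan])
```

In this reading, the first entry is the part that contributes directly to perceived response latency, since no audio can play until the first chunk is ready.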

6. Evaluation Protocols and Performance

Automatic and human-centric metrics quantify intelligibility, naturalness, and coherence:

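  • Automatic Metrics: Character error rate (CER) and an LLM-assigned score (LLM-Score), as reported for spml-omni in the results table below.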
  • Human Preference: In pairwise tests, BreezyVoice wins 100% vs Z, 73.3% vs Y, etc.
  • Code-switching Accuracy: BreezyVoice shows robust handling except in toponyms (e.g., 3/10 score).
  • Live and Agent Evaluations: "Forum" agent↔agent roleplay, live user up/down-vote ratings, and subjective response analysis are ongoing (Yang et al., 2024).

Table: Selected Dialogue Test Results (spml-omni; Yang et al., 2024)

| Model          | Modality | CER (%) | MOS  | LLM-Score |
|----------------|----------|---------|------|-----------|
| spml-omni-last | s2s      | 28.99   | 3.42 | 3.4       |
| spml-omni-last | u2s      | 32.80   | 3.51 | 1.8       |
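
For reference, CER is conventionally the character-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below shows that standard definition; it is not the evaluation script behind the numbers above, and any text normalization applied before scoring is not specified by the source.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One deleted character out of six reference characters.
score = cer("今天天氣很好", "今天天氣好")
print(round(score, 2))  # 16.67
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.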

7. Practical Challenges and Future Directions

Key challenges include:

  • Robustness to Long-Tail Speakers: Error analysis shows most synthesis errors originate from LLM unit prediction; "Iconic Unit Augmented Cloning" reduces PER on noisy speakers by 61.2% with minor speaker similarity loss (Hsu et al., 29 Jan 2025).
  • Polyphone and Rare-word Disambiguation: Explicit g2pW injection corrects the majority of ambiguous instances and rare characters.
  • Streaming Limitations: Decoders are not fully streaming; future architectures must address this gap relative to conversational human latency.
  • Ethical Concerns: Few-shot cloning incurs spoofing risks; anti-spoofing and consent protocols are essential in deployment (Hsu et al., 29 Jan 2025).
  • Limitations: Toponyms in code-switching contexts, optimal phonetic control integration in neural decoders, and meta-learning for long-tail speaker adaptation are identified as ongoing research targets.

Taken collectively, these efforts demonstrate the viability of end-to-end, duplex-capable, highly controllable spoken LLMs for Taiwanese Mandarin, setting benchmarks for fluency, intelligibility, and phonetic precision in both research and applied domains (Yang et al., 2024, Hsu et al., 29 Jan 2025).
