
Wu Dialect Speech Processing Ecosystem

Updated 23 January 2026
  • The Wu dialect speech processing ecosystem is a comprehensive platform featuring an 8,000-hour corpus, benchmark suite, and state-of-the-art models for dialect speech AI.
  • It employs robust quality control, detailed annotations, and multi-task learning to support ASR, translation, speaker attributes, emotion prediction, TTS, and controllable instruction-following systems.
  • Empirical analyses show competitive performance across tasks, setting strong baselines for dialect and low-resource speech research while addressing sub-dialect imbalances.

The Wu dialect speech processing ecosystem comprises a unified, large-scale corpus, benchmark suite, and model release targeting inclusive speech technology for the linguistically important but historically underserved Wu dialect (a group of closely related Sinitic lects spoken by a large population in Eastern China). The core components are built around the open-source WenetSpeech-Wu dataset (≈8,000 hours of annotated audio), the WenetSpeech-Wu-Bench evaluation standards, and high-performing models covering ASR, Wu-to-Mandarin translation, speaker attribute and emotion prediction, text-to-speech (TTS), and controllable instruction-following TTS. The ecosystem provides an end-to-end pipeline, competitive baselines, and a reusable template for research in dialectal and low-resource speech AI, supporting both rigorous scientific benchmarking and practical deployment (Wang et al., 16 Jan 2026).

1. Large-Scale Wu Dialect Corpus Construction

The WenetSpeech-Wu corpus constitutes approximately 8,000 hours (3.86 million utterances) of Wu-dialect speech, spanning diverse domains such as news, culture, vlogs, entertainment, education, podcasts, commentary, interviews, radio drama, music programs, and audiobooks. Speech covers eight major Wu sub-dialects—Shanghainese, Suzhounese, Shaoxingnese, Ningbonese, Hangzhounese, Jiaxingnese, Taizhounese, and Wenzhounese—with 37% labeled “Unknown” where sub-dialect assignment is ambiguous.

Speaker-level annotations include 4,135 hours of male and 1,331 hours of female speech, along with age groups: teenagers (0–17, 372 h), youth (18–35, 1,673 h), middle-aged (36–59, 2,003 h), and elderly (60+, 1,418 h). Emotional tags were assigned via a cross-modal ensemble pipeline, annotating 5,102 h as neutral, 73 h as happy, 81 h as sad, 109 h as angry, and 101 h as surprised.

The annotation protocol leverages weighted-ROVER fusion of four complementary ASR systems to generate transcripts (confidence ≥ 0.55) and combines lexicon mapping with LLM-based refinement for Wu-to-Mandarin translations. Sub-dialect and domain labels come from metadata filtering. Speaker attributes (gender/age) are assigned via VoxProfile, and multi-speaker detection uses Pyannote. Prosodic features (speaking rate, loudness, energy, pitch) are extracted with Dataspeech, while audio quality is characterized by SNR (10–40 dB) and DNSMOS MOS (2.0–3.5).
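
To make the fusion step concrete, the sketch below shows confidence-weighted voting over hypotheses that are assumed to be already aligned token-by-token; the actual pipeline uses weighted ROVER alignment across the four ASR systems, and the function name, weighting scheme, and placement of the 0.55 keep-threshold here are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch: confidence-weighted voting over pre-aligned ASR hypotheses.
# The real pipeline performs weighted-ROVER alignment; this toy version assumes
# token slots are already aligned across the four systems.
from collections import defaultdict

def fuse_aligned_hypotheses(aligned_positions, system_weights, keep_threshold=0.55):
    """aligned_positions: list over token slots; each slot is a list of
    (system_id, token, confidence) triples from the recognizers.
    Returns (transcript, mean_confidence) or None if below keep_threshold."""
    fused_tokens, slot_confidences = [], []
    for slot in aligned_positions:
        votes = defaultdict(float)
        for system_id, token, conf in slot:
            votes[token] += system_weights[system_id] * conf
        best_token, best_score = max(votes.items(), key=lambda kv: kv[1])
        total_weight = sum(system_weights[s] for s, _, _ in slot) or 1.0
        fused_tokens.append(best_token)
        slot_confidences.append(best_score / total_weight)
    mean_conf = sum(slot_confidences) / max(len(slot_confidences), 1)
    if mean_conf < keep_threshold:
        return None  # discarded by the confidence >= 0.55 transcript filter
    return "".join(fused_tokens), mean_conf
```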

Data splitting introduces two quality tiers—“Mid” (for pre-training, conf > 0.6) and “High” (for supervised fine-tuning, conf > 0.85 and stricter mono-speaker/quality constraints)—with task-specific subsets for each end-use, as summarized:

Task           Duration    Quality filters
ASR-Mid        7,388 h     conf > 0.6, multi-speaker allowed
ASR-High       795 h       conf > 0.85, mono-speaker
AST            795 h       High tier, mono-speaker, superv. filtered
Spk-attr       2,986 h     mono-speaker
Emo            500 h       mono-speaker, SNR > 10 dB, pitch std > 50
TTS-Mid        7,388 h     conf > 0.6
TTS-High       1,500 h     conf > 0.65, mono-speaker, SNR > 10 dB, pitch std > 50
Instruct-Pro   679 h       conf > 0.7, mono-speaker, SNR > 30 dB
Instruct-Emo   161 h       expressive subset, strictly filtered

Quality control and well-documented JSON schemas are integral to the ecosystem's reusability and extensibility.
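
As an illustration of how the Mid/High tiers could be selected from such an annotated manifest, the sketch below filters a JSONL file on the confidence and speaker-count criteria listed above; the field names (`conf`, `num_speakers`, `wav`, `text`) are hypothetical stand-ins, since the released JSON schema is not reproduced here.

```python
# Illustrative tier selection over a JSONL manifest (field names are assumptions).
import json

def select_asr_tier(manifest_path, tier="high"):
    """Yield utterances matching the ASR-Mid (conf > 0.6) or ASR-High
    (conf > 0.85, single speaker) tiers described in the table above."""
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            utt = json.loads(line)  # e.g. {"wav": "...", "text": "...", "conf": 0.91,
                                    #       "num_speakers": 1, "snr_db": 23.4, ...}
            if tier == "mid" and utt["conf"] > 0.6:
                yield utt
            elif tier == "high" and utt["conf"] > 0.85 and utt.get("num_speakers", 1) == 1:
                yield utt
```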

2. Comprehensive Benchmark Suite: WenetSpeech-Wu-Bench

WenetSpeech-Wu-Bench comprises the first manually curated, task-specific public benchmarks for Wu dialect speech research. Test sets span six core tasks:

  • Automatic Speech Recognition (ASR): 4,851 utterances (9.75 h), Shanghainese/Suzhounese/Mandarin-mixed.
  • Wu-to-Mandarin Speech Translation (AST): 3,000 utterances (4.4 h), paired with expert Mandarin references.
  • Speaker Attribute Prediction: 3,000 utterances (balanced for gender and three age groups).
  • Emotion Recognition: 1,000 utterances, annotated for five categories (Neutral, Happy, Sad, Angry, Surprised).
  • Text-to-Speech (TTS): 144 easy and 98 hard utterances targeting 12 speakers, evaluated for both objective and subjective attributes.
  • Instruction-following TTS (Instruct-TTS):
    • Prosody control: 5 prompts × 20 variations, evaluated for adherence to pitch/rate instructions.
    • Emotion control: 10 prompts × 50 samples per emotion, using automated classifiers and subjective mean opinion scores (MOS).

Metrics are standardized for each task. ASR is evaluated via character error rate (CER), AST by BLEU, speaker and emotion classification by accuracy, and TTS by intelligibility (CER, IMOS), speaker similarity (cosine SIM, SMOS), and accent (AMOS). Instruct-TTS adds automated and subjective measures of prosody (PMOS, accuracy) and emotion control (EMOS).
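
A minimal reference implementation of the core intelligibility metric, character error rate, is sketched below as Levenshtein distance over characters normalized by the reference length; this is a generic CER routine, not the benchmark's official scoring script.

```python
# Character error rate (CER): character-level edit distance / reference length.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: cer("侬好上海", "侬好上海呀") == 0.25 (one insertion over four reference characters)
```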

3. Model Architectures and Training Frameworks

Three primary ASR architectures are released:

  • Conformer-U2pp-Wu: Encoder-decoder Conformer (12 blocks, dimension 512, 8 heads, 123M parameters) trained from scratch within the WeNet framework.
  • Whisper-Medium-Wu: 24-layer Transformer encoder-decoder (model dimension 1024, 16 heads, 769M parameters), fine-tuned from Whisper-Medium.
  • Step-Audio2-Wu-ASR: Step-Audio2-mini (12 transformer blocks, 512-dim, 7B parameters) with LoRA adaptation (+10M parameters).
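
The LoRA adaptation used for Step-Audio2-Wu-ASR keeps the large backbone frozen and trains only low-rank adapters (≈10M parameters). The toy module below illustrates the idea on a single linear projection; it is a sketch of the general LoRA technique, not the released adaptation code, and the rank and alpha values are illustrative.

```python
# Toy low-rank adapter (LoRA) wrapped around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # frozen projection + trainable low-rank update
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Usage: wrap attention projections of the frozen backbone and train only the
# LoRA parameters, e.g. proj = LoRALinear(nn.Linear(512, 512)).
```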

The ASR objective is a weighted sum of CTC and cross-entropy,

\mathcal{L}_{ASR} = \lambda\,\mathcal{L}_{CTC} + (1-\lambda)\,\mathcal{L}_{CE}.
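
A minimal PyTorch rendering of this joint objective, assuming standard CTC and attention-decoder branches (tensor shapes and the λ value are illustrative), might look like:

```python
# Weighted sum of CTC (encoder branch) and cross-entropy (attention-decoder branch).
import torch
import torch.nn.functional as F

def asr_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths,
             decoder_logits, decoder_targets, lam=0.3):
    # CTC branch: log_probs shaped (T, batch, vocab), already log-softmaxed
    l_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths,
                       blank=0, zero_infinity=True)
    # Attention branch: logits shaped (batch, U, vocab), targets shaped (batch, U)
    l_ce = F.cross_entropy(decoder_logits.transpose(1, 2), decoder_targets,
                           ignore_index=-1)
    return lam * l_ctc + (1.0 - lam) * l_ce
```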

A unified speech understanding model, Step-Audio2-Wu-Und, is built on the Step-Audio2-mini encoder with multi-task heads for ASR, AST, speaker, age, and emotion classification. Pre-training is performed on Mid and High-tier subsets, followed by joint supervised fine-tuning.
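
The multi-task layout can be pictured as a shared encoder feeding token-level ASR/AST heads and pooled utterance-level classifiers. The module below is illustrative only (dimensions, vocabulary size, and the mean-pooling choice are assumptions), not the released architecture.

```python
# Illustrative multi-task heads on top of a shared speech encoder.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, d_model=512, vocab_size=6000):
        super().__init__()
        self.asr_head = nn.Linear(d_model, vocab_size)   # Wu transcript tokens
        self.ast_head = nn.Linear(d_model, vocab_size)   # Mandarin translation tokens
        self.gender_head = nn.Linear(d_model, 2)
        self.age_head = nn.Linear(d_model, 4)            # teen / youth / middle-aged / elderly
        self.emotion_head = nn.Linear(d_model, 5)        # neutral/happy/sad/angry/surprised

    def forward(self, encoder_states):                   # (batch, T, d_model)
        pooled = encoder_states.mean(dim=1)              # utterance-level pooling
        return {
            "asr": self.asr_head(encoder_states),
            "ast": self.ast_head(encoder_states),
            "gender": self.gender_head(pooled),
            "age": self.age_head(pooled),
            "emotion": self.emotion_head(pooled),
        }
```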

TTS is handled by CosyVoice2 (streaming Transformer-TTS, 12 encoder and decoder layers, 30M parameters) with LLM-based prosody control. Training comprises continued pre-training (CPT) on TTS-Mid for 10 epochs, supervised fine-tuning (SFT) on TTS-High for 3 epochs, and speaker-specific (SS) fine-tuning (10 h per speaker). The loss combines an L1 waveform term and MSE terms for duration, pitch, and energy:

\mathcal{L}_{TTS} = \|y-\hat y\|_1 + \alpha\,\|d-\hat d\|_2^2 + \beta\,\|p-\hat p\|_2^2 + \gamma\,\|e-\hat e\|_2^2.
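
A hedged PyTorch sketch of this composite loss, with placeholder weighting coefficients, is:

```python
# L1 reconstruction term plus MSE terms for duration, pitch, and energy.
import torch
import torch.nn.functional as F

def tts_loss(wav, wav_hat, dur, dur_hat, pitch, pitch_hat, energy, energy_hat,
             alpha=1.0, beta=0.1, gamma=0.1):
    l_rec = F.l1_loss(wav_hat, wav)
    l_dur = F.mse_loss(dur_hat, dur)
    l_pitch = F.mse_loss(pitch_hat, pitch)
    l_energy = F.mse_loss(energy_hat, energy)
    return l_rec + alpha * l_dur + beta * l_pitch + gamma * l_energy
```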

Instruction-following TTS extends CosyVoice2 through further fine-tuning on tailored prompt-annotated data for prosody and emotion.

Training employs large-batch schedules (up to 60k frames) with learning-rate warmup and task-appropriate learning rates. Recipes and checkpoints are open-sourced.
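
As an example of the warmup schedules mentioned above, the sketch below implements a generic linear-warmup / inverse-square-root-decay schedule of the kind commonly used for Conformer and Transformer ASR training; the warmup length is illustrative, not a value from the paper.

```python
# Linear warmup followed by inverse-square-root decay (Noam-style schedule).
import torch

def warmup_inverse_sqrt(optimizer, warmup_steps=25000):
    def lr_lambda(step):
        step = max(step, 1)
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage (illustrative):
# opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
# sched = warmup_inverse_sqrt(opt)
# call sched.step() after each optimizer step
```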

4. Empirical Findings and Analysis

The released models set strong benchmarks across all tasks:

  • ASR: Step-Audio2-Wu-ASR achieves a CER of 12.85%, outperforming Conformer-U2pp-Wu (15.14%) and Whisper-Medium-Wu (14.33%); external baselines (e.g., Qwen3-ASR) fall in the high-20% to 30% CER range.
  • Unified Speech Understanding: Step-Audio2-Wu-Und delivers CER 13.23%, BLEU 53.13, gender accuracy 95.6%, age 72.9%, and emotion 71.2%, versus Qwen3-Omni's CER 44.27%, BLEU 33.31, gender 97.7%, age 54.1%, and emotion 66.7%, i.e., better on every task except gender classification.
  • TTS: CosyVoice2-Wu-SS achieves CER (easy/hard) 5.42%/15.45%, IMOS 4.37/4.04, AMOS 4.21/3.88, competitive with Qwen3-TTS.
  • Instruction-following TTS: Notable gains are realized in both accuracy and MOS: pitch/rate prosody control improves from ≈0.25 to ≈0.80, PMOS rises from 2.13 to 3.68, emotion-control accuracy (e.g., 'Happy') from 0.87 to 0.94, EMOS from 3.66 to 3.83.

Error analysis identifies primary bottlenecks in ASR from under-represented sub-dialects and tone-sandhi effects. Multi-task learning reduces confusion in speaker and emotion tasks by exploiting shared representations.

5. Integration, Reusability, and Future Directions

The ecosystem’s open release comprises all 8,000 hours of audio with rich annotations, WenetSpeech-Wu-Bench, model checkpoints, and scalable training recipes, all under Apache-2.0 (GitHub: ASLP-lab/WenetSpeech-Wu-Repo). Data stratification (Mid/High), quality grading, and the JSON schema establish a transportable platform for dialectal AI research.

Key avenues for further advancement include:

  • Manual or crowdsourced transcript curation to enhance ground-truth quality
  • Targeted collection to improve sub-dialectal balance, e.g., Jiaxingnese, Taizhounese
  • Extending the pipeline and benchmark design to additional Chinese (or non-Chinese) dialects
  • Leveraging parallel Mandarin resources for cross-dialect transfer

Collectively, WenetSpeech-Wu and its associated resources constitute the first comprehensive, open, and systematically benchmarked ecosystem for Wu dialect speech technology, providing baselines, evaluation standards, and an extensible template for broader dialectal and low-resource speech AI research (Wang et al., 16 Jan 2026).
