
InstructTTSEval-Zh: Mandarin TTS Benchmark

Updated 17 January 2026
  • InstructTTSEval-Zh is a benchmark that assesses Chinese TTS systems’ ability to execute complex, free-form instructions with precise paralinguistic control.
  • It leverages 1,000 curated instruction–audio pairs partitioned into three tasks—APS, DSD, and RP—that test both low-level acoustic parameters and high-level contextual cues.
  • LLM-based evaluation reveals that while commercial systems excel in structured prompts, challenges remain in abstract role-play and dynamic emotional synthesis.

InstructTTSEval-Zh is the Chinese subset of the InstructTTSEval benchmark suite, designed to measure the ability of modern text-to-speech (TTS) systems to follow complex, free-form natural-language style instructions, with a particular focus on nuanced paralinguistic control and instruction adherence in Mandarin Chinese speech synthesis (Huang et al., 19 Jun 2025). It is characterized by a structured set of style-controlled tasks, robust automatic evaluation using large language models (LLMs) as judges, and coverage of both low-level acoustic parameters and high-level social or contextual instructions.

1. Task Taxonomy and Benchmark Design

InstructTTSEval-Zh comprises 1,000 Chinese instruction–reference-audio pairs drawn from expressive media sources, partitioned into three hierarchical tasks that target progressively more abstract dimensions of paralinguistic control:

  1. Acoustic-Parameter Specification (APS): APS tests the model’s precise control over twelve predefined acoustic features, spanning physiological (e.g., gender, pitch), linguistic (clarity, fluency, speed), social (accent, age, volume), and psychological/pragmatic (emotion, tone, personality) tiers. Each APS instruction is a free-form caption, generated by Gemini (gemini-2.5-pro), that matches all these attributes in the paired reference audio, enabling direct assessment of feature-level synthesis fidelity.
  2. Descriptive-Style Directive (DSD): DSD evaluates a system’s generalization from structured prompts to under-specified, unstructured ones. Instructions are derived by rewriting APS prompts with GPT-4o, randomly omitting 20–50% of the feature mentions to create varying degrees of underspecification. This requires models to infer omitted attributes holistically and map qualitative stylistic descriptions to appropriate prosodic patterns.
  3. Role-Play (RP): RP probes contextual and social reasoning, with prompts requiring the model to perform according to character roles or scene settings (e.g., “angry officer,” “anxious interviewee”). Instructions are generated via GPT-4o chain-of-thought prompts that abstract from explicit feature values to a scenario, requiring the TTS system to infer suitable paralinguistic realization.

All three tasks share the same 1,000 base audio segments, ensuring controlled ablation of explicitness and complexity across instruction types (Huang et al., 19 Jun 2025).
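
For orientation, one way to picture the resulting benchmark items is as a small record per (segment, task) pair. The sketch below is illustrative only; the field names and Task labels are assumptions rather than the released data format.

```python
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    APS = "acoustic-parameter specification"
    DSD = "descriptive-style directive"
    RP = "role-play"

@dataclass
class InstructTTSEvalItem:
    item_id: str          # shared across the APS/DSD/RP variants of one base segment
    task: Task            # which of the three instruction types this item belongs to
    text: str             # the sentence the TTS system must speak
    instruction: str      # free-form style instruction (caption, directive, or role prompt)
    reference_audio: str  # path to the expressive reference clip (single speaker, <= 30 s)

# Each of the 1,000 base segments yields one item per task, which is what
# allows explicitness and complexity to be ablated on identical audio.
```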

2. Data Collection and Annotation Pipeline

Source Material:

Half of the data is drawn from NCSSD (presumably a Chinese expressive speech dataset), and the remainder from Chinese movies, TV dramas, and variety shows. Audio is segmented and diarized into single-speaker clips of ≤30 seconds using pyannote.audio. Transcriptions are produced with whisper-large-v3, followed by punctuation restoration and quality filtering (DNSMOS ≥ 2.8, minimum segment duration and text length, and WhisperD checks for single-speaker validity).
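
To make this stage concrete, the following is a minimal sketch of the segmentation, transcription, and quality-filtering pipeline, assuming the pyannote.audio diarization pipeline and openai-whisper named above; `dnsmos_score` is a hypothetical callable standing in for the DNSMOS quality predictor, and the duration/length thresholds beyond DNSMOS ≥ 2.8 are illustrative.

```python
import tempfile

import soundfile as sf
import whisper
from pyannote.audio import Pipeline

# Loading the pretrained diarization pipeline may require a Hugging Face access token.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
asr = whisper.load_model("large-v3")

def clip_audio(wav_path: str, start: float, end: float) -> str:
    """Cut [start, end] seconds out of wav_path and return a temp-file path Whisper can read."""
    audio, sr = sf.read(wav_path)
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    sf.write(tmp.name, audio[int(start * sr):int(end * sr)], sr)
    return tmp.name

def extract_candidate_clips(wav_path, dnsmos_score, min_dur=1.0, max_dur=30.0, min_chars=5):
    """Return (start, end, text) tuples for single-speaker clips passing the quality checks."""
    clips = []
    for turn, _, _speaker in diarizer(wav_path).itertracks(yield_label=True):
        dur = turn.end - turn.start
        if not (min_dur <= dur <= max_dur):        # duration filter (clips of <= 30 s)
            continue
        clip_path = clip_audio(wav_path, turn.start, turn.end)
        if dnsmos_score(clip_path) < 2.8:          # DNSMOS quality filter
            continue
        text = asr.transcribe(clip_path, language="zh")["text"].strip()
        if len(text) < min_chars:                  # minimum text-length filter
            continue
        clips.append((turn.start, turn.end, text))
    return clips
```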

Expressiveness Filtering:

Segments are further filtered via the DVA toolkit, with only those exhibiting both Dominance >0.8 and Arousal >0.8 retained. Final statistics for the ZH subset are:

| Subset | # Segments | Duration (h) |
|--------|-----------:|-------------:|
| NCSSD  | 500        | 0.93         |
| Media  | 500        | 1.61         |
| Total  | 1,000      | 2.54         |
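
The expressiveness gate itself reduces to a simple predicate. In the sketch below, `predict_dva` is a hypothetical wrapper around the DVA toolkit's dimensional emotion predictor and is assumed to return scores normalized to [0, 1].

```python
def is_expressive(clip_path: str, predict_dva, threshold: float = 0.8) -> bool:
    """Keep only clips whose predicted Dominance and Arousal both exceed the threshold."""
    dominance, _valence, arousal = predict_dva(clip_path)  # assumed (D, V, A) order
    return dominance > threshold and arousal > threshold
```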

Instruction Construction:

  • APS: Gemini produces sentences naming all 12 features, including any temporal modulation.
  • DSD: GPT-4o rewrites these captions into style directives, randomly omitting features at three different dropout weights to produce directives with varying degrees of underspecification.
  • RP: GPT-4o uses scenario inference (chain-of-thought prompting) to abstract specific features into role- or scene-based prompts, explicitly omitting labels.

All instruction–audio pairs are thus backed by expressively annotated and contextually diverse reference material, with manual and LLM-based quality control ensuring robust coverage (Huang et al., 19 Jun 2025).
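
As an illustration of the DSD construction step, the sketch below drops a random 20–50% of the annotated feature mentions before they would be rewritten into an under-specified directive. The feature list shown and the `rewrite_as_directive` call are assumptions, not the authors' exact attribute set or prompts.

```python
import random

# Eleven of the twelve APS attributes are named in the text above; the full
# benchmark defines twelve, so this list is illustrative.
APS_FEATURES = [
    "gender", "pitch", "clarity", "fluency", "speed", "accent",
    "age", "volume", "emotion", "tone", "personality",
]

def make_dsd_features(aps_features: dict, drop_min=0.2, drop_max=0.5, seed=None) -> dict:
    """Randomly omit 20-50% of the APS feature mentions to under-specify the prompt."""
    rng = random.Random(seed)
    n_drop = round(len(aps_features) * rng.uniform(drop_min, drop_max))
    dropped = set(rng.sample(sorted(aps_features), n_drop))
    return {k: v for k, v in aps_features.items() if k not in dropped}

# The surviving feature-value pairs would then be handed to the rewriter model:
# directive = rewrite_as_directive(make_dsd_features(aps_annotation))  # hypothetical GPT-4o call
```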

3. Evaluation Protocol and Scoring

Automatic LLM-as-Judge:

For each instruction-synthesis pair, Gemini is used as an automatic judge to assess alignment: it produces a binary label (“true” if primary style attributes align, “false” otherwise), guided by a detailed rubric. This LLM-as-judge methodology enables scalable, replicable, and cost-efficient evaluation, and was empirically shown to match human rater agreement levels across all task types (APS: 88%, DSD: 80%, RP: 76%; overall: 81.3%) (Huang et al., 19 Jun 2025).
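
A minimal sketch of this judging loop follows; the rubric wording and the `call_gemini` wrapper are illustrative assumptions, not the benchmark's actual prompt or API usage.

```python
JUDGE_RUBRIC = (
    "You are given a style instruction and a synthesized speech clip. "
    "Answer 'true' if the clip's primary style attributes align with the "
    "instruction, and 'false' otherwise."
)  # paraphrase of the idea of the rubric, not its exact wording

def judge_alignment(instruction: str, synthesized_audio: str, call_gemini) -> bool:
    """Binary LLM-as-judge decision; call_gemini is a hypothetical multimodal Gemini wrapper."""
    reply = call_gemini(prompt=f"{JUDGE_RUBRIC}\n\nInstruction: {instruction}",
                        audio_path=synthesized_audio)
    return reply.strip().lower().startswith("true")
```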

Scoring Metric:

For each task, the style-adherence score is the macro-average of Gemini's binary judgments over the N = 1,000 items:

S_{\mathrm{task}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\text{Align}_i)

where \text{Align}_i = 1 if Gemini judges sample i as “true,” and 0 otherwise.
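
In code, the per-task score is just the mean of these binary judgments:

```python
def style_adherence(judgments: list[bool]) -> float:
    """Macro-average of the judge's binary alignment decisions over one task's items."""
    return sum(judgments) / len(judgments)

# e.g. style_adherence([True, False, True, True]) == 0.75
```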

4. Benchmark Results and Model Comparisons

Performance on InstructTTSEval-Zh is summarized below, covering both commercial (closed-source) and open-source TTS systems alongside the “oracle” score of the original human audio:

| Model           | APS (%) | DSD (%) | RP (%) | Avg (%) |
|-----------------|--------:|--------:|-------:|--------:|
| reference_audio | 90.9    | 86.7    | 69.8   | 82.5    |
| gemini-flash    | 88.2    | 90.9    | 77.3   | 85.4    |
| gemini-pro      | 89.0    | 90.1    | 75.5   | 84.8    |
| gpt-4o-mini-tts | 54.9    | 52.3    | 46.0   | 51.1    |
| VoxInstruct     | 47.5    | 52.3    | 42.6   | 47.5    |

Observations:

  • Closed-source Gemini models approach or exceed the oracle score on APS/DSD, but their RP scores are markedly lower, reflecting persistent difficulties in abstract social role inference.
  • GPT-4o-mini-tts shows sharp degradation for fine-grained acoustic control (APS) and contextual complexity (RP).
  • VoxInstruct (open-source) matches gpt-4o-mini-tts in DSD/RP but shows limited APS adherence, indicating difficulties with explicit low-level feature control.
  • Commercial TTS models often display non-native prosody in Mandarin Chinese, plausibly due to predominantly English training data (Huang et al., 19 Jun 2025).

5. Analysis of Strengths, Weaknesses, and Task Properties

Strengths:

  • Gemini-flash/pro robustly follow explicit APS-style feature lists (APS >88%).
  • DSD generalization is strong for commercial systems (~90%), indicating capacity to interpolate between seen style configurations.

Weaknesses:

  • Role-play remains challenging: even reference audio achieves only 69.8%, with the best TTS models achieving 77.3%.
  • Extreme paralinguistic phenomena (sighs, screams), dynamic emotion transitions, and singing are not adequately reproduced.
  • Open-source models display orthogonal strengths, able to imitate certain timbral qualities (e.g., child-like) yet failing in emotional nuance and scenario inference.

These findings highlight both the granularity of InstructTTSEval-Zh and the limits of current TTS models in modeling abstract, context-dependent expressive speech in Mandarin.

6. Methodological Insights and Recommendations

The InstructTTSEval-Zh analysis prompts several recommendations for advancing instruction-following TTS in Chinese (Huang et al., 19 Jun 2025):

  • Unified Timbre–Emotion Modeling: Existing systems often display trade-offs between flexible timbre control and emotional expressiveness. Joint modeling of speaker identity (timbre) and affective dynamics is needed for comprehensive style coverage.
  • Dynamic Event Synthesis: Paralinguistic events (e.g., laughter, screams) should be promoted to first-class outputs, demanding explicit annotation and generation benchmarks.
  • Enhanced Role Inference: Current bottlenecks in RP tasks reflect a gap in abstract social reasoning, which may be addressed by integrating semantic scenario understanding into the prosody-generation pipeline.
  • Chinese-Native Prosody Adaptation: Commercial TTS systems would benefit from fine-tuning on Mandarin corpora to restore native tonal contours and authentic rhythmic patterns, counteracting biases from English-centric pretraining.
  • Cost-Efficient Automatic Evaluation: While Gemini enables reliable evaluation, its resource demands point to a need for open-source multimodal judges that correlate strongly with human ratings and support faster research cycles.
  • Style Coverage Expansion: Addressing data imbalance by enriching underrepresented emotions and roles (e.g., comedic irony, calm urgency) would yield a more uniform and diagnostic benchmark.

7. Significance and Future Directions

InstructTTSEval-Zh provides a rigorous, automatic, multi-task framework for evaluating instruction-driven Mandarin TTS systems along both explicit acoustic and abstract expressive axes. The separation of APS, DSD, and RP tasks allows disentanglement of low-level and high-level style control, revealing distinct strengths and failure modes across system classes (Huang et al., 19 Jun 2025).

A plausible implication is that as TTS research migrates toward open-ended style and contextual control, benchmarks such as InstructTTSEval-Zh—paired with instruction fidelity metrics and robust LLM-as-judge scoring—will drive the methodological evolution required for genuinely expressive and human-like synthetic speech in Chinese.
