OV-InstructTTS: Open-Vocabulary TTS
- OV-InstructTTS is an open-vocabulary, instruction-driven TTS system that employs a reasoning chain to infer emotional, acoustic, and paralinguistic attributes from natural language instructions.
- It integrates a robust dataset with multi-stage annotation and enriched transcripts, enabling the synthesis of expressive, contextually-matched speech.
- The modular framework and rigorous evaluation metrics push the boundaries of achieving nuanced and narrative-driven speech synthesis.
OV-InstructTTS is a paradigm for open-vocabulary instruction-driven text-to-speech synthesis, designed to generate speech that faithfully follows complex, high-level natural language instructions rather than rephrased acoustic labels. Central to its architecture is a reasoning-driven framework that explicitly infers emotional, acoustic, and paralinguistic attributes from flexible user instructions, then synthesizes expressive, contextually-matched speech. This marks a significant departure from prior InstructTTS approaches that relied heavily on restrictive label mappings and limited instruction comprehension. The OV-Speech dataset, developed for OV-InstructTTS, includes rich annotation pipelines, reasoning chains, and paralinguistic event tagging, enabling robust training and evaluation in open-vocabulary scenarios (Ren et al., 4 Jan 2026).
1. Task Definition and Motivation
OV-InstructTTS addresses the challenge of synthesizing speech $X$ that matches both the textual content $T$ and a natural language instruction $I$. The objective is formally

$$X = \mathcal{G}(T, I, S),$$

where $S$ denotes the target speaker identity. Unlike previous systems (e.g., PromptTTS, VoxInstruct, InstructAudio), which primarily map instructions to a limited set of acoustic labels (e.g., pitch, rate, basic emotion), OV-InstructTTS targets open-vocabulary, situational, and narrative instructions (e.g., “Read this line as if you’re narrating a suspenseful chase through a rain-soaked alley”). This approach directly supports creative production scenarios, where content creators require intuitive, story-driven control over speech synthesis (Ren et al., 4 Jan 2026).
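To make the interface concrete, here is a minimal sketch of what an open-vocabulary instruct-TTS call looks like from the user's side; the `OVInstructTTS` class and method names are hypothetical placeholders for illustration, not an API from the paper.

```python
from dataclasses import dataclass

@dataclass
class SynthesisRequest:
    text: str         # target transcript T
    instruction: str  # open-vocabulary, director-style instruction I
    speaker_id: str   # target speaker identity S

class OVInstructTTS:
    """Hypothetical wrapper around an OV-InstructTTS-style model."""

    def synthesize(self, req: SynthesisRequest) -> bytes:
        # Placeholder: a real implementation would run reasoning, transcript
        # enrichment, audio-token generation, and vocoding to produce X.
        raise NotImplementedError

request = SynthesisRequest(
    text="He slipped through the doorway without a sound.",
    instruction="Read this line as if you're narrating a suspenseful chase "
                "through a rain-soaked alley.",
    speaker_id="spk_012",
)
```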
2. OV-Speech: Dataset Construction and Annotation Pipeline
OV-Speech was constructed atop ContextSpeech (476.8 hours, 86 speakers, novel-aligned utterances) with a comprehensive multi-stage pipeline, sketched schematically after the list:
- Contextual Extraction: For each utterance, Qwen3-32B extracts five contextual elements (environment, current event, speaker personality, interlocutor state, speaker intent) from surrounding text.
- Instruction Generation: Using randomly sampled context elements, Qwen3-32B produces “director-style” open-vocabulary instructions, yielding three instructions per utterance (950K (I,T,X) pairs for training).
- Consistency Filtering: Deepseek-R1 and Qwen3-32B validate emotional and acoustic consistency, discarding pairs with insufficient annotation match.
- Reasoning Process Annotation: For each retained sample, Qwen3-32B generates a two-stage reasoning chain: (1) instruction deconstruction (mapping the instruction to context), (2) attribute inference (emotional label $e$, acoustic descriptors $a$, paralinguistic events $p$).
- Paralinguistic Event Tagging: Qwen2-Audio-7B (fine-tuned on NVSpeech170k) detects 18 paralinguistic events (e.g., [Breathing], [Laughter]), while PC-PTI enables highly accurate transcript integrity (0.35% S-CER) (Ren et al., 4 Jan 2026).
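The stages above can be summarized as pseudocode. The sketch below is purely schematic: the helper functions are stubs standing in for the LLM and audio-LM calls named above (Qwen3-32B, Deepseek-R1, Qwen2-Audio-7B) and return placeholder values, so only the control flow reflects the pipeline.

```python
import random

CONTEXT_KEYS = ["environment", "current_event", "speaker_personality",
                "interlocutor_state", "speaker_intent"]

def extract_context(surrounding_text: str) -> dict:
    # Stub for Qwen3-32B contextual extraction of the five context elements.
    return {key: f"<{key} inferred from surrounding text>" for key in CONTEXT_KEYS}

def generate_instruction(context_subset: dict) -> str:
    # Stub for Qwen3-32B director-style instruction generation.
    return "Read this line as if ... (uses: " + ", ".join(context_subset) + ")"

def is_consistent(instruction: str, transcript: str) -> bool:
    # Stub for Deepseek-R1 / Qwen3-32B emotional and acoustic consistency filtering.
    return True

def annotate_reasoning(instruction: str, context: dict) -> dict:
    # Stub for the two-stage chain: instruction deconstruction, then attribute inference.
    return {"emotion": "neutral", "acoustics": ["slow rate"], "paralinguistic": []}

def tag_paralinguistic_events(audio_path: str) -> list:
    # Stub for Qwen2-Audio-7B detection of the 18 paralinguistic event types.
    return []

def annotate_utterance(transcript: str, audio_path: str, surrounding_text: str):
    context = extract_context(surrounding_text)
    candidates = []
    for _ in range(3):  # three instructions per utterance
        subset = dict(random.sample(sorted(context.items()), k=3))
        candidates.append(generate_instruction(subset))
    kept = [inst for inst in candidates if is_consistent(inst, transcript)]
    reasoning = [annotate_reasoning(inst, context) for inst in kept]
    events = tag_paralinguistic_events(audio_path)
    return kept, reasoning, events
```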
The final dataset statistics are:
| Parameter | Train | Test | Notes |
|---|---|---|---|
| Utterances | 316,807 | 1,500 | Test from held-out novels, unseen speakers |
| Instructions/utt. | 3 | 3 | Avg. length 15 words |
| Emotional categories | — | 7 | neutral, joyful, sad, angry, fear, surprise, contempt |
| Acoustic descriptors | — | 8 | e.g. low pitch, slow rate, staccato rhythm |
| Paralinguistic tags | — | 18 | Event detection via Qwen2-Audio-7B |
3. Reasoning-Driven Framework Architecture
The OV-InstructTTS-TEP (“text + enriched transcript + prosody,” Editor's term) framework comprises the following cascaded components:
- Instruction Encoder & Reasoning Module: Takes $(I, T)$ as input and produces a reasoning chain $R$ that identifies context-invoked elements and infers explicit attributes $(e, a, p)$.
- Interleaved Generator: Enriches the transcript by appending emotion and paralinguistic tags: $T' = \mathrm{Enrich}(T, e, p)$.
- TTS Generation: Synthesizes discrete audio tokens $A$ conditioned on $(T', a, S)$.
- Neural Vocoder: HiFiGAN converts the generated tokens to the waveform $X$.
The formal reasoning and synthesis steps are:

$$R = \mathcal{R}(I, T), \qquad (e, a, p) = \mathrm{Parse}(R), \qquad T' = \mathrm{Enrich}(T, e, p), \qquad A = \mathcal{A}(T', a, S), \qquad X = \mathrm{Voc}(A).$$
This explicit intermediate reasoning process enables the mapping from high-level instruction to low-level acoustic features, supporting both direct and indirect user goals (Ren et al., 4 Jan 2026).
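To make the enriched-transcript step concrete, the sketch below shows one plausible way to interleave an emotion label and paralinguistic event tags into the transcript before audio-token generation; the `<emotion>` prefix format, the `(word_index, tag)` representation, and the insertion positions are illustrative assumptions rather than the paper's exact tagging scheme.

```python
from dataclasses import dataclass, field

@dataclass
class InferredAttributes:
    emotion: str                                         # e, e.g. "fear"
    acoustics: list = field(default_factory=list)        # a, e.g. ["low pitch", "slow rate"]
    paralinguistic: list = field(default_factory=list)   # p, as (word_index, tag) pairs

def enrich_transcript(text: str, attrs: InferredAttributes) -> str:
    """Build the enriched transcript T' by prefixing the emotion label and
    interleaving paralinguistic event tags at the given word positions."""
    words = text.split()
    # Insert from right to left so earlier insertions do not shift later indices.
    for index, tag in sorted(attrs.paralinguistic, reverse=True):
        words.insert(min(index, len(words)), tag)
    return f"<{attrs.emotion}> " + " ".join(words)

attrs = InferredAttributes(
    emotion="fear",
    acoustics=["low pitch", "slow rate"],
    paralinguistic=[(2, "[Breathing]")],
)
print(enrich_transcript("He slipped through the doorway without a sound.", attrs))
# -> <fear> He slipped [Breathing] through the doorway without a sound.
```

Note that in this sketch the acoustic descriptors $a$ are deliberately kept out of the text: they condition the TTS generator directly, mirroring the module split described above.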
4. Training Objectives and Loss Functions
Multi-task supervised fine-tuning (SFT) is performed on OV-Speech with the following composite objective:

$$\mathcal{L} = \mathcal{L}_{\text{reason}} + \mathcal{L}_{\text{text}} + \mathcal{L}_{\text{audio}} + \mathcal{L}_{\text{flow}} + \mathcal{L}_{\text{voc}},$$

where
- $\mathcal{L}_{\text{reason}}$: cross-entropy loss on reasoning tokens
- $\mathcal{L}_{\text{text}}$: cross-entropy loss on enriched transcript tokens
- $\mathcal{L}_{\text{audio}}$: cross-entropy loss on discrete audio tokens
- $\mathcal{L}_{\text{flow}}$: regression loss on continuous latent audio representations
- $\mathcal{L}_{\text{voc}}$: adversarial and feature-matching loss for the HiFiGAN vocoder
Training is staged, beginning with reasoning and transcript, then incorporating full audio token prediction (Ren et al., 4 Jan 2026).
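A minimal PyTorch-style sketch of how such a composite multi-task objective might be assembled is shown below; the loss names mirror the bullets above, and the uniform weighting is an assumption rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def composite_sft_loss(outputs: dict, targets: dict, weights: dict | None = None) -> torch.Tensor:
    """Weighted sum of the multi-task SFT terms; the vocoder's GAN objective is
    trained separately and is therefore not included here."""
    w = weights or {"reason": 1.0, "text": 1.0, "audio": 1.0, "flow": 1.0}
    # Token-level cross-entropy terms: logits are (num_tokens, vocab), ids are (num_tokens,).
    l_reason = F.cross_entropy(outputs["reason_logits"], targets["reason_ids"])  # reasoning chain
    l_text = F.cross_entropy(outputs["text_logits"], targets["text_ids"])        # enriched transcript
    l_audio = F.cross_entropy(outputs["audio_logits"], targets["audio_ids"])     # discrete audio tokens
    # Regression on continuous latent audio representations.
    l_flow = F.mse_loss(outputs["latent_pred"], targets["latent_ref"])
    return (w["reason"] * l_reason + w["text"] * l_text
            + w["audio"] * l_audio + w["flow"] * l_flow)
```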
5. Speech Synthesis Module Details
- Acoustic Model: The Step-Audio-2-mini-Base transformer predicts discrete audio tokens auto-regressively, conditioned on prior tokens, the enriched text $T'$, and concatenated attribute vectors $[e; a; p]$. A flow-matching regression aligns latent states and codec latents (a code sketch follows this list):

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{t, z_0, z_1}\big\|\, v_\theta(z_t, t \mid h) - (z_1 - z_0) \,\big\|^2, \qquad z_t = (1 - t)\, z_0 + t\, z_1,$$

where $z_1$ is the target codec latent, $z_0 \sim \mathcal{N}(0, I)$, and $h$ is the model's latent state.
- Neural Vocoder: HiFiGAN transforms detokenized codec features or predicted mel-spectrograms to waveform. It is trained with a mix of adversarial, feature-matching, and mel-reconstruction objectives:

$$\mathcal{L}_{\text{voc}} = \mathcal{L}_{\text{adv}}(G; D) + \lambda_{\text{fm}}\, \mathcal{L}_{\text{fm}}(G; D) + \lambda_{\text{mel}}\, \mathcal{L}_{\text{mel}}(G),$$

where $\mathcal{L}_{\text{mel}}(G) = \lVert \phi(x) - \phi(\hat{x}) \rVert_1$ is the $L_1$ distance between mel-spectrograms $\phi(\cdot)$ of the reference and generated waveforms, and $\mathcal{L}_{\text{fm}}$ matches intermediate discriminator features.
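The following is a minimal PyTorch sketch of a conditional flow-matching regression of the kind described for the acoustic model; the velocity-network interface `v_theta(z_t, t, h)` and the linear interpolation path are assumptions about the general technique, not the paper's exact implementation.

```python
import torch

def flow_matching_loss(v_theta, z1: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching: regress the velocity field toward (z1 - z0)
    along a linear path z_t = (1 - t) * z0 + t * z1, conditioned on latent state h.

    z1: target codec latents, shape (batch, frames, dim)
    h:  acoustic-model latent states used as conditioning
    """
    batch = z1.shape[0]
    z0 = torch.randn_like(z1)                      # noise sample z0 ~ N(0, I)
    t = torch.rand(batch, 1, 1, device=z1.device)  # per-example time t ~ U(0, 1)
    z_t = (1.0 - t) * z0 + t * z1                  # point on the interpolation path
    target_velocity = z1 - z0                      # ground-truth velocity of the path
    pred_velocity = v_theta(z_t, t, h)             # network prediction v_theta(z_t, t | h)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```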
6. Evaluation Protocol and Results
OV-InstructTTS introduces a suite of metrics targeting both instruction fidelity and conventional TTS benchmarks:
- Instruction Following: Gemini Score (0–100) from Gemini-2.5 LLM judge; Gemini Rank (lower is better among 6 variants).
- Intelligibility: Character Error Rate (CER) via Paraformer-zh ASR (a code sketch of the objective metrics follows this list).
- Speaker Similarity: Cosine similarity (SIM) of WavLM-large embeddings.
- Subjective: MOS for naturalness (1–5), ICMOS for instruction consistency (1–5).
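For the two objective metrics, a hedged sketch is shown below: CER is computed with the jiwer package, and speaker similarity as the cosine similarity of mean-pooled WavLM-large hidden states. Using mean-pooled hidden states as a speaker embedding is an assumption for illustration; speaker-similarity evaluations commonly attach a dedicated speaker-verification head instead, and the paper's exact embedding extraction is not specified here.

```python
import torch
import torch.nn.functional as F
from jiwer import cer
from transformers import WavLMModel

def character_error_rate(reference: str, hypothesis: str) -> float:
    # For Chinese text, spaces are typically removed so each character is one unit.
    return cer(reference.replace(" ", ""), hypothesis.replace(" ", ""))

@torch.no_grad()
def speaker_similarity(wav_a: torch.Tensor, wav_b: torch.Tensor,
                       model_name: str = "microsoft/wavlm-large") -> float:
    """Cosine similarity between utterance-level WavLM embeddings.

    wav_a, wav_b: mono waveforms at 16 kHz, shape (num_samples,).
    """
    model = WavLMModel.from_pretrained(model_name).eval()

    def embed(wav: torch.Tensor) -> torch.Tensor:
        hidden = model(input_values=wav.unsqueeze(0)).last_hidden_state  # (1, frames, dim)
        return hidden.mean(dim=1)                                        # mean-pool over time

    return F.cosine_similarity(embed(wav_a), embed(wav_b)).item()
```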
In comparisons against CosyVoice2, GPT-4o, Higgs Audio V2, and Step-Audio-2-mini, OV-InstructTTS-TEP achieves state-of-the-art performance:
| System | Gemini Score | Gemini Rank | CER | SIM | MOS (±0.14) | ICMOS (±0.17) |
|---|---|---|---|---|---|---|
| OV-InstructTTS-TEP | 70.42 | 3.39/6 | 3.61% | 0.722 | 4.28 | 3.91 |
| CosyVoice2 | 65.10–68.31 | — | 3.09% | — | — | — |
| Ground-truth | — | — | — | — | 4.10 | — |
Ablations indicate the necessity of both the reasoning chain and the enriched transcript module for top performance. Qualitative analysis shows the system handles nuanced instructions, extracting emotions (e.g., doubt, contempt), prosodic cues (slower rate, pitch contours), and paralinguistic signals ([Breathing]) in generated speech (Ren et al., 4 Jan 2026).
7. Context and Impact
OV-InstructTTS establishes a new direction for instruction-based TTS by bridging the semantic gap between open-domain user intent and low-level synthesis controls. By structuring the input–reasoning–generation pipeline, it transcends prior label-rephrasing paradigms, making complex narration and dialogue synthesis more accessible for content creation without requiring granular acoustic tuning. Its unified dataset and architecture provide replicable benchmarks, supporting ongoing research into generalizable, user-centric TTS systems. The publicly available dataset and demos facilitate further study and practical application in expressive speech technologies (Ren et al., 4 Jan 2026).