
OV-InstructTTS: Open-Vocabulary TTS

Updated 11 January 2026
  • OV-InstructTTS is an open-vocabulary, instruction-driven TTS system that employs a reasoning chain to infer emotional, acoustic, and paralinguistic attributes from natural language instructions.
  • It integrates a robust dataset with multi-stage annotation and enriched transcripts, enabling the synthesis of expressive, contextually-matched speech.
  • The modular framework and rigorous evaluation metrics advance nuanced, narrative-driven speech synthesis.

OV-InstructTTS is a paradigm for open-vocabulary instruction-driven text-to-speech synthesis, designed to generate speech that faithfully follows complex, high-level natural language instructions rather than rephrased acoustic labels. Central to its architecture is a reasoning-driven framework that explicitly infers emotional, acoustic, and paralinguistic attributes from flexible user instructions, then synthesizes expressive, contextually-matched speech. This marks a significant departure from prior InstructTTS approaches that relied heavily on restrictive label mappings and limited instruction comprehension. The OV-Speech dataset, developed for OV-InstructTTS, includes rich annotation pipelines, reasoning chains, and paralinguistic event tagging, enabling robust training and evaluation in open-vocabulary scenarios (Ren et al., 4 Jan 2026).

1. Task Definition and Motivation

OV-InstructTTS addresses the challenge of synthesizing speech $\hat X$ that matches both the textual content $T$ and a natural language instruction $I$. The objective is formally:

$$\hat X = \arg\max_X P\bigl(X \mid T, I, s\bigr)$$

where $s$ denotes the target speaker identity. Unlike previous systems (e.g., PromptTTS, VoxInstruct, InstructAudio), which primarily map instructions to a limited set of acoustic labels (e.g., pitch, rate, basic emotion), OV-InstructTTS targets open-vocabulary, situational, and narrative instructions (e.g., “Read this line as if you’re narrating a suspenseful chase through a rain-soaked alley”). This approach directly supports creative production scenarios, where content creators require intuitive, story-driven control over speech synthesis (Ren et al., 4 Jan 2026).
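To make the interface concrete, below is a minimal sketch of how such an instruction-conditioned synthesis call could look. The `SynthesisRequest` structure and `synthesize` wrapper are illustrative assumptions, not the paper's released API.

```python
from dataclasses import dataclass

# Hypothetical request structure mirroring the conditioning variables
# (T, I, s) in the objective above; all names here are illustrative only.
@dataclass
class SynthesisRequest:
    text: str         # T: textual content to be spoken
    instruction: str  # I: open-vocabulary natural language instruction
    speaker_id: str   # s: target speaker identity

def synthesize(model, request: SynthesisRequest):
    """Return a waveform approximating argmax_X P(X | T, I, s)."""
    # `model.generate` stands in for whatever decoding interface a
    # trained OV-InstructTTS checkpoint would expose.
    return model.generate(text=request.text,
                          instruction=request.instruction,
                          speaker=request.speaker_id)

request = SynthesisRequest(
    text="He slipped through the doorway without a sound.",
    instruction=("Read this line as if you're narrating a suspenseful "
                 "chase through a rain-soaked alley."),
    speaker_id="spk_012",
)
# waveform = synthesize(model, request)  # `model` would be a trained checkpoint
```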

2. OV-Speech: Dataset Construction and Annotation Pipeline

OV-Speech was constructed atop ContextSpeech (476.8 hours, 86 speakers, novel-aligned utterances) with a comprehensive multi-stage pipeline (a schematic code sketch follows the list):

  • Contextual Extraction: For each utterance, Qwen3-32B extracts five contextual elements (environment, current event, speaker personality, interlocutor state, speaker intent) from surrounding text.
  • Instruction Generation: Using randomly sampled context elements, Qwen3-32B produces “director-style” open-vocabulary instructions, yielding three instructions per utterance ($\sim$950K $(I, T, X)$ pairs for training).
  • Consistency Filtering: Deepseek-R1 and Qwen3-32B validate emotional and acoustic consistency, discarding pairs with insufficient annotation match.
  • Reasoning Process Annotation: For each retained sample, Qwen3-32B generates a two-stage reasoning chain: (1) instruction deconstruction (mapping to context), (2) attribute inference (emotional label $e$, acoustic descriptors $p$, paralinguistic events $\ell$).
  • Paralinguistic Event Tagging: Qwen2-Audio-7B (fine-tuned on NVSpeech170k) detects 18 paralinguistic events (e.g., [Breathing], [Laughter]), while PC-PTI preserves transcript integrity with high accuracy (0.35% S-CER) (Ren et al., 4 Jan 2026).
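The control flow of the pipeline can be summarized as follows. This is a simplified skeleton with every LLM-backed stage injected as a callable; the lambdas at the bottom are dummy stand-ins for Qwen3-32B, Deepseek-R1, and Qwen2-Audio-7B, not real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AnnotatedSample:
    transcript: str      # T
    instruction: str     # I (open-vocabulary, "director-style")
    reasoning: str       # two-stage reasoning chain
    events: List[str]    # paralinguistic tags, e.g. "[Breathing]"

def build_samples(transcript: str,
                  surrounding_text: str,
                  audio_path: str,
                  extract_context: Callable[[str], Dict[str, str]],
                  write_instruction: Callable[[Dict[str, str]], str],
                  is_consistent: Callable[[str, str], bool],
                  annotate_reasoning: Callable[[str, Dict[str, str]], str],
                  tag_events: Callable[[str], List[str]],
                  n_instructions: int = 3) -> List[AnnotatedSample]:
    context = extract_context(surrounding_text)               # stage 1: contextual extraction
    events = tag_events(audio_path)                           # stage 5: paralinguistic tagging
    samples = []
    for _ in range(n_instructions):                           # stage 2: instruction generation
        instruction = write_instruction(context)
        if not is_consistent(instruction, transcript):        # stage 3: consistency filtering
            continue
        reasoning = annotate_reasoning(instruction, context)  # stage 4: reasoning annotation
        samples.append(AnnotatedSample(transcript, instruction, reasoning, events))
    return samples

# Dummy stand-ins, just to show the call shape.
demo = build_samples(
    transcript="He slipped through the doorway.",
    surrounding_text="Rain hammered the alley. The detective was close behind...",
    audio_path="utt_0001.wav",
    extract_context=lambda ctx: {"environment": "rain-soaked alley"},
    write_instruction=lambda c: f"Narrate tensely, as if hiding in a {c['environment']}.",
    is_consistent=lambda instr, text: True,
    annotate_reasoning=lambda instr, c: "tense chase -> fearful tone, fast rate, hushed volume",
    tag_events=lambda path: ["[Breathing]"],
)
```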

The final dataset statistics are:

| Parameter | Train | Test | Notes |
|---|---|---|---|
| Utterances | 316,807 | 1,500 | Test from held-out novels, unseen speakers |
| Instructions/utt. | 3 | 3 | Avg. length ≈ 15 words (σ ≈ 5) |
| Emotional categories | 7 | | neutral, joyful, sad, angry, fear, surprise, contempt |
| Acoustic descriptors | 8 | | e.g., low pitch, slow rate, staccato rhythm |
| Paralinguistic tags | 18 | | Event detection via Qwen2-Audio-7B |

3. Reasoning-Driven Framework Architecture

The OV-InstructTTS-TEP (“text + enriched transcript + prosody,” Editor's term) framework comprises the following cascaded components:

  • Instruction Encoder & Reasoning Module: Takes $(I, T)$ as input and produces a reasoning chain $c = [c_1, \dots, c_k]$ that identifies context-invoked elements and infers explicit attributes $(e, p, \ell)$.
  • Interleaved Generator: Enriches the transcript by prepending the emotion tag and appending paralinguistic event tags: $y_\mathrm{text} = [e]\,T\,<|\ell|>$.
  • TTS Generation: Synthesizes discrete audio tokens $y_\mathrm{audio}$ conditioned on $(y_\mathrm{text}, z)$.
  • Neural Vocoder: HiFiGAN converts generated tokens to waveform $X$.

The formal reasoning and synthesis steps are:

$$P(z \mid I) = \mathrm{ReasoningModule}(I), \quad z = \{e, p, \ell\}$$

$$\hat X = \arg\max_X P\bigl(X \mid T, z\bigr) \approx \text{Vocoder}(\text{Detokenize}(y_\mathrm{audio}))$$

This explicit intermediate reasoning process enables the mapping from high-level instruction to low-level acoustic features, supporting both direct and indirect user goals (Ren et al., 4 Jan 2026).
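A minimal sketch of that cascade appears below. The class and function names, the heuristic reasoning stub, and the exact tag formatting are assumptions for exposition, not the released implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Attributes:
    emotion: str           # e
    acoustics: List[str]   # p, e.g. ["low pitch", "slow rate"]
    events: List[str]      # l, e.g. ["[Breathing]"]

def reasoning_module(instruction: str, text: str) -> Attributes:
    # Stands in for P(z | I): deconstruct the instruction, infer (e, p, l).
    # A real system runs the fine-tuned LLM here; this stub returns fixed values.
    return Attributes(emotion="fear",
                      acoustics=["low pitch", "fast rate"],
                      events=["[Breathing]"])

def enrich_transcript(text: str, z: Attributes) -> str:
    # y_text = [e] T <|l|>; the exact token formatting is an assumption.
    return " ".join([f"[{z.emotion}]", text] + z.events)

def run_cascade(text: str, instruction: str, tts_model, vocoder):
    z = reasoning_module(instruction, text)   # reasoning chain -> attributes
    y_text = enrich_transcript(text, z)       # enriched transcript
    y_audio = tts_model.generate(y_text, z)   # discrete audio tokens
    return vocoder(y_audio)                   # waveform X_hat (HiFiGAN in the paper)

z = reasoning_module("Narrate a tense chase.", "He slipped through the doorway.")
print(enrich_transcript("He slipped through the doorway.", z))
# -> "[fear] He slipped through the doorway. [Breathing]"
```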

4. Training Objectives and Loss Functions

Multi-task supervised fine-tuning (SFT) is performed on OV-Speech with the following composite objective:

$$L = \lambda_r L_{\mathrm{reason}} + \lambda_t L_{\mathrm{text}} + \lambda_a L_{\mathrm{audio}} + \lambda_f L_{\mathrm{flow}} + \lambda_g L_{\mathrm{vocoder}}$$

  • $L_{\mathrm{reason}}$: cross-entropy loss on reasoning tokens $c$
  • $L_{\mathrm{text}}$: cross-entropy loss on enriched transcript tokens $y_\mathrm{text}$
  • $L_{\mathrm{audio}}$: cross-entropy loss on discrete audio tokens $y_\mathrm{audio}$
  • $L_{\mathrm{flow}}$: regression loss on continuous latent audio representations
  • $L_{\mathrm{vocoder}}$: adversarial and feature-matching loss for the HiFiGAN vocoder

Training is staged, beginning with reasoning-chain and enriched-transcript prediction, then incorporating full audio-token prediction (Ren et al., 4 Jan 2026).
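A hedged sketch of the composite objective follows, assuming PyTorch and dummy tensor shapes; the weight values are illustrative and are not reported in the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative loss weights (not the paper's values).
LAMBDA = {"reason": 1.0, "text": 1.0, "audio": 1.0, "flow": 0.5, "vocoder": 0.1}

def composite_loss(reason_logits, reason_targets,
                   text_logits, text_targets,
                   audio_logits, audio_targets,
                   h_pred, h_true,
                   vocoder_loss):
    l_reason = F.cross_entropy(reason_logits, reason_targets)  # reasoning tokens c
    l_text = F.cross_entropy(text_logits, text_targets)        # enriched transcript y_text
    l_audio = F.cross_entropy(audio_logits, audio_targets)     # discrete audio tokens y_audio
    l_flow = F.mse_loss(h_pred, h_true)                        # continuous latent regression
    return (LAMBDA["reason"] * l_reason + LAMBDA["text"] * l_text
            + LAMBDA["audio"] * l_audio + LAMBDA["flow"] * l_flow
            + LAMBDA["vocoder"] * vocoder_loss)

# Dummy shapes: 4 token positions, vocab of 100, latent dim 8.
loss = composite_loss(
    torch.randn(4, 100), torch.randint(0, 100, (4,)),
    torch.randn(4, 100), torch.randint(0, 100, (4,)),
    torch.randn(4, 100), torch.randint(0, 100, (4,)),
    torch.randn(4, 8), torch.randn(4, 8),
    vocoder_loss=torch.tensor(0.3),
)
```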

5. Speech Synthesis Module Details

  • Acoustic Model: Step-Audio-2-mini-Base transformer predicts discrete audio tokens auto-regressively, conditioned on prior tokens, enriched text, and concatenated attribute vectors $z$. A flow-matching regression aligns latent states and codec latents (both regression-style terms are sketched after this section's equations):

$$L_{\mathrm{flow}} = \mathbb{E}\left\|\mathbf{h}_\mathrm{pred} - \mathbf{h}_\mathrm{true}\right\|^2$$

  • Neural Vocoder: HiFiGAN transforms detokenized codec features or predicted mel-spectrograms to waveform. It is trained with a mix of adversarial, feature-matching, and mel-reconstruction objectives:

$$L_{\mathrm{mel}} = \|\hat M - M\|_1 + \|\hat M - M\|_2$$

where

$$\hat M = g_\phi\bigl([\mathrm{Embed}(T)\,\Vert\,z]\bigr)$$
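The two regression-style terms above can be written compactly as below. The sketch assumes tensors for the latent states and mel-spectrograms and reads the unsquared $\|\cdot\|_2$ term as a Frobenius norm; shapes are illustrative.

```python
import torch

def flow_loss(h_pred: torch.Tensor, h_true: torch.Tensor) -> torch.Tensor:
    # L_flow = E || h_pred - h_true ||^2, averaged over the batch.
    return (h_pred - h_true).pow(2).sum(dim=-1).mean()

def mel_loss(mel_pred: torch.Tensor, mel_true: torch.Tensor) -> torch.Tensor:
    # L_mel = ||M_hat - M||_1 + ||M_hat - M||_2 (entrywise L1 plus Frobenius norm).
    diff = mel_pred - mel_true
    return diff.abs().sum() + diff.pow(2).sum().sqrt()

# Dummy shapes: batch 4, latent dim 32; 80 mel bins x 120 frames.
print(flow_loss(torch.randn(4, 32), torch.randn(4, 32)))
print(mel_loss(torch.randn(80, 120), torch.randn(80, 120)))
```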

6. Evaluation Protocol and Results

OV-InstructTTS introduces a suite of metrics targeting both instruction fidelity and conventional TTS benchmarks:

  • Instruction Following: Gemini Score (0–100) from Gemini-2.5 LLM judge; Gemini Rank (lower is better among 6 variants).
  • Intelligibility: Character Error Rate (CER) via Paraformer-zh ASR.
  • Speaker Similarity: Cosine similarity (SIM) between WavLM-large embeddings (see the sketch after this list).
  • Subjective: MOS for naturalness (1–5), ICMOS for instruction consistency (1–5).
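Of these, CER and SIM can be computed directly given ASR transcripts and speaker embeddings. A small sketch follows, assuming the WavLM-large embeddings have already been extracted; the 1024-dimensional random vectors are placeholders.

```python
import torch
import torch.nn.functional as F

def character_error_rate(hyp: str, ref: str) -> float:
    # Levenshtein distance over characters, normalized by the reference length.
    m, n = len(hyp), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (hyp[i - 1] != ref[j - 1]))  # substitution / match
            prev = cur
    return dp[n] / max(n, 1)

def speaker_similarity(emb_gen: torch.Tensor, emb_ref: torch.Tensor) -> float:
    # Cosine similarity between synthesized- and reference-speech speaker embeddings.
    return F.cosine_similarity(emb_gen, emb_ref, dim=-1).item()

print(character_error_rate("he slipped thru the door", "he slipped through the door"))
print(speaker_similarity(torch.randn(1024), torch.randn(1024)))
```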

In comparisons against CosyVoice2, GPT-4o, Higgs Audio V2, and Step-Audio-2-mini, OV-InstructTTS-TEP achieves state-of-the-art performance:

| System | Gemini Score | Gemini Rank | CER | SIM | MOS (±0.14) | ICMOS (±0.17) |
|---|---|---|---|---|---|---|
| OV-InstructTTS-TEP | 70.42 | 3.39/6 | 3.61% | 0.722 | 4.28 | 3.91 |
| CosyVoice2 | 65.10–68.31 | | 3.09% | | | |
| Ground-truth | | | | | 4.10 | |

Ablations indicate the necessity of both the reasoning chain and the enriched transcript module for top performance. Qualitative analysis shows the system handles nuanced instructions, extracting emotions (e.g., doubt, contempt), prosodic cues (slower rate, pitch contours), and paralinguistic signals ([Breathing]) in generated speech (Ren et al., 4 Jan 2026).

7. Context and Impact

OV-InstructTTS establishes a new direction for instruction-based TTS by bridging the semantic gap between open-domain user intent and low-level synthesis controls. By structuring the input–reasoning–generation pipeline, it transcends prior label-rephrasing paradigms, making complex narration and dialogue synthesis more accessible for content creation without requiring granular acoustic tuning. Its unified dataset and architecture provide replicable benchmarks, supporting ongoing research into generalizable, user-centric TTS systems. The publicly available dataset and demos facilitate further study and practical application in expressive speech technologies (Ren et al., 4 Jan 2026).
