Deep Voice 2: Multi-Speaker Neural Text-to-Speech (1705.08947v2)

Published 24 May 2017 in cs.CL

Abstract: We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

The research presented in "Deep Voice 2: Multi-Speaker Neural Text-to-Speech" enhances neural text-to-speech (TTS) systems with low-dimensional trainable speaker embeddings, enabling a single model to generate multiple voices. The work builds on previous advancements in neural TTS, improving on the single-speaker systems Deep Voice 1 and Tacotron and extending them to hundreds of distinct voices with limited data per speaker.

Methodology and Contributions

  1. Deep Voice 2 Architecture: The Deep Voice 2 system retains the foundational pipeline of its predecessor, Deep Voice 1, but employs high-performance components to deliver a substantial increase in audio quality.
  2. Improved Tacotron with Neural Vocoder: The integration of a WaveNet-based spectrogram-to-audio neural vocoder in Tacotron replaces the traditional Griffin-Lim algorithm, enhancing overall audio output quality. This demonstrates the feasibility of using neural vocoders in TTS for more natural-sounding speech.
  3. Multi-Speaker Training: Introducing trainable speaker embeddings into the Deep Voice 2 and Tacotron models allows a single neural TTS framework to learn and produce a wide variety of voices. The embedding method enables extensive parameter sharing among different voices within the model, significantly reducing the data required per speaker compared to single-speaker models (a minimal sketch of this conditioning follows the list).
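
The core mechanism lends itself to a short illustration. Below is a minimal PyTorch sketch of per-speaker conditioning with a trainable embedding table; the class and parameter names (SpeakerConditionedRNN, embed_dim, and so on) are illustrative assumptions, not from the paper, which conditions several model sites (recurrent initial states, nonlinearity biases) rather than the single site shown here.

```python
import torch
import torch.nn as nn

class SpeakerConditionedRNN(nn.Module):
    """Toy decoder: one low-dimensional trainable vector per speaker,
    learned jointly with the shared weights (illustrative only)."""

    def __init__(self, num_speakers, embed_dim=16, input_dim=80, hidden_dim=256):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, embed_dim)
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)
        # Map the speaker vector to the RNN's initial hidden state so the
        # whole recurrence is speaker-dependent.
        self.to_init_state = nn.Linear(embed_dim, hidden_dim)

    def forward(self, frames, speaker_ids):
        # frames: (batch, time, input_dim); speaker_ids: (batch,)
        emb = self.speaker_embedding(speaker_ids)              # (batch, embed_dim)
        h0 = torch.tanh(self.to_init_state(emb)).unsqueeze(0)  # (1, batch, hidden_dim)
        out, _ = self.rnn(frames, h0)
        return out

# Same shared weights, two different voices selected purely by index
# (108 matches the VCTK speaker count used in the paper's experiments).
model = SpeakerConditionedRNN(num_speakers=108)
frames = torch.randn(2, 50, 80)
out = model(frames, torch.tensor([3, 77]))
```

Because every speaker contributes gradients to the same shared weights, only the small embedding table grows with the number of voices, which is what makes under half an hour of data per speaker sufficient.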

Results

The paper reports thorough experimental results demonstrating Deep Voice 2's superiority over Deep Voice 1 and the gains from pairing Tacotron with a neural vocoder. Notably:

  • Deep Voice 2 raised the Mean Opinion Score (MOS) from 2.05 (Deep Voice 1) to 2.96, confirming the improvement in audio quality.
  • Tacotron, when paired with the WaveNet neural vocoder, achieved an MOS of 4.17, well above the 2.57 obtained with the Griffin-Lim approach.
  • Multi-speaker evaluations show that Deep Voice 2 can generate high-quality multi-speaker outputs with near-perfect speaker identity preservation, achieving classification accuracies comparable to ground truth samples.

Implications and Future Work

The implications of this research are both practical and theoretical. Practically, the ability to generate high-fidelity multi-speaker TTS with minimal data per speaker has significant potential across various applications such as accessibility tools, virtual assistants, and media production. Theoretically, this work advances understanding in the domain of neural TTS systems, particularly in efficient speaker representation and model scalability.

Future investigations could explore the scalability limits of these methods, examining how many speakers can be effectively incorporated and the minimum data required for high-quality synthesis. Research could also focus on adapting trained models to new speakers, potentially allowing speaker embeddings to be updated without retraining the entire system; a hypothetical version of this idea is sketched below. Finally, the learned embeddings could be leveraged for other tasks, such as speaker conversion or voice cloning, expanding their utility beyond TTS.
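
Continuing the toy SpeakerConditionedRNN sketch above, one plausible embedding-only adaptation recipe (an assumption for illustration, not a procedure described in the paper) is to freeze the shared weights, append a fresh embedding row for the unseen speaker, and optimize only the embedding table:

```python
import torch
import torch.nn as nn

# Hypothetical adaptation; reuses SpeakerConditionedRNN from the earlier sketch.
model = SpeakerConditionedRNN(num_speakers=108)

# Freeze every shared weight so adaptation cannot disturb existing voices.
for p in model.parameters():
    p.requires_grad = False

# Append one fresh row for the new speaker (index 108) and unfreeze the table.
old = model.speaker_embedding.weight.data
new_row = torch.randn(1, old.size(1)) * 0.01
model.speaker_embedding = nn.Embedding.from_pretrained(
    torch.cat([old, new_row], dim=0), freeze=False
)

# One illustrative gradient step on the new speaker's data. Since only the
# looked-up row receives a gradient, plain SGD moves just the new row.
opt = torch.optim.SGD(model.speaker_embedding.parameters(), lr=1e-2)
frames = torch.randn(1, 50, 80)
loss = model(frames, torch.tensor([108])).pow(2).mean()  # placeholder loss
loss.backward()
opt.step()
```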

This paper exemplifies the continuing evolution of neural TTS systems, bridging the gap towards more versatile and data-efficient multi-speaker models.

Authors (8)
  1. Sercan Arik (9 papers)
  2. Gregory Diamos (11 papers)
  3. Andrew Gibiansky (5 papers)
  4. John Miller (41 papers)
  5. Kainan Peng (11 papers)
  6. Wei Ping (51 papers)
  7. Jonathan Raiman (17 papers)
  8. Yanqi Zhou (30 papers)
Citations (483)