Hierarchical Variational Inference for Zero-shot Speech Synthesis
The paper "HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis" presents a novel approach to zero-shot speech synthesis. This work introduces HierSpeech++, which advances fast and high-quality speech synthesis by employing a hierarchical variational inference framework.
Core Contributions
HierSpeech++ leverages a hierarchical structure to improve both Text-to-Speech (TTS) and Voice Conversion (VC). The framework operates without LLMs or autoregressive models, which are known for their large data requirements and slow inference. The key components of the HierSpeech++ architecture, sketched in code after this list, are:
- Hierarchical Speech Synthesizer: This component improves the robustness of synthetic speech by combining a variational autoencoder (VAE) for representation learning with a hierarchical adaptive generator (HAG) that produces high-quality waveform audio.
- Text-to-Vec (TTV) Framework: This component maps text to a self-supervised semantic speech representation and a fundamental-frequency (F0) contour, conditioned on a prosody prompt, bridging text and acoustics to achieve natural, expressive synthetic speech.
- Speech Super-resolution (SpeechSR): This component upsamples audio from 16 kHz to 48 kHz, so the main models can be trained on widely available 16 kHz data while the final output remains high-resolution.
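These components compose into a single non-autoregressive inference pipeline: text is mapped to a semantic representation and an F0 contour, the hierarchical synthesizer turns these into a 16 kHz waveform in the prompt speaker's voice, and SpeechSR upsamples the result to 48 kHz. The sketch below illustrates this flow; the module objects and call signatures (ttv, synthesizer, speech_sr) are hypothetical placeholders, not the authors' released API.

```python
# Hypothetical sketch of the HierSpeech++ inference flow described above.
# The module objects (ttv, synthesizer, speech_sr) and their signatures are
# illustrative placeholders, not the released implementation.
import torch


def synthesize(text: str, prompt_wav_16k: torch.Tensor,
               ttv, synthesizer, speech_sr) -> torch.Tensor:
    """Zero-shot TTS: text plus a short voice prompt -> 48 kHz waveform."""
    # 1. Text-to-Vec: text -> self-supervised semantic representation + F0 contour,
    #    conditioned on the prosody of the voice prompt.
    semantic, f0 = ttv(text, prosody_prompt=prompt_wav_16k)

    # 2. Hierarchical speech synthesizer (VAE + hierarchical adaptive generator):
    #    semantic representation + F0 -> 16 kHz waveform in the prompt speaker's style.
    wav_16k = synthesizer(semantic, f0, style_prompt=prompt_wav_16k)

    # 3. SpeechSR: upsample 16 kHz -> 48 kHz for the final high-resolution output.
    return speech_sr(wav_16k)
```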
Results and Comparisons
The paper demonstrates that HierSpeech++ outperforms other state-of-the-art methodologies such as diffusion-based models and prominent LLM-based models. Notable experimental outcomes include:
- Human-level perceptual quality in zero-shot speech synthesis for both TTS and VC tasks, surpassing previous frameworks.
- Lower character and word error rates (CER/WER), indicating intelligible and robust text-to-speech conversion (a sketch of how these metrics are typically computed follows this list).
- Robustness to noisy input data, maintaining stable performance relative to diffusion-based models.
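For context on the error-rate comparison above, intelligibility of synthesized speech is commonly measured by transcribing the generated audio with an ASR model and scoring the transcript against the input text. A minimal sketch follows; the choice of Whisper for ASR and jiwer for scoring is an assumption for illustration, not necessarily the paper's evaluation setup.

```python
# Minimal sketch of a CER/WER evaluation for synthesized speech.
# Whisper (ASR) and jiwer (error-rate scoring) are illustrative choices,
# not necessarily the tools used in the paper.
import whisper  # pip install openai-whisper
import jiwer    # pip install jiwer


def intelligibility_scores(wav_path: str, reference_text: str) -> tuple[float, float]:
    """Transcribe a synthesized utterance and score it against the reference text."""
    asr = whisper.load_model("base")
    hypothesis = asr.transcribe(wav_path)["text"]
    cer = jiwer.cer(reference_text, hypothesis)  # character error rate
    wer = jiwer.wer(reference_text, hypothesis)  # word error rate
    return cer, wer
```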
The authors also report that HierSpeech++ scales successfully to large datasets without requiring textual transcripts or labels for training, making it a versatile and practical solution.
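Training without transcripts is possible because the synthesizer consumes self-supervised speech representations extracted directly from raw audio rather than text-derived features. The sketch below shows how such representations can be obtained with a wav2vec 2.0-style model via HuggingFace transformers; the specific checkpoint is a stand-in, not necessarily the representation model used in the paper.

```python
# Sketch: extracting self-supervised semantic features from raw audio, so no
# transcript is needed. The checkpoint "facebook/wav2vec2-base" is an
# illustrative stand-in for whichever SSL model the paper actually uses.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()


def semantic_features(wav_path: str) -> torch.Tensor:
    """Raw waveform file -> frame-level self-supervised representation."""
    wav, sr = torchaudio.load(wav_path)  # (channels, samples)
    if sr != 16_000:                     # the SSL model expects 16 kHz mono input
        wav = torchaudio.functional.resample(wav, sr, 16_000)
    inputs = extractor(wav.mean(dim=0).numpy(), sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        return ssl_model(**inputs).last_hidden_state  # (1, frames, hidden_dim)
```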
Implications and Future Directions
HierSpeech++ holds substantial implications for the field of AI-driven audio processing. Its ability to function efficiently without extensive labeled resources reduces the barrier for application development in diverse languages and dialects. Moreover, its design supports zero-shot learning, which is crucial for tasks like multilingual speech synthesis and variable voice style generation.
Future work could extend the model to cross-lingual and emotion-controllable synthesis. The flexibility and efficiency of the fully non-autoregressive, hierarchical setup suggest significant potential for inclusion in broader AI systems, such as real-time translation and synthetic media creation.
In summary, the paper contributes a significant step forward in speech synthesis, particularly emphasizing rapid, scalable, and human-quality synthetic voice generation without excessive computational overhead.