
AdaSpeech: Adaptive Text to Speech for Custom Voice (2103.00993v1)

Published 1 Mar 2021 in eess.AS, cs.AI, cs.CL, and cs.SD

Abstract: Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using few speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation. We pre-train the source TTS model on LibriTTS datasets and fine-tune it on VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with few adaptation data, e.g., 20 sentences, about 1 minute speech. Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.

Authors (7)
  1. Mingjian Chen (11 papers)
  2. Xu Tan (164 papers)
  3. Bohan Li (88 papers)
  4. Yanqing Liu (48 papers)
  5. Tao Qin (201 papers)
  6. Sheng Zhao (75 papers)
  7. Tie-Yan Liu (242 papers)
Citations (177)

Summary

Overview of AdaSpeech: Adaptive Text to Speech for Custom Voice

AdaSpeech: Adaptive Text to Speech for Custom Voice develops a text-to-speech (TTS) model that generates custom voices effectively from minimal adaptation data. The primary goal is to let speech synthesis systems adapt to a wide range of speakers and acoustic conditions, serving commercial applications where custom voice generation is increasingly in demand.

Key Challenges and Proposed Solutions

Custom voice generation within TTS systems faces two main challenges:

  1. Diverse Acoustic Conditions: Adaptation data often exhibits varied acoustic properties relative to the initial training data. This includes differences in speaker style, emotion, accent, prosody, and recording environments.
  2. Parameter Efficiency: The adaptation process must add as few speaker-specific parameters as possible to keep memory usage low while preserving high voice quality, a constraint that becomes critical when scaling to numerous users.

AdaSpeech introduces several innovative techniques to tackle these challenges:

  • Acoustic Condition Modeling: This technique encodes acoustic conditions at both the utterance and phoneme levels. During pre-training and fine-tuning, AdaSpeech uses two acoustic encoders to extract these vectors from the target speech, capturing both global and local acoustic conditions; at inference, the utterance-level vector is extracted from a reference utterance and the phoneme-level vectors are predicted by an acoustic predictor. This is crucial for adapting to diverse acoustic properties unseen in the source training data.
  • Conditional Layer Normalization: By conditioning the scale and bias of layer normalization in the mel-spectrogram decoder on the speaker embedding, AdaSpeech avoids fine-tuning large portions of the model. Only the conditional layer normalization parameters and the speaker embedding are fine-tuned per speaker, keeping the per-speaker footprint at approximately 5K parameters while retaining voice quality.
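
The conditional layer normalization idea can be sketched as follows. This is an illustrative NumPy implementation, not the paper's code: the projection matrices `w_gamma` and `w_beta` (which map the speaker embedding to a per-speaker scale and bias) are hypothetical names, and the dimensions are chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_layer_norm(x, speaker_emb, w_gamma, w_beta, eps=1e-5):
    """Layer normalization whose scale and bias are predicted from a
    speaker embedding instead of being fixed learned parameters.

    x: (seq_len, hidden) hidden states; speaker_emb: (emb_dim,).
    """
    gamma = w_gamma @ speaker_emb  # (hidden,) per-speaker scale
    beta = w_beta @ speaker_emb    # (hidden,) per-speaker bias
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return x_norm * gamma + beta

# Toy sizes for illustration only.
seq_len, hidden, emb_dim = 3, 8, 4
x = rng.standard_normal((seq_len, hidden))
speaker_emb = rng.standard_normal(emb_dim)
w_gamma = rng.standard_normal((hidden, emb_dim))
w_beta = rng.standard_normal((hidden, emb_dim))

y = conditional_layer_norm(x, speaker_emb, w_gamma, w_beta)
print(y.shape)  # (3, 8)
```

At adaptation time, only the conditioning parameters and the speaker embedding would be fine-tuned; since the embedding is fixed per speaker at inference, each layer's scale and bias can even be precomputed and cached as two small vectors, which is what keeps the per-speaker parameter count small.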

Experimental Validation and Results

AdaSpeech was evaluated through experiments leveraging public datasets such as LibriTTS, VCTK, and LJSpeech. The model demonstrated significant improvement in adaptation quality compared to baseline methods. Specifically, it achieved superior Mean Opinion Score (MOS) and Similarity MOS (SMOS), indicating enhanced naturalness and similarity of the synthesized voice to the target speaker.

These results show that AdaSpeech delivers custom voice solutions that are lightweight in model size while offering high-quality speech synthesis. It achieves MOS and SMOS scores comparable to models that fine-tune far more parameters, validating its approach to efficient voice adaptation.
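
A back-of-the-envelope count illustrates why the per-speaker footprint lands on the order of the reported ~5K parameters. All sizes below are assumptions chosen for illustration, not the paper's exact configuration: with a fixed speaker embedding, each conditional layer norm reduces at inference to one cached scale vector and one cached bias vector per speaker.

```python
# Rough per-speaker parameter count under assumed sizes (illustrative only).
hidden_size = 256     # assumed decoder hidden dimension
decoder_layers = 4    # assumed number of decoder layers
cln_per_layer = 2     # e.g. one after self-attention, one after the FFN

per_cln = 2 * hidden_size  # one scale vector + one bias vector
cln_params = decoder_layers * cln_per_layer * per_cln
speaker_embedding = hidden_size
total = cln_params + speaker_embedding
print(total)  # 4352
```

Under these assumed dimensions the count comes to a few thousand parameters per speaker, consistent in order of magnitude with the ~5K figure reported in the paper.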

Implications and Future Work

AdaSpeech significantly advances the practice of customized voice adaptations in commercial TTS platforms, presenting efficiencies both in adaptation performance and computational resources. Practical implementations, such as virtual assistants and navigation systems, stand to benefit from this model through more personalized and contextually appropriate voice outputs.

Future advancements in this area could explore extending acoustic condition modeling to untranscribed data scenarios. Additionally, improving the adaptability to dynamically varying acoustic environments and further reducing model footprints are identified as promising areas of research.

Conclusion

AdaSpeech makes a substantial contribution towards improving custom voice solutions, addressing the significant challenges of diverse acoustic conditions and efficient parameter usage. By integrating advanced acoustic modeling and conditional layer normalization, AdaSpeech offers a potent solution for scalable, high-quality TTS adaptation in diverse application environments.
