Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
The paper "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" provides a detailed examination of a neural network-based Text-to-Speech (TTS) system capable of synthesizing high-quality, natural speech in the voices of different speakers, including those unseen during training. This research integrates transfer learning techniques to decouple speaker representation from the TTS synthesis task, significantly enhancing data efficiency and generalization.
System Architecture
The proposed TTS system comprises three independently trained neural network components (a minimal inference-time sketch follows the list):
- Speaker Encoder Network: Trained on a speaker verification task using noisy, untranscribed speech from thousands of speakers, this network generates fixed-dimensional speaker embeddings. The embeddings are designed to encode speaker characteristics in a manner robust to noise and phonetic variation.
- Sequence-to-Sequence Synthesis Network (Tacotron 2): Generates mel spectrograms from text, conditioned on a speaker embedding. Through transfer learning, the synthesis network leverages the knowledge of speaker variability captured by the independently trained speaker encoder.
- WaveNet-based Vocoder: Converts the mel spectrograms into time-domain waveform samples, ensuring high-quality audio output.
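To make the data flow between these components concrete, the following is a minimal inference-time sketch. The class and method names (`embed_utterance`, `text_to_mel`, `mel_to_waveform`) are hypothetical placeholders, not the authors' released API.

```python
# Minimal sketch of the three-stage inference pipeline described above.
# All component interfaces are hypothetical placeholders.
import numpy as np


class MultispeakerTTSPipeline:
    def __init__(self, speaker_encoder, synthesizer, vocoder):
        self.speaker_encoder = speaker_encoder  # trained on a speaker verification task
        self.synthesizer = synthesizer          # Tacotron 2-style sequence-to-sequence model
        self.vocoder = vocoder                  # WaveNet-style neural vocoder

    def synthesize(self, text: str, reference_wav: np.ndarray) -> np.ndarray:
        # 1. Fixed-dimensional speaker embedding from a few seconds of reference audio.
        embedding = self.speaker_encoder.embed_utterance(reference_wav)
        # 2. Mel spectrogram generated from text, conditioned on the speaker embedding.
        mel = self.synthesizer.text_to_mel(text, speaker_embedding=embedding)
        # 3. Time-domain waveform produced by the vocoder.
        return self.vocoder.mel_to_waveform(mel)
```

This separation is what allows the speaker encoder to be trained on large untranscribed corpora while the synthesis network uses much smaller transcribed datasets.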
Methodology and Experiments
The speaker encoder's ability to generalize to unseen speakers is pivotal to the system's performance. It achieves this by training on a large dataset of 36 million utterances from 18,000 speakers. The synthesis network, in contrast, is trained on much smaller transcribed datasets (VCTK and LibriSpeech), which suffices for high-quality multispeaker synthesis because speaker modeling has been offloaded to the encoder.
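The speaker encoder in the paper is trained with a generalized end-to-end (GE2E) speaker verification loss, which pulls an utterance's embedding toward its own speaker's centroid and away from other speakers' centroids. The snippet below is a simplified sketch of the softmax variant of that loss, assuming a batch of `N` speakers with `M` utterances each whose embeddings have already been computed; the scale and offset are shown as constants, although in practice they are learned.

```python
# Simplified sketch of a GE2E-style (generalized end-to-end) speaker
# verification loss. Assumes a batch of N speakers x M utterances whose
# embeddings (shape N x M x D) have already been produced by the encoder.
import torch
import torch.nn.functional as F


def ge2e_softmax_loss(emb: torch.Tensor, w: float = 10.0, b: float = -5.0) -> torch.Tensor:
    N, M, D = emb.shape
    centroids = emb.mean(dim=1)                            # (N, D) per-speaker centroids
    # Leave-one-out centroid for each utterance's own speaker, so an utterance
    # is not compared against a centroid that already contains it.
    loo = (emb.sum(dim=1, keepdim=True) - emb) / (M - 1)   # (N, M, D)

    # Cosine similarity of every utterance to every speaker centroid.
    sim = torch.einsum("nmd,kd->nmk",
                       F.normalize(emb, dim=-1),
                       F.normalize(centroids, dim=-1))     # (N, M, N)
    idx = torch.arange(N)
    sim[idx, :, idx] = F.cosine_similarity(emb, loo, dim=-1)  # own-speaker scores

    sim = w * sim + b                                      # scale/offset (learnable in practice)
    target = idx.unsqueeze(1).expand(N, M).reshape(-1)     # true speaker index per utterance
    return F.cross_entropy(sim.reshape(N * M, N), target)
```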
The interplay between the independently trained speaker encoder and the synthesis network allows the proposed model to synthesize the voice of any speaker using only a few seconds of that speaker's reference audio. This capacity for zero-shot learning is notable, as it obviates the need for extensive new speaker data, marking a significant advancement in TTS systems.
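At inference time, the few seconds of reference audio can be reduced to a single fixed-dimensional embedding by embedding overlapping windows of the utterance and averaging the results. The sketch below assumes such a windowed scheme; the window length, overlap, and `embed_window` call are illustrative choices, not the paper's exact settings.

```python
# Hedged sketch of computing an utterance-level speaker embedding from
# frame-level log-mel features: embed overlapping windows, average, and
# L2-normalize. Window size and overlap are illustrative assumptions.
import numpy as np


def embed_utterance(mel_frames: np.ndarray, encoder, window: int = 160, overlap: float = 0.5) -> np.ndarray:
    step = max(1, int(window * (1 - overlap)))
    starts = range(0, max(1, len(mel_frames) - window + 1), step)
    partials = np.stack([encoder.embed_window(mel_frames[s:s + window]) for s in starts])
    embedding = partials.mean(axis=0)
    return embedding / np.linalg.norm(embedding)  # unit-norm speaker embedding
```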
Evaluations and Results
Speech Naturalness
Mean Opinion Score (MOS) evaluations show that the naturalness of the synthesized speech hovers around 4.0, close to the quality of real human speech (~4.5). Notably, naturalness is comparable for seen and unseen speakers, underscoring the model's strong generalization.
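For context, MOS is the mean of crowd-sourced ratings on a 1-to-5 scale and is typically reported with a 95% confidence interval. The snippet below simply shows that computation on made-up placeholder ratings, not data from the paper.

```python
# Mean opinion score with a normal-approximation 95% confidence interval.
# The ratings below are placeholder values for illustration only.
import numpy as np


def mos_with_ci(ratings, z: float = 1.96):
    r = np.asarray(ratings, dtype=float)
    return r.mean(), z * r.std(ddof=1) / np.sqrt(len(r))


mos, ci = mos_with_ci([4, 5, 4, 3, 5, 4, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```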
Speaker Similarity
Similarity MOS, which measures how well the synthesized voice matches the target speaker, averages around 3.0 on a 1-to-5 scale for unseen speakers, indicating that the system conveys a reasonable impression of the target. This suggests that broad speaker characteristics (e.g., gender and pitch) transfer reliably, while finer nuances may still be missed.
Speaker Verification (SV-EER)
Objective measures using speaker verification equal error rates (SV-EER) reinforce subjective findings. The model's performance varies with the size of the speaker encoder's training set, with larger, more diverse training sets yielding lower EERs and thus more accurate speaker representation. With a speaker encoder trained on 18,000 speakers, the SV-EERs for synthesized speech are around 5%, illustrating effective speaker characteristic retention.
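As a reminder of the metric, the SV-EER is the operating point at which a verification system's false acceptance rate equals its false rejection rate. The following is a minimal sketch of computing it from cosine-similarity trial scores; the genuine/impostor score arrays are assumed inputs rather than data from the paper.

```python
# Minimal sketch: equal error rate from same-speaker (genuine) and
# cross-speaker (impostor) similarity scores.
import numpy as np


def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejection rate
    i = np.argmin(np.abs(far - frr))                              # point where FAR ~= FRR
    return float((far[i] + frr[i]) / 2)
```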
Implications and Future Work
This approach's core implication lies in its data efficiency and robust generalization, making it particularly valuable in low-resource settings or applications requiring rapid adaptation to new speakers. However, the current model's inability to perfectly capture prosody nuances and accent transfer points to areas ripe for further research. Incorporating additional prosody modeling mechanisms or accent-specific conditioning could further enhance the model's performance.
Moreover, the paper demonstrates the generation of fictitious speakers by sampling random points in the speaker embedding space, opening avenues for further exploration of the limits and potential of such synthetic speaker generation.
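A minimal sketch of such sampling appears below. It assumes, consistent with the L2-normalized encoder output, that speaker embeddings lie on a unit hypersphere; the 256-dimensional size and the reuse of the earlier hypothetical synthesizer interface are illustrative assumptions.

```python
# Sample a fictitious speaker embedding by drawing an isotropic Gaussian
# vector and projecting it onto the unit hypersphere. The dimensionality is
# an assumption for illustration.
import numpy as np


def sample_fictitious_speaker(embedding_dim: int = 256, seed=None) -> np.ndarray:
    rng = np.random.default_rng(seed)
    point = rng.standard_normal(embedding_dim)
    return point / np.linalg.norm(point)

# Example (hypothetical pipeline interface from the earlier sketch):
# fake_embedding = sample_fictitious_speaker(seed=0)
# mel = synthesizer.text_to_mel("Hello world.", speaker_embedding=fake_embedding)
```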
Conclusion
The integration of a discriminatively trained speaker encoder with a high-quality TTS synthesis network through transfer learning marks a significant step in multispeaker TTS synthesis. This decoupling not only reduces the need for large amounts of high-quality, transcribed multispeaker data but also enables impressive zero-shot speaker adaptation. Future research should focus on refining the subtler elements of speech synthesis, including prosody and accent variation, to achieve even higher naturalness and speaker fidelity.