
Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis (2108.02271v2)

Published 4 Aug 2021 in cs.SD and eess.AS

Abstract: This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, tasks in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information in all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness, and duration, but also higher-level prosodic information that helps generate convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. Experimental results show that Daft-Exprt significantly outperforms strong baselines on inter-text cross-speaker prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that the model discards speaker identity information from the prosody representation and consistently generates speech with the desired voice. We publicly release our code and provide speech samples from our experiments.
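
The abstract names two generic mechanisms: FiLM conditioning of the acoustic model on prosodic information, and adversarial training to strip speaker identity from the prosody representation. Below is a minimal PyTorch sketch of what such building blocks typically look like, assuming a single per-utterance prosody embedding; the module names, dimensions, and the toy speaker classifier are illustrative assumptions and do not reproduce the released Daft-Exprt code.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts hidden activations
    with parameters predicted from a conditioning vector."""
    def __init__(self, cond_dim: int, feature_dim: int):
        super().__init__()
        # One linear layer predicts both the scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feature_dim)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h:    (batch, time, feature_dim) hidden activations of a TTS block
        # cond: (batch, cond_dim)          prosody embedding for the utterance
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, negated (scaled)
    gradient on the backward pass, so a speaker classifier attached here
    pushes the prosody encoder to discard speaker identity."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


# Toy usage (hypothetical shapes): modulate decoder features with a prosody
# embedding, and feed the same embedding through gradient reversal into a
# small speaker classifier used only as an adversary during training.
film = FiLM(cond_dim=128, feature_dim=256)
h = torch.randn(8, 100, 256)       # decoder hidden states
prosody = torch.randn(8, 128)      # prosody encoder output
h_mod = film(h, prosody)           # FiLM-conditioned activations

speaker_logits = nn.Linear(128, 10)(GradReverse.apply(prosody, 1.0))
```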
