Building speech corpus with diverse voice characteristics for its prompt-based representation (2403.13353v1)

Published 20 Mar 2024 in cs.SD and eess.AS

Abstract: In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text-prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics. While previous research has explored prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limits the diversity of available voice characteristics. We address this gap by creating a novel corpus and developing a model for prompt-based manipulation of voice characteristics in text-to-speech synthesis, facilitating a broader range of voice characteristics. Specifically, we propose a method to build a sizable corpus that pairs voice-characteristics descriptions with corresponding speech samples. The method automatically gathers voice-related speech data from the Internet, ensures its quality, and manually annotates it via crowdsourcing. We implement this method with Japanese-language data and analyze the results to validate its effectiveness. We then propose a method to construct a model that retrieves speech from voice-characteristics descriptions, based on contrastive learning. The model is trained not only with conventional contrastive learning but also with feature-prediction learning, which predicts quantitative speech features corresponding to the voice characteristics. We evaluate the model's performance in experiments on the constructed corpus.
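
The abstract describes a dual-encoder retrieval model trained with a contrastive objective between voice-characteristics descriptions and speech, plus an auxiliary feature-prediction objective. The following is a minimal sketch of such a combined loss, not the authors' implementation: the encoder architectures, the choice of quantitative features, and the weight `lambda_feat` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_plus_feature_loss(text_emb, speech_emb, feat_pred, feat_target,
                                  temperature=0.07, lambda_feat=1.0):
    """Sketch of a CLIP/CLAP-style training objective with an auxiliary regression head.

    text_emb, speech_emb: (B, D) embeddings from the text and speech encoders.
    feat_pred, feat_target: (B, F) predicted vs. measured quantitative speech features
                            (e.g. pitch or speaking rate; the feature set is assumed).
    """
    # Symmetric InfoNCE over the in-batch description/speech pairs.
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2s = F.cross_entropy(logits, targets)                # text -> speech retrieval
    loss_s2t = F.cross_entropy(logits.t(), targets)            # speech -> text retrieval
    contrastive = 0.5 * (loss_t2s + loss_s2t)

    # Auxiliary feature-prediction loss on quantitative speech features.
    feature = F.mse_loss(feat_pred, feat_target)

    return contrastive + lambda_feat * feature
```

At retrieval time, a model trained this way would rank candidate speech samples by the cosine similarity between their embeddings and the embedding of the input description.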

