
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt (2403.11780v2)

Published 18 Mar 2024 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: Recent singing-voice-synthesis (SVS) methods achieve remarkable audio quality and naturalness, yet they cannot explicitly control the style attributes of the synthesized singing. We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume through natural-language prompts. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal-range control while preserving melodic accuracy. Furthermore, we explore various experimental settings, including different types of text representations, text-encoder fine-tuning, and the introduction of speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controllability and audio quality. Audio samples are available at http://prompt-singer.github.io.
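The core idea of a range-melody decoupled pitch representation can be sketched as follows. This is a minimal toy illustration, not the paper's actual formulation: the function names (`decouple_pitch`, `recombine`), the bucket count, and the log2-Hz quantization scheme are all assumptions made for the example. The point it demonstrates is that separating a pitch contour into a coarse "range" token plus a range-invariant melody contour lets a controller change the vocal range (here, the token) without disturbing the melodic intervals.

```python
import numpy as np

def decouple_pitch(f0_hz, n_buckets=8, f0_min=65.0, f0_max=1000.0):
    """Split a per-frame F0 contour (Hz, 0 = unvoiced) into a coarse
    vocal-range token and a range-invariant melody contour.
    Illustrative sketch only; bucket count and scale are assumptions."""
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    log_f0 = np.zeros_like(f0)
    log_f0[voiced] = np.log2(f0[voiced])      # log2-Hz: +1.0 == one octave up
    mean_log = log_f0[voiced].mean()          # singer's average pitch height
    lo, hi = np.log2(f0_min), np.log2(f0_max)
    # Quantize the average pitch into one of n_buckets range tokens.
    bucket = int(np.clip((mean_log - lo) / (hi - lo) * n_buckets,
                         0, n_buckets - 1))
    # Melody = per-frame deviation from the average pitch (intervals only).
    melody = np.where(voiced, log_f0 - mean_log, 0.0)
    return bucket, melody, voiced

def recombine(melody, voiced, bucket, n_buckets=8, f0_min=65.0, f0_max=1000.0):
    """Rebuild an F0 contour from a melody contour and a (possibly
    different) target range token. Changing the token transposes the
    whole contour while keeping every melodic interval intact."""
    lo, hi = np.log2(f0_min), np.log2(f0_max)
    center = lo + (bucket + 0.5) / n_buckets * (hi - lo)  # bucket midpoint
    return np.where(voiced, 2.0 ** (center + melody), 0.0)
```

In this toy setup, feeding `recombine` a different range token than the one returned by `decouple_pitch` transposes the contour to a new register while leaving every frame-to-frame pitch ratio unchanged, which mirrors the property the paper names: text-conditioned vocal-range control with preserved melodic accuracy.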

Authors (9)
  1. Yongqi Wang (24 papers)
  2. Ruofan Hu (6 papers)
  3. Rongjie Huang (62 papers)
  4. Zhiqing Hong (13 papers)
  5. Ruiqi Li (44 papers)
  6. Wenrui Liu (11 papers)
  7. Fuming You (6 papers)
  8. Tao Jin (53 papers)
  9. Zhou Zhao (219 papers)
Citations (5)
