
Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations (2402.01520v1)

Published 2 Feb 2024 in cs.SD, cs.LG, and eess.AS

Abstract: In this paper, we propose a singing voice synthesis model, Karaoker-SSL, that is trained only on text and speech data as a typical multi-speaker acoustic model. It is a low-resource pipeline that does not utilize any singing data end-to-end, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned by self-supervised speech representations in an unsupervised manner. We preprocess these representations by selecting only a subset of their task-correlated dimensions. The conditioning module is indirectly guided to capture style information during training by multi-tasking. This is achieved with a Conformer-based module, which predicts the pitch from the acoustic model's output. Thus, Karaoker-SSL allows singing voice synthesis without reliance on hand-crafted and domain-specific features. There are also no requirements for text alignments or lyrics timestamps. To refine the voice quality, we employ a U-Net discriminator that is conditioned on the target speaker and follows a Diffusion GAN training scheme.
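
To make the "reduced self-supervised speech representations" idea concrete, below is a minimal illustrative sketch (not the authors' code) of one plausible way to keep only the dimensions of an SSL feature matrix that correlate with a task signal such as a frame-level pitch contour. The function name, the use of Pearson correlation, and the `top_k` cutoff are assumptions made for illustration.

```python
# Illustrative sketch only: reduce SSL speech representations by keeping the
# dimensions most correlated with a task signal (here, a stand-in pitch track).
import numpy as np

def select_task_correlated_dims(ssl_feats: np.ndarray,
                                pitch: np.ndarray,
                                top_k: int = 64) -> np.ndarray:
    """ssl_feats: (frames, dims) SSL features; pitch: (frames,) task signal.
    Returns the indices of the top_k dimensions most correlated with pitch."""
    # Center both signals, then compute Pearson correlation per SSL dimension.
    feats_c = ssl_feats - ssl_feats.mean(axis=0, keepdims=True)
    pitch_c = pitch - pitch.mean()
    num = feats_c.T @ pitch_c
    den = np.linalg.norm(feats_c, axis=0) * np.linalg.norm(pitch_c) + 1e-8
    corr = np.abs(num / den)
    # Keep the dimensions with the strongest absolute correlation.
    return np.argsort(corr)[::-1][:top_k]

# Toy usage with random data standing in for real SSL features and F0.
rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 768))   # frames x SSL dimensions
f0 = rng.standard_normal(500)             # stand-in pitch contour
kept = select_task_correlated_dims(feats, f0, top_k=64)
reduced = feats[:, kept]                  # (500, 64) reduced representation
```

The reduced matrix would then serve as the conditioning input to the acoustic model, in place of the full-dimensional SSL features.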

Authors (11)
  1. Panos Kakoulidis (10 papers)
  2. Nikolaos Ellinas (23 papers)
  3. Georgios Vamvoukakis (12 papers)
  4. Myrsini Christidou (6 papers)
  5. Alexandra Vioni (9 papers)
  6. Georgia Maniati (10 papers)
  7. Junkwang Oh (4 papers)
  8. Gunu Jho (9 papers)
  9. Inchul Hwang (12 papers)
  10. Pirros Tsiakoulis (17 papers)
  11. Aimilios Chalamandaris (17 papers)
Citations (1)
