Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations (2402.01520v1)
Abstract: In this paper, we propose Karaoker-SSL, a singing voice synthesis model that is trained only on text and speech data, like a typical multi-speaker acoustic model. The pipeline is low-resource and uses no singing data end-to-end, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned on self-supervised speech representations in an unsupervised manner. We preprocess these representations by selecting only a subset of their task-correlated dimensions. During training, the conditioning module is indirectly guided to capture style information through multi-tasking: a Conformer-based module predicts the pitch from the acoustic model's output. Thus, Karaoker-SSL enables singing voice synthesis without relying on hand-crafted, domain-specific features, and it requires no text alignments or lyric timestamps. To refine voice quality, we employ a U-Net discriminator that is conditioned on the target speaker and follows a Diffusion GAN training scheme.
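The dimension-selection step described above can be illustrated with a short sketch. The abstract does not specify the selection criterion, so the example below assumes a simple proxy: ranking each SSL feature dimension by its absolute Pearson correlation with a frame-level pitch track and keeping the top-k. The function name `select_task_correlated_dims`, the `top_k` parameter, and the use of F0 as the task signal are illustrative assumptions, not the paper's confirmed method.

```python
# Hypothetical sketch: keep only the SSL feature dimensions that correlate
# with a proxy task signal (here, frame-level F0). The paper's exact
# selection criterion is not given in the abstract; this is an assumption.
import numpy as np

def select_task_correlated_dims(ssl_feats: np.ndarray,
                                f0: np.ndarray,
                                top_k: int = 64) -> np.ndarray:
    """ssl_feats: (T, D) frame-level SSL representations (e.g., from a
    speech SSL encoder). f0: (T,) frame-level pitch track used as the
    task signal. Returns indices of the top_k dimensions ranked by
    absolute Pearson correlation with the pitch track."""
    # Center both signals per dimension
    x = ssl_feats - ssl_feats.mean(axis=0, keepdims=True)  # (T, D)
    y = f0 - f0.mean()                                     # (T,)
    # Per-dimension covariance with the pitch track
    cov = x.T @ y / len(y)                                 # (D,)
    # Normalize to Pearson correlation; epsilon guards constant dims
    corr = cov / (x.std(axis=0) * y.std() + 1e-8)
    # Keep the most task-correlated dimensions
    return np.argsort(np.abs(corr))[::-1][:top_k]

# Usage: reduce (T, D) SSL features to the selected subset
# dims = select_task_correlated_dims(feats, f0, top_k=64)
# reduced = feats[:, dims]
```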
- Panos Kakoulidis
- Nikolaos Ellinas
- Georgios Vamvoukakis
- Myrsini Christidou
- Alexandra Vioni
- Georgia Maniati
- Junkwang Oh
- Gunu Jho
- Inchul Hwang
- Pirros Tsiakoulis
- Aimilios Chalamandaris