RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis (2410.21641v1)
Abstract: Singing voice synthesis (SVS) aims to produce high-fidelity singing audio from music scores, requiring a detailed understanding of notes, pitch, and duration, unlike text-to-speech tasks. Although diffusion models have shown exceptional performance in generative tasks such as image and video creation, their application to SVS is hindered by time complexity and the difficulty of capturing acoustic features, particularly during pitch transitions. Some networks learn a prior distribution and use the compressed latent state as a better starting point for the diffusion model, but the denoising step does not consistently improve quality over the entire duration. We introduce RDSinger, a reference-based denoising diffusion network that generates high-quality audio for SVS tasks. Our approach is inspired by Animate Anyone, a diffusion-based image network that preserves intricate appearance features from reference images. RDSinger uses a FastSpeech2 mel-spectrogram as a reference to mitigate artifacts in the denoising steps. In addition, existing models can be misled by information in the compressed latent state during pitch transitions; we address this by applying Gaussian blur to the affected regions of the reference mel-spectrogram and adjusting the loss weights in those regions. Extensive ablation studies demonstrate the efficiency of our method, and evaluations on OpenCpop, a Chinese singing dataset, show that RDSinger outperforms current state-of-the-art SVS methods.
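The abstract describes blurring the reference mel-spectrogram around pitch transitions and down-weighting the reconstruction loss there. A minimal NumPy sketch of that idea follows; the function names, blur radius, and weighting scheme are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np


def gaussian_kernel(sigma: float, radius: int) -> np.ndarray:
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def blur_transition_regions(mel: np.ndarray, regions, sigma: float = 2.0) -> np.ndarray:
    """Blur only the given frame ranges of a reference mel-spectrogram
    (shape [n_mels, n_frames]) along the time axis.  Frames outside the
    listed (start, end) regions are left untouched."""
    out = mel.copy()
    kernel = gaussian_kernel(sigma, radius=int(3 * sigma))
    for start, end in regions:
        for i in range(mel.shape[0]):  # blur each mel bin independently
            out[i, start:end] = np.convolve(mel[i, start:end], kernel, mode="same")
    return out


def weighted_l1_loss(pred: np.ndarray, target: np.ndarray, regions,
                     weight: float = 0.5) -> float:
    """L1 reconstruction loss with a reduced weight inside the blurred
    regions, so the model is penalized less where the reference mel is
    deliberately made unreliable."""
    w = np.ones_like(target)
    for start, end in regions:
        w[:, start:end] = weight
    return float(np.mean(w * np.abs(pred - target)))
```

In practice the transition regions would come from the note boundaries in the music score; here they are passed in explicitly as frame index pairs.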
- Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22563–22575.
- Real time speech enhancement in the waveform domain. arXiv preprint arXiv:2006.12847.
- Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34: 8780–8794.
- MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
- Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
- AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.
- Hu, L. 2024. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8153–8163.
- HiddenSinger: High-quality singing voice synthesis via neural audio codec and latent diffusion models. arXiv preprint arXiv:2306.06814.
- Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71: 1–15.
- Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, 5530–5540. PMLR.
- HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33: 17022–17033.
- DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761.
- DiffSinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 11020–11028.
- WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
- Wave-GAN: a deep learning approach for the prediction of nonlinear regular wave loads and run-up on a fixed cylinder. Coastal Engineering, 167: 103902.
- Grad-TTS: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, 8599–8608. PMLR.
- WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. IEEE.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
- DNSMOS P. 835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 886–890. IEEE.
- FastSpeech 2: Fast and high-quality end-to-end text-to-speech. arXiv preprint arXiv:2006.04558.
- DeepSinger: Singing voice synthesis with data mined from the web. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1979–1989.
- Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), volume 2, 749–752. IEEE.
- A hierarchical latent vector model for learning long-term structure in music. In International Conference on Machine Learning, 4364–4373. PMLR.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
- Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. IEEE.
- An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7): 2125–2136.
- OpenCpop: A high-quality open-source Chinese popular song corpus for singing voice synthesis. arXiv preprint arXiv:2201.07429.
- ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
- Adversarially trained multi-singer sequence-to-sequence singing synthesizer. arXiv preprint arXiv:2006.10317.
- Dynamic Sliding Window for Realtime Denoising Networks. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 361–365. IEEE.
- Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6199–6203. IEEE.
- VISinger: Variational inference with adversarial learning for end-to-end singing voice synthesis. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7237–7241. IEEE.