HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models (2306.06814v1)
Abstract: Recently, denoising diffusion models have demonstrated remarkable performance among generative models in various domains. In the speech domain, however, applying diffusion models to synthesize time-varying audio is limited by complexity and controllability, as speech synthesis requires very high-dimensional samples with long-term acoustic features. To alleviate the challenges posed by model complexity in singing voice synthesis, we propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models. To ensure high-fidelity audio, we introduce an audio autoencoder that encodes audio into a compressed codec representation and reconstructs high-fidelity audio from the low-dimensional latent vector. Subsequently, we use a latent diffusion model to sample a latent representation conditioned on a musical score. In addition, our proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U, which trains the model on an unlabeled singing voice dataset. Experimental results demonstrate that our model outperforms previous models in terms of audio quality. Furthermore, HiddenSinger-U can synthesize high-quality singing voices for speakers trained solely on unlabeled data.
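The abstract describes a two-stage design: an audio autoencoder that compresses audio into a low-dimensional latent, and a latent diffusion model that generates that latent conditioned on a musical score. The following is a minimal sketch of this structure, assuming simplified stand-in modules, an illustrative noising schedule, and hypothetical names; it is not the paper's actual architecture or hyperparameters.

```python
# Minimal sketch (assumption: simplified stand-ins, not HiddenSinger's real modules) of
# latent diffusion over compressed audio latents, as outlined in the abstract.
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Compress a frame-level audio feature into a low-dimensional latent and back."""
    def __init__(self, in_dim=80, latent_dim=16):
        super().__init__()
        self.encoder = nn.Conv1d(in_dim, latent_dim, kernel_size=3, padding=1)
        self.decoder = nn.Conv1d(latent_dim, in_dim, kernel_size=3, padding=1)

    def encode(self, audio_feat):   # (B, in_dim, T) -> (B, latent_dim, T)
        return self.encoder(audio_feat)

    def decode(self, latent):       # (B, latent_dim, T) -> (B, in_dim, T)
        return self.decoder(latent)

class LatentDenoiser(nn.Module):
    """Predict the noise added to the latent, conditioned on a musical-score embedding."""
    def __init__(self, latent_dim=16, cond_dim=16):
        super().__init__()
        self.net = nn.Conv1d(latent_dim + cond_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, score_cond, t):
        # A real denoiser would also embed the noise level t; omitted for brevity.
        return self.net(torch.cat([noisy_latent, score_cond], dim=1))

def diffusion_training_step(autoencoder, denoiser, audio_feat, score_cond):
    """One DDPM-style epsilon-prediction step on the compressed latent."""
    with torch.no_grad():
        z0 = autoencoder.encode(audio_feat)            # clean latent from the frozen autoencoder
    t = torch.rand(z0.size(0), 1, 1)                   # random noise level per sample
    noise = torch.randn_like(z0)
    z_t = (1 - t).sqrt() * z0 + t.sqrt() * noise       # illustrative forward noising
    pred_noise = denoiser(z_t, score_cond, t)
    return torch.mean((pred_noise - noise) ** 2)       # standard noise-prediction loss

# Usage: the diffusion model only ever sees the low-dimensional latent, never raw audio.
ae, dn = AudioAutoencoder(), LatentDenoiser()
loss = diffusion_training_step(ae, dn, torch.randn(2, 80, 100), torch.randn(2, 16, 100))
```

The point of the design is that generation happens in the compressed latent space, which keeps the diffusion model small; the autoencoder alone is responsible for reconstructing high-fidelity audio from the sampled latent.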