HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models (2306.06814v1)

Published 12 Jun 2023 in eess.AS, cs.AI, cs.SD, and eess.SP

Abstract: Denoising diffusion models have recently demonstrated remarkable performance among generative models across various domains. In the speech domain, however, applying diffusion models to synthesize time-varying audio is limited in complexity and controllability, because speech synthesis requires very high-dimensional samples with long-term acoustic features. To alleviate the model-complexity challenges in singing voice synthesis, we propose HiddenSinger, a high-quality singing voice synthesis system built on a neural audio codec and latent diffusion models. To ensure high-fidelity audio, we introduce an audio autoencoder that encodes audio into a compressed latent representation (an audio codec) and reconstructs high-fidelity audio from this low-dimensional latent vector. We then use latent diffusion models to sample a latent representation from a musical score. In addition, we extend the proposed model to an unsupervised singing voice learning framework, HiddenSinger-U, which trains on an unlabeled singing voice dataset. Experimental results demonstrate that our model outperforms previous models in audio quality. Furthermore, HiddenSinger-U can synthesize high-quality singing voices for speakers learned solely from unlabeled data.
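
To make the two-stage design the abstract outlines concrete, here is a minimal, self-contained sketch: an audio autoencoder compresses frame-level audio features into a low-dimensional latent, and a latent diffusion model denoises a latent conditioned on a musical-score embedding, which the decoder then maps back to audio features. All class names, dimensions, and the crude fixed-step sampler below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Hypothetical stand-in for the paper's audio autoencoder: compresses
    frame-level audio features into a low-dimensional latent and reconstructs
    them from that latent."""
    def __init__(self, feat_dim=80, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def encode(self, x):  # x: (batch, frames, feat_dim)
        return self.enc(x)

    def decode(self, z):  # z: (batch, frames, latent_dim)
        return self.dec(z)


class LatentDenoiser(nn.Module):
    """Predicts the noise added to a latent, conditioned on a musical-score
    embedding and the diffusion step t (broadcast to every frame)."""
    def __init__(self, latent_dim=32, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 256),
                                 nn.ReLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, z_t, cond, t):
        t_feat = t.view(-1, 1, 1).expand(z_t.size(0), z_t.size(1), 1)
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))


@torch.no_grad()
def sample_latent(denoiser, cond, steps=50, latent_dim=32):
    """Toy reverse process: start from Gaussian noise and iteratively subtract
    the predicted noise. A real sampler follows a proper noise schedule."""
    z = torch.randn(cond.size(0), cond.size(1), latent_dim)
    for i in reversed(range(steps)):
        t = torch.full((z.size(0),), i / steps)
        z = z - denoiser(z, cond, t) / steps  # simplified update rule
    return z


# Usage: sample a latent from a (randomly faked) score embedding, then decode.
autoenc, denoiser = AudioAutoencoder(), LatentDenoiser()
score_cond = torch.randn(1, 100, 64)     # stands in for a lyrics/pitch encoder output
z = sample_latent(denoiser, score_cond)  # latent sampled from the musical score
audio_feats = autoenc.decode(z)          # decode latent back to audio features
print(audio_feats.shape)                 # torch.Size([1, 100, 80])
```

With untrained weights this produces noise, of course; the point is only to show how the compressed latent space lets the diffusion model operate on low-dimensional vectors rather than raw high-dimensional audio.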
