SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion (2402.12660v2)
Abstract: In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the diffusion generation process and resulting conversions. Through comparative and comprehensive evaluations, SingVisio demonstrates its effectiveness in terms of system design, functionality, explainability, and user-friendliness. It offers users of various backgrounds valuable learning experiences and insights into the diffusion model for singing voice conversion.