
Robust Singing Voice Transcription Serves Synthesis (2405.09940v2)

Published 16 May 2024 in eess.AS and cs.SD

Abstract: Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also established a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that ROSVOT achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming the capability for practical application. Audio samples are available at https://rosvot.github.io.
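To make the task concrete: note-level AST maps a singing recording to a sequence of note events, each with an onset time, an offset time, and a pitch. ROSVOT itself does this with a neural multi-scale encoder and an attention-based pitch decoder; the toy function below is only an illustrative sketch of the *output format*, deriving note events from a frame-level f0 contour by naive segmentation. All names, thresholds, and the hop size are hypothetical and not from the paper.

```python
import math

def frames_to_notes(f0_hz, hop_sec=0.01, pitch_tol=0.6):
    """Group consecutive voiced frames of similar pitch into note events.

    f0_hz: list of per-frame fundamental frequencies in Hz (0.0 = unvoiced).
    Returns a list of (onset_sec, offset_sec, midi_pitch) tuples.
    Naive segmentation for illustration only -- not ROSVOT's neural method.
    """
    notes = []
    start, pitches = None, []
    for i, f in enumerate(f0_hz + [0.0]):  # trailing sentinel flushes the last note
        midi = 69 + 12 * math.log2(f / 440.0) if f > 0 else None
        if midi is not None and (not pitches or abs(midi - pitches[-1]) <= pitch_tol):
            if start is None:
                start = i
            pitches.append(midi)
        else:
            if pitches:  # close the current note with its median pitch
                med = sorted(pitches)[len(pitches) // 2]
                notes.append((start * hop_sec, i * hop_sec, round(med)))
            start, pitches = ((i, [midi]) if midi is not None else (None, []))
    return notes
```

For example, ten frames of A4 (440 Hz), a short unvoiced gap, then ten frames of B4 (~493.88 Hz) yield two note events at MIDI pitches 69 and 71. A real transcriber must additionally handle vibrato, portamento, and noisy or polyphonic input, which is where the paper's multi-scale framework and robust pitch decoding come in.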

Authors (6)
  1. Ruiqi Li
  2. Yu Zhang
  3. Yongqi Wang
  4. Zhiqing Hong
  5. Rongjie Huang
  6. Zhou Zhao
Citations (3)
