PrimaDNN': A Characteristics-aware DNN Customization for Singing Technique Detection (2306.14191v1)

Published 25 Jun 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Professional vocalists modulate their voice timbre or pitch to make their performances more expressive. Such modulations are called singing techniques. Automatically detecting singing techniques from audio tracks can help us understand how each singer shapes a performance, yet the task is difficult because singing techniques are highly varied. A deep neural network (DNN) model can handle such variety; nevertheless, explicitly accounting for the characteristics of the data may further improve singing technique detection. In this paper, we propose PrimaDNN, a CRNN model with characteristics-oriented improvements. Its key features are: 1) an input representation based on auxiliary pitch information and multi-resolution mel spectrograms, and 2) a convolution module based on Squeeze-and-Excitation (SENet) and instance normalization. On J-POP singing technique detection, PrimaDNN achieved the best overall macro-F measure of 44.9%, outperforming conventional methods. We also found that the contribution of each component varies with the type of singing technique.
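
The abstract highlights two characteristics-oriented components: a multi-resolution mel-spectrogram input augmented with auxiliary pitch information, and a convolution module that combines Squeeze-and-Excitation (SE) channel attention with instance normalization. The paper's exact layer configuration is not reproduced here, so the PyTorch sketch below is only an illustration of how such a block might be assembled; the SEInstanceNormBlock name, all layer sizes, and the idea of stacking the pitch feature as an extra input channel are assumptions, not the authors' implementation.

  import torch
  import torch.nn as nn

  class SEInstanceNormBlock(nn.Module):
      """Illustrative conv block: instance normalization + squeeze-and-excitation.

      Hypothetical layer sizes; not the configuration reported in the paper.
      """

      def __init__(self, in_ch: int, out_ch: int, se_reduction: int = 8):
          super().__init__()
          self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
          # Instance normalization over each (channel, sample) feature map.
          self.norm = nn.InstanceNorm2d(out_ch, affine=True)
          self.act = nn.ReLU(inplace=True)
          # Squeeze-and-excitation: global pooling -> bottleneck MLP -> channel gates.
          self.se = nn.Sequential(
              nn.AdaptiveAvgPool2d(1),
              nn.Conv2d(out_ch, out_ch // se_reduction, kernel_size=1),
              nn.ReLU(inplace=True),
              nn.Conv2d(out_ch // se_reduction, out_ch, kernel_size=1),
              nn.Sigmoid(),
          )

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          x = self.act(self.norm(self.conv(x)))
          return x * self.se(x)  # reweight channels by learned importance

  # Hypothetical input: multi-resolution mel spectrograms plus a pitch feature,
  # stacked along the channel axis as (batch, channels, mel_bins, frames).
  batch = torch.randn(2, 3, 128, 256)  # e.g. two mel resolutions + one pitch map
  block = SEInstanceNormBlock(in_ch=3, out_ch=32)
  print(block(batch).shape)  # torch.Size([2, 32, 128, 256])

Instance normalization is commonly used in voice conversion and style transfer to suppress instance-specific (e.g., singer-dependent) statistics, which is presumably why it is paired here with SE attention that re-emphasizes the channels most informative for each technique.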
