Singer Identity Representation Learning using Self-Supervised Techniques (2401.05064v1)
Abstract: Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training to ensure that the representations are invariant to pitch and content variations. We evaluate the quality of the resulting representations on singer similarity and identification tasks across multiple datasets, with a particular emphasis on out-of-domain generalization. Our proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice while operating at 44.1 kHz. We release our code and trained models to facilitate further research on singing voice and related areas.
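The abstract describes contrastive self-supervised training on isolated vocal tracks, with augmentations chosen so the learned singer embeddings ignore pitch and content. A minimal sketch of that idea follows: a SimCLR-style NT-Xent loss over two pitch-shifted views of each clip. The encoder, the fixed ±2 semitone shifts, and the training-step wrapper are illustrative assumptions, not the authors' exact setup or released code.

```python
# Minimal sketch (assumes PyTorch + torchaudio; encoder and augmentation
# strengths are illustrative, not the paper's exact configuration).
import torch
import torch.nn.functional as F
import torchaudio

SAMPLE_RATE = 44_100  # the framework operates on 44.1 kHz audio

def pitch_shift_view(wave: torch.Tensor, semitones: float) -> torch.Tensor:
    """One augmented 'view': pitch-shifting the clip pushes the encoder
    to produce embeddings that do not depend on pitch."""
    return torchaudio.functional.pitch_shift(wave, SAMPLE_RATE, n_steps=semitones)

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """SimCLR-style contrastive loss between two batches of embeddings.
    z1[i] and z2[i] are views of the same clip (positives); all other
    pairs in the combined batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, D)
    sim = (z @ z.t()) / tau                                  # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                        # exclude self-pairs
    n = z.shape[0]
    targets = torch.arange(n, device=z.device).roll(n // 2)  # i <-> i + B are positives
    return F.cross_entropy(sim, targets)

def train_step(encoder, batch_waves: torch.Tensor) -> torch.Tensor:
    """Hypothetical training step: `encoder` maps waveforms to embeddings."""
    v1 = pitch_shift_view(batch_waves, semitones=+2.0)
    v2 = pitch_shift_view(batch_waves, semitones=-2.0)
    return nt_xent(encoder(v1), encoder(v2))
```

Because the two views of a clip differ in pitch but share the same singer, minimizing this loss encourages the embedding to encode singer identity rather than the pitch or lyrical content of the excerpt.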
Authors: Bernardo Torres, Stefan Lattner, Gaël Richard