More than words: Advancements and challenges in speech recognition for singing (2403.09298v1)
Abstract: This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing poses difficulties of its own, including extensive pitch variations, diverse vocal styles, and background music interference. I explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I describe some of my own experiences from researching these tasks just as they were starting to gain traction, and also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.