Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages (2306.07744v1)
Abstract: Lyrics alignment has gained considerable attention in recent years. State-of-the-art systems either re-use established speech recognition toolkits or design end-to-end solutions involving a Connectionist Temporal Classification (CTC) loss. However, both approaches suffer from specific weaknesses: toolkits are known for their complexity, and CTC systems use a loss designed for transcription, which can limit alignment accuracy. In this paper, we instead use a contrastive learning procedure that derives cross-modal embeddings linking the audio and text domains. This way, we obtain a novel system that is simple to train end-to-end, can make use of weakly annotated training data, jointly learns a powerful text model, and is tailored to alignment. The system is not only the first to yield an average absolute error below 0.2 seconds on the standard Jamendo dataset, but is also robust to other languages, even when trained on English data only. Finally, we release word-level alignments for the JamendoLyrics Multi-Lang dataset.
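The core idea described in the abstract can be sketched in two parts: a symmetric contrastive (InfoNCE-style) loss that pulls matched audio/text embedding pairs together, and a monotonic decoding of the resulting cosine-similarity matrix to produce an alignment. The NumPy sketch below is illustrative only and assumes a DTW-style decoder; function names, the temperature value, and the exact loss form are assumptions, not the paper's actual implementation:

```python
import numpy as np

def logsumexp(x, axis=None, keepdims=False):
    # Numerically stable log-sum-exp.
    m = np.max(x, axis=axis, keepdims=True)
    s = np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True)) + m
    return s if keepdims else np.squeeze(s, axis=axis)

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Contrastive loss over N matched (audio, text) embedding pairs.

    The i-th audio and i-th text embedding are positives; all other
    pairings in the batch act as negatives, in both directions.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature  # (N, N) scaled cosine similarities
    # Cross-entropy with the diagonal as targets, audio->text and text->audio.
    loss_a2t = -np.mean(np.diag(logits - logsumexp(logits, axis=1, keepdims=True)))
    loss_t2a = -np.mean(np.diag(logits - logsumexp(logits, axis=0, keepdims=True)))
    return 0.5 * (loss_a2t + loss_t2a)

def align(audio_emb, text_emb):
    """Monotonic alignment of audio frames to text tokens.

    Finds a minimum-cost monotonic path through the cosine-distance
    matrix (a simple DTW variant: each step advances one audio frame
    and keeps or advances the text token).
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - a @ t.T  # (T_audio, N_text) cosine distances
    T, N = cost.shape
    acc = np.full((T, N), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(1, T):
        acc[i, 0] = acc[i - 1, 0] + cost[i, 0]
        for j in range(1, N):
            acc[i, j] = cost[i, j] + min(acc[i - 1, j], acc[i - 1, j - 1])
    # Backtrack from the final cell to recover the per-frame token index.
    path, j = [N - 1], N - 1
    for i in range(T - 1, 0, -1):
        if j > 0 and acc[i - 1, j - 1] <= acc[i - 1, j]:
            j -= 1
        path.append(j)
    return list(reversed(path))  # path[i] = text token active at audio frame i
```

In this sketch the contrastive loss trains the two encoders so that cosine similarity becomes a meaningful alignment score, after which decoding needs no transcription-oriented loss such as CTC; timing then comes directly from the similarity matrix.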
- “Automatic alignment of music audio and lyrics,” in International Conference on Digital Audio Effects (DAFx), 2008.
- “Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals,” in International Symposium on Multimedia (ISM), 2006.
- Anna Marie Kruspe, Application of automatic speech recognition technologies to singing, Ph.D. thesis, TU Ilmenau, 2018.
- “LyricAlly: automatic synchronization of acoustic musical signals and textual lyrics,” in International Conference on Multimedia (ACMMM), 2004.
- “Lyrics-to-audio alignment by unsupervised discovery of repetitive patterns in vowel acoustics,” IEEE Access, vol. 5, pp. 16635–16648, 2017.
- “Automatic lyrics alignment and transcription in polyphonic music: Does background music help?,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
- “Acoustic modeling for automatic lyrics-to-audio alignment,” in Interspeech, 2019.
- “Modeling of phoneme durations for alignment between polyphonic audio and lyrics,” in Sound and Music Computing Conference (SMC), 2015.
- “Low resource audio-to-lyrics alignment from polyphonic music recordings,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- “Automatic lyrics-to-audio alignment on polyphonic music using singing-adapted acoustic models,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
- “DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm,” in International Society for Music Information Retrieval Conference (ISMIR), 2018.
- Jeffrey C Smith, Correlation analyses of encoded music performance, Ph.D. thesis, Stanford University, 2013.
- “Data cleansing with contrastive learning for vocal note event annotations,” in International Society for Music Information Retrieval Conference (ISMIR), 2020.
- “ESPnet: End-to-end speech processing toolkit,” in Interspeech, 2018.
- “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, 2020.
- “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine Learning (ICML), 2006.
- Lawrence R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
- “End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
- “Multilingual lyrics-to-audio alignment,” in International Society for Music Information Retrieval Conference (ISMIR), 2020.
- “Phoneme-to-audio alignment with recurrent neural networks for speaking and singing voice,” in Interspeech, 2021.
- “Improving lyrics alignment through joint pitch detection,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
- “Wav2CLIP: Learning robust audio representations from CLIP,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
- “Seeing wake words: Audio-visual keyword spotting,” in British Machine Vision Virtual Conference (BMVC), 2020.
- “Phoneme level lyrics alignment and text-informed singing voice separation,” Transactions on Audio, Speech, and Language Processing (TASLP), vol. 29, pp. 2382–2395, 2021.
- “Weakly supervised video moment retrieval from text queries,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- “On-line audio-to-lyrics alignment based on a reference performance,” in International Society for Music Information Retrieval Conference (ISMIR), 2021.
- “Group normalization,” in European Conference on Computer Vision (ECCV), 2018.
- “Bag of tricks for efficient text classification,” arXiv preprint arXiv:1607.01759, 2016.
- “Speech recognition with deep recurrent neural networks,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
- “JamendoLyrics Multi-Lang – an evaluation dataset for multi-language lyrics research,” https://github.com/f90/jamendolyrics/.