Musical Word Embedding for Music Tagging and Retrieval (2404.13569v2)
Abstract: Word embeddings have become an essential tool for text-based information retrieval. Typically, they are learned from large quantities of general, unstructured text, so in the music domain they may fail to capture musical context or to recognize music-specific entities such as artists and tracks. To address this, we propose Musical Word Embedding (MWE), which is trained on a combination of general and music-related text and therefore covers both everyday and musical vocabulary. We integrate MWE into an audio-word joint representation framework for music tagging and retrieval, using supervision words of varying musical specificity: tags, artists, and tracks. Our experiments reveal a trade-off: supervising with a more specific musical word such as track yields better retrieval performance, while a less specific word such as tag yields better tagging performance. To balance this trade-off, we propose multi-prototype training, which jointly uses words at different levels of musical specificity. We evaluate both the word embedding and the audio-word joint embedding on four tasks (tag rank prediction, music tagging, query-by-tag, and query-by-track) across two datasets (Million Song Dataset and MTG-Jamendo). Our results show that the proposed MWE is more efficient and robust than conventional word embeddings.
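As a rough illustration of the kind of audio-word joint embedding with multi-prototype training the abstract describes, the sketch below pairs a small audio encoder with pre-trained word vectors for a track's tag, artist, and track tokens, and applies a triplet-style loss per prototype. This is a minimal sketch under assumed design choices, not the paper's implementation: the encoder architecture, the hinge-loss form, the margin value, and the names `AudioEncoder` and `multi_prototype_loss` are all illustrative.

```python
# Hypothetical sketch of audio-word joint embedding with multi-prototype
# supervision (tag / artist / track). Not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Maps a log-mel spectrogram to the shared audio-word space (illustrative architecture)."""
    def __init__(self, embed_dim: int = 300):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global pooling over time-frequency
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time) -> L2-normalized embedding (batch, embed_dim)
        h = self.conv(mel).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

def multi_prototype_loss(audio_emb, tag_vec, artist_vec, track_vec, margin=0.4):
    """Pull each audio embedding toward its paired tag/artist/track word vector
    and push it from the hardest in-batch negative (margin is an assumed value)."""
    loss = 0.0
    for word_vec in (tag_vec, artist_vec, track_vec):
        w = F.normalize(word_vec, dim=-1)
        pos = (audio_emb * w).sum(-1)          # cosine similarity to the paired word
        sim = audio_emb @ w.t()                # (batch, batch) similarity matrix
        eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        neg = sim.masked_fill(eye, -1.0).max(-1).values  # hardest in-batch negative
        loss = loss + F.relu(margin - pos + neg).mean()
    return loss / 3.0
```

In such a setup, training would draw (audio, tag), (audio, artist), and (audio, track) pairs for each item in the batch; at inference, tagging scores audio embeddings against tag vectors, while query-by-track scores them against track vectors, which matches the specificity trade-off the abstract reports.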