
Musical Word Embedding for Music Tagging and Retrieval (2404.13569v2)

Published 21 Apr 2024 in cs.SD and eess.AS

Abstract: Word embedding has become an essential means for text-based information retrieval. Typically, word embeddings are learned from large quantities of general and unstructured text data. However, in the domain of music, the word embedding may have difficulty understanding musical contexts or recognizing music-related entities like artists and tracks. To address this issue, we propose a new approach called Musical Word Embedding (MWE), which involves learning from various types of texts, including both everyday and music-related vocabulary. We integrate MWE into an audio-word joint representation framework for tagging and retrieving music, using words like tag, artist, and track that have different levels of musical specificity. Our experiments show that using a more specific musical word like track results in better retrieval performance, while using a less specific term like tag leads to better tagging performance. To balance this compromise, we suggest multi-prototype training that uses words with different levels of musical specificity jointly. We evaluate both word embedding and audio-word joint embedding on four tasks (tag rank prediction, music tagging, query-by-tag, and query-by-track) across two datasets (Million Song Dataset and MTG-Jamendo). Our findings show that the suggested MWE is more efficient and robust than the conventional word embedding.

Musical Word Embedding: Enhancing Music Tagging and Retrieval through Domain-Specific Contextualization

Introduction

The growth of digital music platforms has benefited from advances in music tagging and retrieval, two essential components of Music Information Retrieval (MIR). Methods that rely on general-purpose word embeddings often falter when interpreting domain-specific nuances such as music-related entities (artists, tracks). In response, this paper introduces Musical Word Embedding (MWE), an approach that leverages both general and music-specific text corpora to improve music tagging and retrieval across several benchmarks.

Methodology

Word Embedding Training

The proposed MWE paradigm addresses the contextual gap by incorporating texts varying in musical specificity—from general-purpose documents like Wikipedia entries to music-specific data such as review texts, tags, and artist/track IDs. This comprehensive corpus selection offers a nuanced embedding capable of understanding both broad and niche musical contexts.

For modeling word relationships, the authors employed the skip-gram model, a part of the Word2Vec suite, due to its efficacy in capturing associations between less frequently occurring words, which is beneficial for representing music-specific vocabulary.

  1. General Corpus: basic, non-music-specific text drawn from large sources such as Wikipedia.
  2. Music-Specific Corpus: artist IDs, track IDs, and tags from music datasets, together with music review text, which embeds music-related vocabulary and concepts.

Semantic connections among these words are refined by maximizing the log probability of context words given a center word, so that track and artist IDs end up close to the descriptive vocabulary that surrounds them in musical discussions.
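
As a concrete but unofficial illustration of this setup, the sketch below trains a skip-gram Word2Vec model with gensim on a toy corpus that mixes general sentences with music-specific ones in which tags, artist IDs, and track IDs appear as ordinary tokens. The corpus contents and the artist_*/track_* identifiers are illustrative placeholders, and the hyperparameters are not the paper's.

```python
# Minimal sketch of skip-gram training on a mixed general/music corpus.
# Assumes gensim >= 4; sentences and ID tokens below are illustrative only.
from gensim.models import Word2Vec

general_sentences = [
    ["music", "is", "an", "art", "form", "of", "organized", "sound"],
    ["the", "guitar", "is", "a", "string", "instrument"],
]

# Music-specific sentences: tags, artist IDs, and track IDs are treated as plain tokens,
# so the skip-gram objective pulls them toward co-occurring descriptive words.
music_sentences = [
    ["artist_0001", "track_0042", "melancholic", "indie", "folk", "acoustic"],
    ["track_0042", "slow", "tempo", "female", "vocal", "sad"],
]

model = Word2Vec(
    sentences=general_sentences + music_sentences,
    vector_size=300,   # embedding dimensionality
    window=5,          # context window size
    sg=1,              # 1 = skip-gram (vs. 0 = CBOW)
    negative=5,        # negative sampling
    min_count=1,
    epochs=50,
)

# Words, tags, and track IDs now share one space; e.g. nearest neighbours of a track ID:
print(model.wv.most_similar("track_0042", topn=3))
```

Because skip-gram predicts context words from a center word with negative sampling, infrequent tokens such as track IDs still receive useful updates whenever they co-occur with descriptive vocabulary.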

Audio-Word Joint Embedding

The dual-modality embedding employs a metric-learning framework that bridges audio and word embeddings through their contextual similarities. Music tracks and their associated text items (tags, artist IDs, track IDs) form triplets used to optimize a max-margin hinge loss under two kinds of supervisory signals:

  • Tag-based supervision
  • Artist- and track-ID-based supervision, for higher musical specificity

This joint embedding supports robust music tagging and query-by-track retrieval, and its zero-shot capability lets the model recognize and tag music categories and tags that were unseen during training.
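
A minimal sketch of the metric-learning objective, assuming PyTorch and using random tensors in place of the paper's audio encoder and negative-sampling strategy, might look as follows; the margin value and batch shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_hinge_loss(audio_emb, pos_word_emb, neg_word_emb, margin=0.4):
    """Max-margin hinge loss over (audio anchor, positive word, negative word) triplets.

    audio_emb:    (B, D) projected audio embeddings
    pos_word_emb: (B, D) word embeddings matching the audio (tag, artist ID, or track ID)
    neg_word_emb: (B, D) word embeddings sampled from non-matching items
    """
    pos_sim = F.cosine_similarity(audio_emb, pos_word_emb, dim=-1)
    neg_sim = F.cosine_similarity(audio_emb, neg_word_emb, dim=-1)
    # Penalize triplets where the negative is not at least `margin` less similar
    # to the anchor than the positive is.
    return F.relu(margin - pos_sim + neg_sim).mean()

# Toy usage with random tensors standing in for encoder outputs.
B, D = 8, 256
audio, pos, neg = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
print(float(triplet_hinge_loss(audio, pos, neg)))
```

Since the word side of each triplet lives in the MWE space, any tag or track ID that has an embedding can serve as a query at inference time, which is one way the zero-shot behaviour described above can be obtained.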

Evaluation and Results

  1. Datasets Used: The model was evaluated on the Million Song Dataset (MSD) and the MTG-Jamendo dataset across four tasks: tag rank prediction, music tagging, query-by-tag, and query-by-track.
  2. Performance Metrics (a minimal computation sketch follows this list):
    • For word embedding, normalized discounted cumulative gain (nDCG) and area under the ROC curve (ROC-AUC) were used to assess tag-to-tag and tag-to-track retrieval accuracy.
    • For audio-word joint embedding, recall at K (R@K) and ROC-AUC were used to evaluate retrieval quality and tagging accuracy.
  3. Results:
    • MWE outperformed general word embeddings on contextually rich music tagging and retrieval tasks, handling genre-specific vocabulary more accurately.
    • Audio-word metric learning, especially when trained with artist- and track-ID supervision, improved prediction and retrieval on both seen and unseen labels, making effective use of zero-shot generalization.
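
For reference, a minimal sketch of how ROC-AUC and R@K can be computed with scikit-learn and NumPy; the data below are illustrative toy values, not the paper's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def recall_at_k(relevance_sorted_by_score, k):
    """Recall@K for one query: fraction of relevant items found in the top K.

    relevance_sorted_by_score: binary relevance labels ordered by descending model score.
    """
    rel = np.asarray(relevance_sorted_by_score)
    total_relevant = rel.sum()
    return rel[:k].sum() / total_relevant if total_relevant else 0.0

# Tagging quality: macro ROC-AUC over per-track tag predictions (toy values).
y_true = np.array([[1, 0, 1], [0, 1, 0]])                # ground-truth tags for 2 tracks
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])   # model scores
print("ROC-AUC:", roc_auc_score(y_true, y_score, average="macro"))

# Retrieval quality: Recall@K for a single query-by-tag ranking.
ranking_relevance = [1, 0, 0, 1, 0, 1]  # relevance of retrieved tracks, best-scored first
print("R@3:", recall_at_k(ranking_relevance, k=3))
```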

Implications and Future Directions

MWE has clear potential for digital music platforms, where it could improve user experience through better recommendation and search. Future work could extend MWE to multilingual datasets and explore integrating other forms of metadata to enrich embedding quality.

By aligning domain-specific text with audio data effectively, MWE sets the stage for more intuitive and context-aware systems in the music information retrieval field, promising exciting developments for both academic research and practical applications in digital music services.

Authors (4)
  1. SeungHeon Doh
  2. Jongpil Lee
  3. Dasaem Jeong
  4. Juhan Nam