Emotion-Aligned Contrastive Learning Between Images and Music (2308.12610v3)
Abstract: Traditional music search engines rely on retrieval methods that match natural language queries with music metadata. There have been increasing efforts to expand retrieval methods to consider the audio characteristics of music itself, using queries of various modalities including text, video, and speech. Most approaches aim to match general music semantics to the input queries, while only a few focus on affective qualities. In this work, we address the task of retrieving emotionally relevant music from image queries by learning an emotion-aligned joint embedding space between images and music audio. This embedding space is learned via emotion-supervised contrastive learning, using an adapted cross-modal version of the SupCon loss. We evaluate the joint embeddings through cross-modal retrieval tasks (image-to-music and music-to-image) based on emotion labels. Furthermore, we investigate the generalizability of the learned music embeddings via automatic music tagging. Our experiments show that the proposed approach successfully aligns images and music, and that the learned embedding space is effective for cross-modal retrieval applications.
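The abstract does not spell out the exact loss formulation, so the following is only a minimal sketch of what a cross-modal, emotion-supervised SupCon-style objective could look like in PyTorch. The function name `cross_modal_supcon_loss`, the temperature value, and the symmetric image-to-music / music-to-image averaging are illustrative assumptions, not the authors' implementation. The key difference from an instance-paired CLIP-style objective is that positives for an image anchor are all music clips in the batch that share its emotion label, rather than a single paired item.

```python
import torch
import torch.nn.functional as F

def cross_modal_supcon_loss(image_emb, music_emb, labels, temperature=0.1):
    """Sketch of a cross-modal supervised contrastive (SupCon-style) loss.

    image_emb: (N, D) image embeddings; music_emb: (N, D) music embeddings.
    labels: (N,) integer emotion labels. Rows i of the two modalities need
    not be paired samples; positives are defined purely by shared labels.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    music_emb = F.normalize(music_emb, dim=-1)

    # Cross-modal similarity matrix, scaled by temperature.
    logits = image_emb @ music_emb.T / temperature            # (N, N)

    # Positive mask: entries where image i and music j share an emotion label.
    pos_mask = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()  # (N, N)

    # Image-to-music direction: log-softmax over all music candidates,
    # averaged over the positives of each image anchor.
    log_prob_i2m = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    i2m = -(pos_mask * log_prob_i2m).sum(1) / pos_mask.sum(1).clamp(min=1)

    # Music-to-image direction (symmetric; pos_mask is symmetric here).
    log_prob_m2i = logits.T - torch.logsumexp(logits.T, dim=1, keepdim=True)
    m2i = -(pos_mask * log_prob_m2i).sum(1) / pos_mask.sum(1).clamp(min=1)

    # Anchors with no in-batch positive contribute zero (guarded by clamp).
    return 0.5 * (i2m.mean() + m2i.mean())
```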
- Shanti Stewart
- Kleanthis Avramidis
- Tiantian Feng
- Shrikanth Narayanan