Joint Music and Language Attention Models for Zero-shot Music Tagging (2310.10159v1)
Abstract: Music tagging is the task of predicting tags for music recordings. However, previous music tagging research has primarily focused on closed-set tagging, which cannot generalize to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder based on Falcon7B. We introduce a Perceiver resampler to convert arbitrary-length audio into fixed-length embeddings, and dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet, and propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions for training the JMLA models. Our proposed JMLA system achieves a zero-shot audio tagging accuracy of $ 64.82\% $ on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and MagnaTagATune datasets.
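The core idea of the Perceiver resampler described above is that a small, fixed set of learned latent queries cross-attends to a variable-length sequence of audio embeddings, so the output length no longer depends on the input duration. The following is a minimal PyTorch sketch of that mechanism, not the authors' implementation; all dimensions and module names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative sketch: learned latent queries cross-attend to
    variable-length audio embeddings, producing a fixed-length output."""

    def __init__(self, dim=64, num_latents=8, num_heads=4):
        super().__init__()
        # Fixed number of learned query vectors, independent of input length.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, audio_emb):
        # audio_emb: (batch, time, dim) with arbitrary time length.
        b = audio_emb.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Latents attend over the full audio sequence (keys/values).
        out, _ = self.cross_attn(q, audio_emb, audio_emb)
        return self.ff(out)  # (batch, num_latents, dim), fixed length
```

Because the output shape is `(batch, num_latents, dim)` regardless of the input's time dimension, the resampled embeddings can be fed to a language-model decoder that expects a fixed-size prefix.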
- “A survey of audio-based music classification and annotation,” IEEE Transactions on Multimedia, vol. 13, no. 2, pp. 303–319, 2010.
- “Convolutional recurrent neural networks for music classification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2392–2396.
- “Automatic tagging using deep convolutional neural networks,” in International Society for Music Information Retrieval (ISMIR), 2016.
- “Evaluation of CNN-based automatic music tagging models,” in Sound and Music Computing Conference (SMC), 2020.
- “Semi-supervised music tagging transformer,” in International Society for Music Information Retrieval (ISMIR), 2021.
- “MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training,” arXiv preprint arXiv:2306.00107, 2023.
- “Open set recognition for music genre classification,” arXiv preprint arXiv:2209.07548, 2022.
- “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023.
- “InstructBLIP: Towards general-purpose vision-language models with instruction tuning,” 2023.
- “Flamingo: a visual language model for few-shot learning,” in Advances in Neural Information Processing Systems, 2022.
- “Pengi: An audio language model for audio tasks,” arXiv preprint arXiv:2305.11834, 2023.
- “MuLan: A joint embedding of music audio and natural language,” in International Society for Music Information Retrieval Conference (ISMIR), 2022.
- OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023.
- “Masked spectrogram modeling using masked autoencoders for learning general-purpose audio representation,” in Proceedings of Machine Learning Research, 2022, vol. 166.
- “Perceiver: General perception with iterative attention,” in International Conference on Machine Learning (ICML), 2021, pp. 4651–4664.
- “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
- “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Association for Computational Linguistics, 2022.
- “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021.
- “Falcon-40B: an open large language model with state-of-the-art performance,” 2023.
- “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
- “FMA: A dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2016.
- “Evaluation of algorithms using games: The case of music tagging,” in International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 387–392.
- “HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 646–650.
- Xingjian Du (25 papers)
- Zhesong Yu (6 papers)
- Jiaju Lin (11 papers)
- Bilei Zhu (11 papers)
- Qiuqiang Kong (86 papers)