Music Foundation Model as Generic Booster for Music Downstream Tasks (2411.01135v3)
Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging these hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across downstream tasks spanning both understanding and generation. We evaluate this approach on representative tasks: music tagging, music transcription, music source separation, and music mixing. Our results show that features extracted from the foundation model provide valuable enhancements when training downstream task models, highlighting their utility as a generic booster. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity, paving the way for more effective and accessible music processing solutions.
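The core idea above, feeding hierarchical intermediate features from a frozen foundation model into a downstream task model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `HierarchicalFeatureExtractor`, its layer dimensions, and the tagging head are hypothetical stand-ins, since SoniDo's architecture and interface are not given here.

```python
import torch
import torch.nn as nn

class HierarchicalFeatureExtractor(nn.Module):
    """Stand-in for a frozen music foundation model exposing coarse-to-fine features."""
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Conv1d(1 if i == 0 else dims[i - 1], d, kernel_size=4, stride=4)
            for i, d in enumerate(dims)
        )

    @torch.no_grad()  # features are extracted with frozen weights
    def forward(self, wav):                      # wav: (batch, 1, samples)
        feats, x = [], wav
        for enc in self.encoders:
            x = enc(x)
            feats.append(x)                      # one feature map per hierarchy level
        return feats

class BoostedTagger(nn.Module):
    """Downstream music-tagging head conditioned on pooled foundation-model features."""
    def __init__(self, extractor, n_tags=50):
        super().__init__()
        self.extractor = extractor               # kept frozen; only the head is trained
        feat_dim = sum(enc.out_channels for enc in extractor.encoders)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_tags)
        )

    def forward(self, wav):
        pooled = [f.mean(dim=-1) for f in self.extractor(wav)]  # time-average each level
        return self.head(torch.cat(pooled, dim=-1))             # tag logits

if __name__ == "__main__":
    wav = torch.randn(2, 1, 16000)               # two 1-second mono clips at 16 kHz
    model = BoostedTagger(HierarchicalFeatureExtractor())
    print(model(wav).shape)                      # torch.Size([2, 50])
```

In practice the extractor would be the pretrained foundation model itself, and the same pooled hierarchical features could condition transcription, source-separation, or mixing models instead of a tagging head.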