
Music Foundation Model as Generic Booster for Music Downstream Tasks (2411.01135v3)

Published 2 Nov 2024 in cs.SD, cs.IR, cs.LG, and eess.AS

Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.


Summary

  • The paper introduces a two-stage architecture, pairing an HQ-VAE with an autoregressive model, to capture both coarse and fine musical features.
  • Injecting the extracted features improves performance on downstream tasks such as music tagging, transcription, source separation, and mixing, with measurable gains in metrics such as F1 score and SDR.
  • SoniDo delivers competitive or superior results relative to comparable models like Jukebox and MusicGen, indicating its potential as a resource-efficient booster for diverse music applications.

An Examination of SoniDo: A Music Foundation Model for Boosting Downstream Tasks

The paper "Music Foundation Model as Generic Booster for Music Downstream Tasks" introduces an approach to enhancing a range of music downstream tasks using a foundation model named SoniDo. The work leverages a single large-scale pre-trained model to improve the performance of task-specific models for music tagging, transcription, source separation, and mixing. Unlike many earlier foundation models, which were developed primarily for language tasks, SoniDo targets music specifically, addressing a gap in the field.

At its core, SoniDo is a two-stage model: an HQ-VAE (hierarchically quantized variational autoencoder) that learns hierarchical discrete representations of music, and an autoregressive model trained on the resulting token sequences. Together, the two stages encode a music input into a combination of coarse and fine features, capturing music information across varying levels of abstraction.
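
To make this two-stage pattern concrete, the PyTorch sketch below pairs a toy hierarchical tokenizer with a small causal transformer whose hidden states serve as extractable features. It is a minimal illustration under stated assumptions: the class names, codebook sizes, strides, and dimensions are invented for this example and do not come from the SoniDo implementation.

```python
import torch
import torch.nn as nn


class HierarchicalEncoder(nn.Module):
    """Toy stand-in for the first stage: encode a waveform into coarse and
    fine discrete token sequences via two quantization levels."""

    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        # Strided 1-D convolutions give two temporal resolutions: a lightly
        # downsampled "fine" stream and a further-downsampled "coarse" stream.
        self.fine_conv = nn.Conv1d(1, dim, kernel_size=8, stride=4, padding=2)
        self.coarse_conv = nn.Conv1d(dim, dim, kernel_size=8, stride=4, padding=2)
        # One codebook per level; quantization is nearest-neighbour lookup.
        self.fine_codebook = nn.Embedding(n_codes, dim)
        self.coarse_codebook = nn.Embedding(n_codes, dim)

    @staticmethod
    def _quantize(z, codebook):
        # z: (batch, dim, time) -> token ids (batch, time) by nearest codeword.
        z = z.permute(0, 2, 1)                                    # (batch, time, dim)
        cb = codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        return torch.cdist(z, cb).argmin(dim=-1)

    def forward(self, audio):
        # audio: (batch, samples) mono waveform.
        fine = self.fine_conv(audio.unsqueeze(1))
        coarse = self.coarse_conv(fine)
        return (self._quantize(coarse, self.coarse_codebook),
                self._quantize(fine, self.fine_codebook))


class TokenPrior(nn.Module):
    """Toy stand-in for the second stage: a causal transformer over the coarse
    token sequence; its hidden states act as extractable features."""

    def __init__(self, n_codes=512, dim=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(n_codes, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, n_codes)

    def forward(self, tokens):
        x = self.embed(tokens)
        t = tokens.size(1)
        # Causal (upper-triangular -inf) attention mask for autoregression.
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.transformer(x, mask=mask)           # causal hidden states
        return self.head(h), h                       # next-token logits, features


# Usage: tokenize two random 1-second clips at 16 kHz, then run the prior.
encoder, prior = HierarchicalEncoder(), TokenPrior()
coarse_ids, fine_ids = encoder(torch.randn(2, 16000))
logits, features = prior(coarse_ids)    # `features` can be reused downstream
```

Consistent with the abstract, it is such intermediate representations, rather than the generative outputs, that are reused to boost downstream task models.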

Key Contributions and Results

  1. Hierarchical Feature Extraction: SoniDo demonstrates a stratified structure where different levels of granularity are captured using hierarchical encoders. This aligns closely with human-perceptible music characteristics, such as rhythm and melody. The hierarchical representation promotes efficient adaptation of these features for a diverse range of tasks.
  2. Evaluation on Multiple Tasks: SoniDo's effectiveness was evaluated across several downstream tasks, demonstrating improvements in music tagging, transcription, source separation, and mixing. The injected features significantly boosted the underlying task-specific models, highlighting their utility as generic performance enhancers; a sketch of this feature-injection pattern is given after this list.
  3. Empirical Performance: The paper reports quantitative gains (e.g., higher F1 scores in transcription and better SDR in source separation) when SoniDo features are injected into task-specific pipelines. The gains were particularly pronounced under data scarcity, suggesting that SoniDo's features provide context and structure that improve predictions when training data is limited.
  4. Comparison with Contemporary Models: Compared with other foundation models such as Jukebox and MusicGen, SoniDo's hierarchical approach delivered competitive or superior performance on many benchmark tasks, suggesting that the hierarchical encapsulation of musical features generalizes beyond the model's original training setup.
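
As referenced in item 2 above, the sketch below shows one simple way frozen foundation-model features could be fused into a downstream model, here a clip-level music tagger. The fusion strategy (projection, time pooling, and concatenation), the dimensions, and the class name are assumptions made for illustration; the paper's actual task-specific architectures differ from task to task.

```python
import torch
import torch.nn as nn


class FeatureBoostedTagger(nn.Module):
    """Clip-level music tagger whose own features are fused with frozen
    features extracted from a pretrained foundation model."""

    def __init__(self, spec_dim=128, fm_dim=64, n_tags=50):
        super().__init__()
        self.task_branch = nn.Sequential(nn.Linear(spec_dim, 128), nn.ReLU())
        # Project the foundation-model features to the same width, then
        # classify from the concatenation of both pooled branches.
        self.fm_proj = nn.Sequential(nn.Linear(fm_dim, 128), nn.ReLU())
        self.classifier = nn.Linear(256, n_tags)

    def forward(self, spec_feats, fm_feats):
        # spec_feats: (batch, t1, spec_dim)  task-specific inputs
        # fm_feats:   (batch, t2, fm_dim)    pretrained features, kept frozen
        task = self.task_branch(spec_feats).mean(dim=1)      # pool over time
        boost = self.fm_proj(fm_feats.detach()).mean(dim=1)  # no grad into the FM
        return self.classifier(torch.cat([task, boost], dim=-1))


# Example with random stand-ins for both feature streams.
spec_feats = torch.randn(4, 200, 128)  # e.g. log-mel frames from the task model
fm_feats = torch.randn(4, 250, 64)     # e.g. hidden states from the pretrained prior
logits = FeatureBoostedTagger()(spec_feats, fm_feats)  # -> (4, 50) tag logits
```

Because the foundation-model branch is detached, only the lightweight task head is trained, which mirrors the resource-efficient "booster" use case described above.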

Implications and Future Directions

The theoretical implications of SoniDo are manifold. The work signals a shift in the design of music processing systems from narrowly focused models toward more universal frameworks that serve a wide array of tasks from a shared set of features. Such a model can greatly simplify transfer learning within music AI, enabling more accessible music processing solutions.

Practically, SoniDo offers a substantial advance for music AI. By reusing a single pre-trained music model, developers can conserve training resources while enhancing the software toolkits used in music production environments. The paper may also encourage further research into hierarchical modeling and motivate improved autoregressive models that span diverse domains, including non-music applications.

Moving forward, it will be important to extend SoniDo to non-Western music and to larger commercial-scale datasets, ensuring that its efficacy holds across diverse musical traditions. Further refinement of its hierarchical feature extraction could also enable more nuanced understanding and generation of music, bringing model outputs closer to human perceptions of musical richness and diversity. With these advancements in view, SoniDo lays the groundwork for a new generation of AI-driven music creation and analysis tools.
