- The paper introduces SonicVerse, integrating auxiliary music feature detection tasks into a captioning model using multi-task learning.
- It employs a projection-based architecture combining a music encoder, multi-task projector, and a large language model to align audio inputs with language tokens.
- Evaluation shows enhanced caption quality, with improved BLEU, ROUGE, and BERTScore results, paving the way for advanced music analysis and recommendation systems.
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
Introduction
The paper "SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning" introduces a novel approach to music captioning, which stands on the foundation of multi-task learning. The authors aim to enhance the descriptiveness and quality of generated music captions by incorporating auxiliary music feature detection tasks directly into the captioning model. This approach distinguishes between "soft features" like mood and "hard features" such as key, instrumentation, and vocals, which are integrated into the captioning process. The core idea is to enrich the captioning input by projecting detected music features into language tokens alongside the audio input, leveraging a LLM for processing.
Conceptual Framework
SonicVerse is built upon a multi-task learning structure where the tasks of music feature detection are processed alongside caption generation. The novelty lies in the projection-based architecture capable of converting audio inputs into language tokens and chaining outputs using an LLM to maintain coherence over longer music durations. The paper highlights the extension of the MusicBench dataset using MIRFLEX, creating paired data of audio, captions, and musical features to train the model.
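A rough sketch of the chaining idea follows, under an assumed chunk length and with hypothetical `caption_chunk` and `llm_complete` callables standing in for the fragment-level captioner and the LLM call; the paper's actual chaining procedure may differ in detail.

```python
# Minimal sketch of temporally informed caption chaining, not the paper's code.
# `caption_chunk` and `llm_complete` are hypothetical callables; the 10-second
# chunk length is an assumption.
from typing import Callable, List, Sequence

CHUNK_SECONDS = 10

def chunk_audio(samples: Sequence[float], sample_rate: int) -> List[Sequence[float]]:
    """Split a waveform into fixed-length chunks of CHUNK_SECONDS."""
    step = CHUNK_SECONDS * sample_rate
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def describe_track(samples: Sequence[float], sample_rate: int,
                   caption_chunk: Callable, llm_complete: Callable) -> str:
    """Caption each chunk, then ask the LLM to fuse them into one description."""
    timed_captions = []
    for idx, chunk in enumerate(chunk_audio(samples, sample_rate)):
        start = idx * CHUNK_SECONDS
        timed_captions.append(f"[{start}s-{start + CHUNK_SECONDS}s] {caption_chunk(chunk)}")
    prompt = ("Combine these time-stamped excerpt captions into one coherent "
              "description of how the piece evolves over time:\n"
              + "\n".join(timed_captions))
    return llm_complete(prompt)
```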
Architecture and Methodology
The model architecture consists of three main components: a music encoder (MERT), a multi-task projector, and a pre-trained LLM (Mistral-7B). The MERT encoder is chosen for its ability to preserve feature information efficiently and for its strong performance across various music information retrieval tasks. The multi-task projector has dual pathways, one for music content and another for discrete musical attributes, with each task head predicting a specific musical feature essential for comprehensive caption generation. Through training, the model aligns these features with the embedding space of the LLM, ensuring that the generated tokens accurately represent both the music content and the detected attributes.
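The following PyTorch sketch illustrates the dual-pathway projector described above, under assumed dimensions (e.g., a 1024-dimensional encoder output and Mistral-7B's 4096-dimensional embedding space) and an assumed set of task heads; it is not the released SonicVerse implementation.

```python
# Sketch of the dual-pathway multi-task projector, under assumed dimensions
# and task heads; this is not the released SonicVerse code.
import torch
import torch.nn as nn

class MultiTaskProjector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, n_content_tokens=8, task_classes=None):
        super().__init__()
        # Assumed auxiliary tasks and class counts (key, vocals presence, instruments)
        task_classes = task_classes or {"key": 24, "vocals": 2, "instruments": 40}
        self.n_content_tokens, self.llm_dim = n_content_tokens, llm_dim
        # Pathway 1: project pooled encoder features to tokens in the LLM embedding space
        self.content_proj = nn.Linear(enc_dim, llm_dim * n_content_tokens)
        # Pathway 2: one classification head per auxiliary music feature task
        self.task_heads = nn.ModuleDict(
            {name: nn.Linear(enc_dim, n) for name, n in task_classes.items()})
        # Each task's prediction is also embedded as one attribute token
        self.attr_proj = nn.ModuleDict(
            {name: nn.Linear(n, llm_dim) for name, n in task_classes.items()})

    def forward(self, enc_frames: torch.Tensor):
        """enc_frames: (batch, frames, enc_dim) output of a MERT-like encoder."""
        pooled = enc_frames.mean(dim=1)                                   # (batch, enc_dim)
        content = self.content_proj(pooled).view(-1, self.n_content_tokens, self.llm_dim)
        logits = {name: head(pooled) for name, head in self.task_heads.items()}
        attr_tokens = torch.stack(
            [self.attr_proj[name](logits[name].softmax(dim=-1)) for name in logits], dim=1)
        # Content + attribute tokens are prepended to the caption prompt embeddings
        # fed to the LLM; the logits carry the auxiliary multi-task losses.
        return torch.cat([content, attr_tokens], dim=1), logits
```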
Evaluation and Results
The paper evaluates the SonicVerse model using NLP metrics such as BLEU, ROUGE, and BERTScore, demonstrating improvements over baselines and competitive performance against state-of-the-art models trained on open data. Notably, SonicVerse shows particular strength in generating captions rich in musical features such as key and instrumentation compared to other models such as QWEN2-Audio and BLAP. An illustrative result is presented for "Bohemian Rhapsody," where SonicVerse excels at temporally informed caption chaining, describing the piece's evolution over time, a significant step forward for music AI.
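For readers who want to reproduce this style of evaluation, the snippet below scores a candidate caption against a reference with the three metric families named above using the Hugging Face `evaluate` package; the tooling choice and the placeholder captions are assumptions, not details from the paper.

```python
# Minimal sketch of caption scoring with BLEU, ROUGE, and BERTScore via the
# Hugging Face `evaluate` package (an assumed tooling choice, not necessarily
# what the paper used). The captions below are placeholder strings.
import evaluate

predictions = ["A dramatic rock piece in B-flat with piano, bass and layered vocals."]
references  = [["An operatic rock song featuring piano, bass guitar, drums and vocals."]]

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=[r[0] for r in references])
bertscore = evaluate.load("bertscore").compute(predictions=predictions,
                                               references=[r[0] for r in references],
                                               lang="en")

print(f"BLEU:      {bleu['bleu']:.3f}")
print(f"ROUGE-L:   {rouge['rougeL']:.3f}")
print(f"BERTScore: {sum(bertscore['f1']) / len(bertscore['f1']):.3f}")
```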
Implications and Future Direction
The integration of concrete musical features into descriptive natural language offers substantial theoretical and practical implications. Theoretically, SonicVerse could transform approaches in music information retrieval and text-to-music generation by including comprehensive musical attributes in datasets. Practically, improvements in caption quality and detail offer new opportunities for music databases, opening advanced avenues for automatic music analysis and recommendation systems.
Looking forward, the SonicVerse framework can evolve toward broader applications, potentially integrating with other domains such as video and multimedia content where audio feature-informed description is beneficial. The release of the SonicVerse model as open source promises to encourage further exploration and development in multimodal AI applications, providing a foundation for future innovations in music captioning and beyond.
Conclusion
SonicVerse establishes a robust framework for music captioning that exemplifies how detailed feature extraction tasks enrich music description capabilities. By harmonizing music feature predictions with the natural language processing pipeline, it sets the stage for more nuanced, feature-informed description mechanisms in music AI research. The results affirm the effectiveness of multi-task learning in music captioning, paving paths for future research in AI-generated music narratives. With advances in chaining techniques and pre-trained LLMs, SonicVerse heralds a future where music description becomes seamlessly integrated into the wider AI applications ecosystem.