MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models (2408.01337v1)

Published 2 Aug 2024 in cs.SD, cs.CL, cs.LG, cs.MM, and eess.AS

Abstract: Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal LLMs focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.

Authors (6)
  1. Benno Weck (9 papers)
  2. Ilaria Manco (8 papers)
  3. Emmanouil Benetos (89 papers)
  4. Elio Quinton (15 papers)
  5. George Fazekas (28 papers)
  6. Dmitry Bogdanov (18 papers)
Citations (7)

Summary

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

The paper "MuChoMusic: Evaluating Music Understanding in Multimodal Audio-LLMs" introduces a novel benchmark designed to rigorously evaluate the music understanding capabilities of Audio-LLMs. The proliferation of multimodal models capable of processing both audio and language inputs has significantly advanced audio understanding, but their evaluation, particularly in the music domain, remains challenging.

MuChoMusic addresses this challenge with a benchmark of 1,187 human-validated multiple-choice questions on 644 music tracks drawn from two publicly available datasets. The benchmark evaluates models across several music understanding dimensions, spanning fundamental musical concepts and their cultural and functional contexts. The key contributions, methodology, and implications of this benchmark are discussed below.

Key Contributions

  1. Comprehensive Benchmark: MuChoMusic provides a structured and human-validated benchmark to evaluate the music understanding capabilities of Audio LLMs. It comprises multiple-choice questions that cover a wide range of musical genres.
  2. Diverse Coverage: The questions assess knowledge and reasoning across multiple dimensions such as music theory, historical and cultural contexts, and expressive analysis.
  3. Open Source Data: The dataset and code for MuChoMusic are open-sourced, promoting transparency and enabling further research in the field.

Methodology

Data Sources and Generation:

MuChoMusic leverages music captions from MusicCaps and the Song Describer Dataset (SDD), both of which provide detailed descriptions of music tracks. Multiple-choice questions are generated from these captions by prompting a state-of-the-art LLM, with instructions designed to ensure diverse and comprehensive coverage of the music understanding dimensions.
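As a rough illustration of this generation step, the sketch below turns a single caption into a multiple-choice question with one correct answer and three distractors. The prompt wording and the `call_llm` helper are hypothetical placeholders, not the authors' actual prompt or pipeline.

```python
import json

def build_mcq_prompt(caption: str) -> str:
    # Hypothetical instruction; the paper's actual prompt to the LLM differs.
    return (
        "You are given a description of a music track.\n"
        f"Description: {caption}\n"
        "Write one multiple-choice question about the music, with exactly one "
        "correct answer and three plausible but incorrect distractors.\n"
        'Return JSON: {"question": ..., "correct": ..., "distractors": [...]}'
    )

def generate_mcq(caption: str, call_llm) -> dict:
    """Generate a question dict from a caption using any text-in/text-out LLM callable."""
    raw = call_llm(build_mcq_prompt(caption))
    return json.loads(raw)  # assumes the model returned valid JSON

# Usage with a stub LLM (swap in a real API call in practice):
stub = lambda prompt: json.dumps({
    "question": "Which instrument carries the main melody?",
    "correct": "Acoustic guitar",
    "distractors": ["Trumpet", "Synth pad", "Violin section"],
})
print(generate_mcq("A gentle folk tune led by fingerpicked acoustic guitar.", stub))
```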

Evaluation Dimensions:

The benchmark evaluates models on two primary categories: knowledge and reasoning. Knowledge-related questions probe models' capabilities to recognize aspects such as melody, instrumentation, and structure. Reasoning questions assess models’ abilities to analyze and interpret higher-level concepts like mood, genre, and historical context.

Validation and Categorization:

The generated questions are validated by human annotators to ensure accuracy and relevance. This validation process ensures that the benchmark questions are challenging and reliable indicators of a model’s music understanding capabilities. The questions are also automatically categorized according to the defined taxonomy, which includes dimensions of knowledge and reasoning.

Experimental Results

The evaluation of five state-of-the-art open-source Audio LLMs (MusiLingo, MuLLaMa, M2UGen, SALMONN, and Qwen-Audio) on MuChoMusic reveals generally weak performance. Qwen-Audio performs best with an accuracy of 51.4%, while the other models show clear limitations; notably, the music-specific models perform worse than some general-audio models. Accuracy is strongly tied to instruction-following behavior: answering incorrectly or failing to select any of the offered options are common failure modes, often attributable to auditory and language hallucinations and to pre-training biases.
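To make the two reported quantities concrete, here is one way to score model responses: a response counts toward the instruction-following rate only if exactly one of the offered options can be identified in it, and accuracy is computed over all questions, with unanswered ones counted as incorrect. The answer-parsing heuristic below is an assumption, not the paper's exact procedure.

```python
import re

def parse_choice(response: str, options: list[str]) -> str | None:
    """Return the selected option letter (e.g. 'A'), or None if no single valid option is found."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    hits = [l for l in letters if re.search(rf"\b{l}\b", response)]
    return hits[0] if len(hits) == 1 else None

def score(results: list[dict]) -> dict:
    """Each result holds 'response', 'options', and 'correct_letter'."""
    parsed = [parse_choice(r["response"], r["options"]) for r in results]
    n = len(results)
    followed = sum(p is not None for p in parsed)
    correct = sum(p == r["correct_letter"] for p, r in zip(parsed, results))
    return {
        "instruction_following_rate": followed / n,
        "accuracy": correct / n,  # unanswered questions count as incorrect
    }
```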

Analysis of Results

Sensitivity to Prompts

The paper assesses the impact of in-context learning (ICL) by varying the number of in-context examples. Findings suggest a marginal improvement when using one-shot prompts for some models, but no consistent enhancement beyond one-shot settings. This indicates that while ICL reduces variability in model outputs, it does not substantially improve accuracy in this benchmarking scenario.
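The in-context-learning setup amounts to prepending solved examples to the prompt before the test question; the formatting below is an illustrative guess at such a template, not the one used in the paper.

```python
def format_question(question: str, options: list[str]) -> str:
    letters = "ABCDEFG"
    body = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(body) + "\nAnswer with a single letter."

def build_prompt(question: str, options: list[str], examples=()) -> str:
    """examples: (question, options, correct_letter) triples shown before the test item."""
    shots = [format_question(q, o) + f"\nAnswer: {a}" for q, o, a in examples]
    return "\n\n".join(shots + [format_question(question, options)])

# Zero-shot: build_prompt(q, opts); one-shot: build_prompt(q, opts, examples=[(q1, opts1, "B")])
```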

Distractor Analysis

By ablating different types of distractors, the paper identifies that "incorrect but related" (IR) distractors pose the greatest challenge, indicating that models rely on textual rather than audio cues. This supports the hypothesis of a significant language bias in these models.
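A distractor ablation of this kind can be reproduced by re-scoring every question with one distractor category removed and comparing the resulting accuracies. In the sketch below, only the "IR" label comes from the paper; the data layout and the `evaluate` callable are placeholders.

```python
def ablate_distractors(questions: list[dict], evaluate, drop_type: str) -> float:
    """Re-evaluate the benchmark with one distractor type removed from every question.

    Each question is assumed to store 'distractors' as (text, type) pairs,
    where 'IR' marks an incorrect-but-related distractor.
    """
    reduced = [
        {**q, "distractors": [(text, t) for text, t in q["distractors"] if t != drop_type]}
        for q in questions
    ]
    return evaluate(reduced)  # accuracy on the reduced benchmark

# Comparing evaluate(questions) against ablate_distractors(questions, evaluate, "IR")
# shows how much models lean on textual relatedness rather than the audio itself.
```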

Audio Attention Test

The audio attention test, where audio inputs are replaced with unrelated sounds or noise, confirms that the majority of the models do not attend effectively to the audio content. This suggests that the language prior strongly influences their outputs, undermining the expected multimodal integration.
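The audio attention test can be approximated by re-running the benchmark with the original waveform replaced by white noise (or an unrelated clip) of the same length; the NumPy sketch below generates such a stand-in, leaving the model call itself abstract.

```python
import numpy as np

def noise_like(waveform: np.ndarray, seed: int = 0) -> np.ndarray:
    """White noise with the same shape (and rough amplitude) as the input audio."""
    rng = np.random.default_rng(seed)
    scale = float(np.abs(waveform).mean()) or 1.0
    return rng.uniform(-scale, scale, size=waveform.shape).astype(waveform.dtype)

# If accuracy barely drops when noise_like(audio) replaces the real audio,
# the model is answering from its language prior rather than from listening.
```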

Implications and Future Directions

Practical Implications:

MuChoMusic sets a new standard for evaluating music understanding in Audio LLMs, providing a reliable and accessible tool for researchers and developers. The insights gained from this benchmark can guide the refinement of model architectures and training protocols, emphasizing the need for better multimodal integration.

Theoretical Implications:

The findings stress the necessity of resolving the over-reliance on language modalities in multimodal models. Future research must focus on improving the audio processing capabilities of these models to enhance their ability to reason and understand complex musical inputs.

Speculations on Future Developments:

Continued development of benchmarks similar to MuChoMusic, with iterative improvements based on community feedback, could lead to more sophisticated evaluation methods. Integrating multimodal few-shot prompting and designing novel architectures that specifically target the fusion of audio and text representations could help bridge the current performance gaps.

In conclusion, MuChoMusic offers a significant step forward in standardizing and challenging the evaluation of music understanding in multimodal models, paving the way for future advancements in the field.