
Evaluating Responses to Complex Open-Ended Music Reasoning Questions

Develop rigorous methodologies and benchmarks to evaluate the quality of responses produced by multimodal audio–text language models to complex, open-ended musical reasoning questions.


Background

The authors observe that straightforward side-by-side comparisons of model outputs are unreliable for evaluating reasoning capabilities, especially because models can produce hallucinated or generic responses that are not grounded in the audio. Non-expert human raters may be misled by such outputs, which further complicates evaluation.

They design alternative experiments (audio-to-text matching and GPT-4 judgments of musical detail) to mitigate these problems, but emphasize that establishing robust, widely accepted methodologies and benchmarks for complex, open-ended music reasoning remains an unresolved challenge.
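As a concrete illustration of the second mitigation, the sketch below shows a minimal LLM-as-judge rubric for scoring how much audio-grounded musical detail a response contains, compared against reference metadata. It assumes an OpenAI-style chat API; the judge model name, prompt wording, and 1–5 rubric are illustrative assumptions rather than the exact protocol used in the LLark paper.

```python
# Minimal sketch of an LLM-as-judge rubric for musical detail.
# Assumption: an OpenAI-style chat API; the prompt, rubric, and model
# name are illustrative, not the LLark evaluation protocol itself.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """\
You are grading an answer to an open-ended question about a piece of music.
Reference metadata (ground truth): {metadata}
Question: {question}
Model answer: {answer}

Rate the answer on a 1-5 scale for how much specific, audio-grounded musical
detail it contains (instrumentation, tempo, key, structure), penalizing
generic or hallucinated claims. Reply with only the integer score."""


def judge_musical_detail(metadata: str, question: str, answer: str) -> int:
    """Ask a judge model to score one response; returns an integer 1-5."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                metadata=metadata, question=question, answer=answer
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    score = judge_musical_detail(
        metadata="Solo piano, ~70 BPM, A minor, ABA form",
        question="Why might this piece feel melancholic?",
        answer="The slow tempo, minor key, and sparse solo-piano texture ...",
    )
    print("judge score:", score)
```

Rubric scores from a single judge model are themselves only a proxy, which is part of why robust, widely accepted methodologies for this setting remain an open problem.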

References

Evaluating the quality of a model's responses to complex, open-ended questions is an open and unresolved research challenge.

LLark: A Multimodal Instruction-Following Language Model for Music (2310.07160 - Gardner et al., 2023) in Section 6.4 (Reasoning Tasks)