Evaluation of LLMs on Open-Ended Music Tasks

Develop standardized, reliable evaluation methodologies for large language models on open-ended music tasks, including automatic music captioning and musical reasoning, that enable cross-model comparison without requiring access to model logits or shared tokenization schemes.

Background

The paper notes that assessing LLMs on open-ended tasks such as music captioning and musical reasoning is challenging and remains unresolved. In the authors' setting, many baseline models do not expose logits and use incompatible tokenization schemes, making likelihood-based metrics (e.g., perplexity) infeasible for cross-model comparison.
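For concreteness, perplexity is the exponential of the mean per-token negative log-likelihood, so computing it requires both the model's logits and the model's own tokenization of the text. The following minimal sketch uses Hugging Face transformers; "gpt2" is a placeholder model chosen only for illustration, not one of the paper's baselines.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM whose logits are accessible
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    caption = "A slow 12-bar blues in E with prominent slide guitar."
    inputs = tokenizer(caption, return_tensors="pt")

    with torch.no_grad():
        # Passing labels makes the model return the mean per-token negative
        # log-likelihood, computed from its own logits over its own vocabulary.
        outputs = model(**inputs, labels=inputs["input_ids"])

    perplexity = torch.exp(outputs.loss)
    print(f"perplexity under {model_name}: {perplexity.item():.2f}")

Because each model scores a different token sequence over a different vocabulary, such numbers are not directly comparable across models, and they cannot be computed at all for models that do not expose logits.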

As a result, the authors resort to human evaluation and complementary analyses (e.g., GPT-4 judging musical detail), highlighting the need for principled, generalizable, and reproducible evaluation frameworks tailored to open-ended multimodal music tasks. Establishing such methodologies would facilitate fair comparison and progress tracking across diverse models and datasets.
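As one illustration of such a complementary analysis, the sketch below shows an LLM-as-judge comparison of two captions for the same clip using the official OpenAI Python client. The prompt wording, scoring criteria, and the judge_captions helper are illustrative assumptions, not the paper's exact protocol.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = """You are comparing two captions for the same music clip.
    Reference metadata: {metadata}

    Caption A: {caption_a}
    Caption B: {caption_b}

    Which caption contains more specific and correct musical detail
    (instrumentation, tempo, key, genre, structure)? Answer "A", "B", or "Tie"."""

    def judge_captions(metadata: str, caption_a: str, caption_b: str) -> str:
        """Ask the judge model which caption has more musical detail."""
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # deterministic judging aids reproducibility
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    metadata=metadata, caption_a=caption_a, caption_b=caption_b
                ),
            }],
        )
        return response.choices[0].message.content.strip()

    verdict = judge_captions(
        metadata="tempo: 92 BPM, key: E minor, tags: blues, slide guitar",
        caption_a="An upbeat pop song with synths.",
        caption_b="A slow blues in E minor featuring slide guitar around 90 BPM.",
    )
    print(verdict)  # likely "B", since it matches the reference metadata

Judge-based scores of this kind still require human validation and a standardized protocol to be comparable across studies, which is the gap the problem statement above asks to close.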

References

Evaluating LLMs for open-ended tasks, such as captioning and reasoning, is an open research problem.

LLark: A Multimodal Instruction-Following Language Model for Music (Gardner et al., 2023, arXiv:2310.07160), Section 6.3 (Music Captioning Tasks)