Evaluation of LLMs on Open-Ended Music Tasks
Develop standardized, reliable evaluation methodologies for large language models on open-ended music tasks such as automatic music captioning and musical reasoning. These methodologies should enable cross-model comparison while treating each model as a black box, without requiring access to model logits or a shared tokenization scheme.
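As a concrete illustration of what a logit-free protocol could look like, the sketch below evaluates captioning models purely from their output text, scoring candidates against reference captions with embedding similarity. This is a minimal sketch, not the evaluation used in the LLark paper: the dataset entries, the `generate_caption` interface, and the choice of the `all-MiniLM-L6-v2` sentence encoder are all assumptions made for the example.

```python
"""Minimal sketch of a black-box caption-evaluation harness (illustrative only;
not the protocol from the LLark paper). Each model under test is an opaque
text generator, so no logits or shared tokenizer are required. Dataset fields,
model names, and the embedding-similarity scoring choice are assumptions."""
from typing import Callable, Dict, List

from sentence_transformers import SentenceTransformer, util

# Hypothetical eval set: (audio identifier, reference caption) pairs.
EVAL_SET: List[Dict[str, str]] = [
    {"audio_id": "clip_001",
     "reference": "A slow acoustic ballad with fingerpicked guitar."},
    {"audio_id": "clip_002",
     "reference": "Upbeat electronic dance track with a four-on-the-floor kick."},
]

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works


def caption_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between candidate and reference caption embeddings."""
    emb = _embedder.encode([candidate, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))


def evaluate_model(generate_caption: Callable[[str], str]) -> float:
    """Average similarity over the eval set for one black-box captioning model.

    `generate_caption` maps an audio_id to the model's caption text; it can
    wrap any LLM regardless of its tokenizer, since only output strings are
    compared.
    """
    scores = [
        caption_similarity(generate_caption(item["audio_id"]), item["reference"])
        for item in EVAL_SET
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Dummy "model" returning a fixed caption, for demonstration only.
    dummy = lambda audio_id: "An acoustic guitar song with a relaxed tempo."
    print(f"mean caption similarity: {evaluate_model(dummy):.3f}")
```

The same harness shape would accommodate other logit-free scorers, such as an LLM-as-judge comparison or human ratings, by swapping out `caption_similarity`; the key property is that models are compared only through their generated text.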
References
Evaluating LLMs for open-ended tasks, such as captioning and reasoning, is an open research problem.
— LLark: A Multimodal Instruction-Following Language Model for Music
(2310.07160 - Gardner et al., 2023) in Section 6.3 (Music Captioning Tasks)