Evaluating Quality of Text-to-Audio Generation Models
Establish rigorous methods and metrics, aligned with human perception, for evaluating the generation quality of text-to-audio models that synthesize non-speech, non-music audio from textual descriptions.
References
Although there are established metrics available, evaluating the generation quality of these models is still an open research question.
— PAM: Prompting Audio-Language Models for Audio Quality Assessment
(2402.00282 - Deshmukh et al., 1 Feb 2024) in Section 5.1 (Text-to-Audio generation)