Evaluating Quality of Text-to-Audio Generation Models

Establish rigorous methods and metrics for evaluating the generation quality of text-to-audio models that synthesize non-speech, non-music audio from textual descriptions, in a way that aligns with human perception.

Background

Text-to-Audio generation systems (e.g., AudioLDM, AudioGen, MelDiffusion) are commonly evaluated with objective metrics such as Fréchet Distance (FD), Fréchet Audio Distance (FAD), Inception Score (IS), and Kullback–Leibler (KL) divergence. These metrics compare generated outputs to reference distributions but may not fully capture perceptual quality.
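
To make the distribution-comparison idea concrete, the following is a minimal sketch of two of the metrics named above, KL divergence and Inception Score, computed from class-probability vectors. It assumes the probabilities have already been produced by some pretrained audio classifier; that classifier, the function names, and the `eps` smoothing constant are illustrative assumptions, not part of any particular evaluation toolkit.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small smoothing constant (an assumption of this sketch) that
    avoids log(0) when q has zero-probability classes.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def inception_score(class_probs):
    """exp of the mean KL(p(y|x) || p(y)) over all generated samples.

    class_probs: one probability vector per generated clip, assumed to come
    from a pretrained classifier (hypothetical here).
    """
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal class distribution p(y), averaged over samples.
    marginal = [sum(row[j] for row in class_probs) / n for j in range(k)]
    mean_kl = sum(kl_divergence(row, marginal) for row in class_probs) / n
    return math.exp(mean_kl)
```

Both quantities depend only on classifier outputs, which is one reason they may diverge from what listeners actually perceive: a clip can be confidently classified and still sound degraded.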

The authors of PAM (Deshmukh et al., 2024) run human listening tests that rate overall quality (OVL) and relevance to the text prompt (REL), and compare PAM against the established metrics. Although these metrics exist, the authors explicitly state that how to evaluate generation quality remains an open research question, which underscores the need for more perceptually grounded evaluation protocols.
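
One common way to check whether a metric aligns with human perception is to correlate its scores with listening-test ratings. The sketch below computes a Spearman rank correlation in plain Python; the choice of Spearman, the function names, and the toy numbers are assumptions for illustration only and are not taken from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values, one entry per system (not real results).
human_ovl = [3.1, 3.4, 3.9, 4.2]      # mean listener rating, higher is better
metric_score = [0.42, 0.55, 0.61, 0.70]  # some automatic quality score
rho = spearman(human_ovl, metric_score)
```

For distance-style metrics such as FD or FAD, lower values indicate better quality, so agreement with human ratings would show up as a strongly negative correlation rather than a positive one.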

References

Although there are established metrics available, evaluating the generation quality of these models is still an open research question.

Source: PAM: Prompting Audio-Language Models for Audio Quality Assessment (Deshmukh et al., 2024; arXiv:2402.00282), Section 5.1 (Text-to-Audio generation)
