Determine the acoustic diversity of outputs from generative spoken language models
Determine how diverse the outputs of generative spoken language models are across recording conditions, prosody, styles, and accents. Prevailing evaluation metrics focus on faithfulness to inputs and do not capture variability along these facets.
References
Thus it remains unknown how diverse their outputs are, since those metrics neither capture variability nor take into account factors of the perceived speech diversity (i.e., recording conditions, prosody, styles, and accents).
— MAD Speech: Measures of Acoustic Diversity of Speech
(2404.10419 - Futeral et al., 2024) in Section 1 (Introduction)