Determine the acoustic diversity of outputs from generative spoken language models

Determine how diverse the outputs of generative spoken language models are across recording conditions, prosody, styles, and accents. Prevailing evaluation metrics focus on faithfulness to inputs and therefore do not capture variability along these facets.

Background

The paper highlights that recent generative spoken language models can produce speech with varied voices, prosody, and recording conditions, yet prevailing evaluation focuses on faithfulness to inputs (speaker identity and transcript). As a result, the variability of generated outputs goes insufficiently measured.

The authors propose MAD Speech, a suite of diversity metrics, precisely because the diversity of outputs is not captured by existing evaluation practices. The problem motivates their creation of per-facet embedding models and aggregation functions to quantify diversity along specific dimensions such as recording conditions, prosody, styles, and accents.
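To make the idea of an aggregation function concrete, here is a minimal sketch of one plausible choice: mean pairwise cosine dissimilarity over a set of per-facet embeddings. This is an illustrative assumption, not the paper's exact formulation; the facet-specific encoder that would produce real embeddings is also assumed, so plain float lists stand in for its outputs.

```python
# Hedged sketch: aggregate per-facet embeddings into one diversity score
# via mean pairwise cosine dissimilarity. Higher = more diverse set.
import math
from itertools import combinations

def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def mean_pairwise_dissimilarity(embeddings):
    """Average 1 - cos(e_i, e_j) over all unordered pairs of embeddings.
    A set of identical embeddings scores 0; spread-out embeddings score higher."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_dissimilarity(a, b) for a, b in pairs) / len(pairs)

# Toy stand-ins for per-facet embeddings (e.g., a prosody encoder's outputs).
identical = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(mean_pairwise_dissimilarity(identical))  # 0.0 (no diversity)
print(mean_pairwise_dissimilarity(mixed))
```

One aggregation per facet-specific embedding space (recording conditions, prosody, style, accent) would then yield a per-facet diversity profile rather than a single score.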

Thus it remains unknown how diverse these models' outputs are: existing metrics neither capture variability nor account for the facets of perceived speech diversity (recording conditions, prosody, styles, and accents).

References

MAD Speech: Measures of Acoustic Diversity of Speech (2404.10419, Futeral et al., 2024), Section 1 (Introduction)