Ascertain whether SoundStorm’s non-autoregressive semantic-to-acoustic token conversion harms acoustic diversity

Ascertain whether replacing the autoregressive semantic-to-acoustic token conversion in AudioLM with SoundStorm’s non-autoregressive Transformer reduces the acoustic diversity of the generated audio.

Background

SoundStorm was introduced as a non-autoregressive Transformer for converting semantic tokens to acoustic tokens within the AudioLM framework, significantly accelerating inference.

While this architectural change improves speed, the authors explicitly note that its effect on acoustic diversity is unclear. This motivates an empirical comparison across facets of diversity to test whether the change degrades any of them.

References

However, it is unclear whether this change harms acoustic diversity of the generated audio.

MAD Speech: Measures of Acoustic Diversity of Speech (2404.10419 - Futeral et al., 2024), Section 7.1 (Semantic-to-Acoustic Token Conversion: SoundStorm vs. AudioLM)