
Measuring Audio Prompt Adherence with Distribution-based Embedding Distances

Published 31 Mar 2024 in cs.SD and eess.AS (arXiv:2404.00775v4)

Abstract: An increasing number of generative music models can be conditioned on an audio prompt that serves as the musical context for which the model is to create an accompaniment (often further specified by a text prompt). Evaluation of how well model outputs adhere to the audio prompt is typically done in a model- or problem-specific manner, presumably because no generic evaluation method for audio prompt adherence has emerged. Such a method could be useful both in the development and training of new models and for making performance comparable across models. In this paper we investigate whether commonly used distribution-based distances, such as Fréchet Audio Distance (FAD), can be used to measure audio prompt adherence. We propose a simple procedure based on a small number of constituents (an embedding model, a projection, an embedding distance, and a data fusion method), which we systematically assess using a baseline validation. In a follow-up experiment, we test the sensitivity of the proposed audio adherence measure to pitch and time-shift perturbations. The results show that the proposed measure is sensitive to such perturbations, even when the reference and candidate distributions come from different music collections. Although more experimentation is needed to answer open questions, such as the robustness of the measure to acoustic artifacts that do not affect audio prompt adherence, the current results suggest that distribution-based embedding distances provide a viable way of measuring audio prompt adherence. A Python/PyTorch implementation of the proposed measure is publicly available in a GitHub repository.
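The general recipe sketched in the abstract (embed each example, fuse the prompt and output embeddings, then compare the reference and candidate distributions with a Fréchet-style distance) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the `frechet_distance` helper, the synthetic "embeddings", and the simple concatenation used as the fusion step are assumptions chosen for illustration; a real setup would use a pretrained audio embedding model and the authors' published code.

```python
import numpy as np

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between two embedding sets (rows = examples),
    modelling each set as a multivariate Gaussian."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    diff = mu_a - mu_b
    # Tr((Sigma_a Sigma_b)^{1/2}) via the eigenvalues of Sigma_a Sigma_b,
    # which are real and non-negative for PSD covariances (clipped to
    # guard against small negative values from numerical error).
    eigs = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigs.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Toy stand-ins for audio embeddings: outputs that "adhere" to their
# prompts are simulated as the prompt plus small noise.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(512, 8))
outputs = prompts + 0.1 * rng.normal(size=(512, 8))
ref = np.concatenate([prompts, outputs], axis=1)  # fused (prompt, output) pairs

# Candidate A: another batch of adherent pairs.
p2 = rng.normal(size=(512, 8))
o2 = p2 + 0.1 * rng.normal(size=(512, 8))
good = np.concatenate([p2, o2], axis=1)

# Candidate B: identical marginals, but the prompt/output pairing is
# destroyed by shuffling, so only the fused distance can tell it apart.
bad = np.concatenate([p2, rng.permutation(o2)], axis=1)

print(f"adherent: {frechet_distance(ref, good):.3f}")
print(f"shuffled: {frechet_distance(ref, bad):.3f}")
```

The shuffled candidate has the same per-stream statistics as the adherent one; it differs only in the prompt-output correlation, which the fusion step exposes to the distribution distance. This mirrors the baseline-validation idea of checking that the measure penalizes mismatched prompt/output pairs.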

