
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation (2410.17589v1)

Published 23 Oct 2024 in cs.SD, cs.AI, cs.LG, cs.MM, and eess.AS

Abstract: Despite significant advancements in neural text-to-audio generation, challenges persist in controllability and evaluation. This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024. We present an evaluation protocol combining an objective metric, namely Fréchet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation. Our analysis reveals varying performance across sound categories and model architectures, with larger models generally excelling but innovative lightweight approaches also showing promise. The strong correlation between objective metrics and human ratings validates our evaluation approach. We discuss outcomes in terms of audio quality, controllability, and architectural considerations for text-to-audio synthesizers, providing direction for future research.
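
For context, the objective metric named in the abstract, Fréchet Audio Distance (FAD), is the Fréchet distance between two multivariate Gaussians fit to embeddings of reference and generated audio. Below is a minimal sketch of that computation; the embedding extraction step, the function name, and the array shapes are illustrative assumptions, not details taken from the paper (the challenge's exact FAD configuration, including the embedding model, is described in the paper itself).

```python
# Sketch: Fréchet Audio Distance between two sets of audio embeddings.
# Assumes embeddings (shape (N, D)) have already been extracted with an
# audio encoder such as VGGish or PANNs -- an assumption, not the
# paper's confirmed setup.
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD = ||mu_r - mu_g||^2 + tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; keep only the real
    # part, since numerical error can introduce tiny imaginary terms.
    covmean = linalg.sqrtm(sigma_r @ sigma_g).real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

A lower FAD indicates that the generated audio's embedding statistics are closer to the reference set's; in practice a small ridge term is sometimes added to the covariances when the sample count is small relative to the embedding dimension.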

