Sound Scene Synthesis at the DCASE 2024 Challenge (2501.08587v1)

Published 15 Jan 2025 in cs.AI, cs.SD, and eess.AS

Abstract: This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis. Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content. We introduce a standardized evaluation framework for comparing different sound scene synthesis systems, incorporating both objective and subjective metrics. The challenge attracted four submissions, which are evaluated using the Fréchet Audio Distance (FAD) and human perceptual ratings. Our analysis reveals significant insights into the current capabilities and limitations of sound scene synthesis systems, while also highlighting areas for future improvement in this rapidly evolving field.

Summary

  • The paper introduces a novel text-to-audio synthesis task that generates realistic 4-second environmental sound scenes from detailed textual prompts.
  • It uses AudioLDM as the baseline system alongside a curated dataset of 310 audio-caption pairs whose captions distinguish foreground from background sounds.
  • Experimental results reveal a 36% performance gap compared to reference audio, raising questions about the robustness of current perceptual metrics like FAD.

Sound Scene Synthesis at the DCASE 2024 Challenge

The paper "Sound Scene Synthesis at the DCASE 2024 Challenge" provides an in-depth exploration of Task 7, a text-to-audio generation task, within the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. The central focus of this task is the realistic synthesis of environmental audio based on textual descriptions, leveraging recent advances in sound synthesis and generative models.

Challenge Overview and Task Definition

Task 7 requires systems to generate realistic 4-second audio clips at a 32 kHz sampling rate, matching the environmental context specified in each text prompt. A key distinction from previous iterations is a text-based prompting structure that describes detailed foreground-background sound relationships, while music and intelligible speech are disallowed in the output. This requirement favors genuine generative synthesis over retrieval techniques.
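As a concrete illustration of this output specification, the sketch below checks that a generated clip has the expected length and sampling rate; the file name, tolerance, and use of the soundfile library are assumptions for illustration, not part of the official challenge tooling.

```python
# Minimal sketch: verify that a generated clip matches the Task 7 output
# specification described above (a 4-second clip at 32 kHz).
import soundfile as sf

TARGET_SR = 32_000       # required sampling rate in Hz
TARGET_DURATION = 4.0    # required clip length in seconds

def check_clip(path: str, tol: float = 0.01) -> bool:
    """Return True if the clip matches the expected rate and duration."""
    info = sf.info(path)
    duration = info.frames / info.samplerate
    return info.samplerate == TARGET_SR and abs(duration - TARGET_DURATION) <= tol

if __name__ == "__main__":
    # "generated_scene.wav" is a hypothetical file name.
    print(check_clip("generated_scene.wav"))
```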

Methodology: Dataset and Baseline Systems

The dataset for this challenge comprises 310 audio-caption pairs curated for high-quality sound scene representation: 60 samples for development and 250 for evaluation. Each caption distinguishes foreground from background sounds, with foreground categories covering Animal, Vehicle, Human, Alarm, Tool, and Entrance sounds, and background categories covering Crowd, Traffic, Water, Birds, and Room Tone. This labeling supports the focused synthesis of sounds with perceived naturalness and clarity.
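The category split lends itself to a simple structured representation. The following sketch models one audio-caption entry in code; the field names, caption text, and validation logic are illustrative assumptions rather than the official metadata format.

```python
# A hypothetical in-code representation of one dataset entry, based on the
# foreground/background categories listed above.
from dataclasses import dataclass
from typing import Optional

FOREGROUND_CATEGORIES = {"Animal", "Vehicle", "Human", "Alarm", "Tool", "Entrance"}
BACKGROUND_CATEGORIES = {"Crowd", "Traffic", "Water", "Birds", "Room Tone"}

@dataclass
class SceneCaption:
    foreground: str                   # e.g. "Animal"
    background: str                   # e.g. "Traffic"
    caption: str                      # full text prompt given to the synthesis system
    audio_path: Optional[str] = None  # reference recording, if available

    def __post_init__(self) -> None:
        if self.foreground not in FOREGROUND_CATEGORIES:
            raise ValueError(f"unknown foreground category: {self.foreground}")
        if self.background not in BACKGROUND_CATEGORIES:
            raise ValueError(f"unknown background category: {self.background}")

# Illustrative entry (not taken from the actual dataset):
example = SceneCaption(
    foreground="Animal",
    background="Traffic",
    caption="A dog barking repeatedly with steady traffic noise in the background",
)
```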

AudioLDM serves as the baseline system, trained on diverse datasets such as AudioCaps, AudioSet, and others. This broad exposure to varied sound types enhances the model's adaptability and synthesis capability across numerous environments.
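For concreteness, here is a minimal generation sketch using a publicly released AudioLDM checkpoint through the Hugging Face diffusers AudioLDMPipeline; the checkpoint name, prompt, and step count are illustrative, and the challenge baseline's exact training data and configuration are described in the paper rather than reproduced here.

```python
# Sketch of text-to-audio generation with a public AudioLDM checkpoint
# (assumes a CUDA-capable GPU and the diffusers, torch, and scipy packages).
import scipy.io.wavfile
import torch
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2",  # one publicly released AudioLDM checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "A dog barking repeatedly with steady traffic noise in the background"
audio = pipe(prompt, num_inference_steps=100, audio_length_in_s=4.0).audios[0]

# Public AudioLDM checkpoints typically generate 16 kHz audio, so meeting the
# challenge's 32 kHz requirement would take an extra resampling step or a
# checkpoint trained at the target rate.
scipy.io.wavfile.write("generated_scene.wav", rate=16_000, data=audio)
```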

Evaluation Metrics

Evaluation metrics combine both objective and subjective assessments. The objective metric, Fréchet Audio Distance (FAD), utilizes PANN-Wavegram-Logmel embeddings to quantify the alignment of generated sounds with reference distributions. On the subjective end, three metrics were used: Foreground Fit (FF), Background Fit (BF), and Audio Quality (AQ), all contributing to a composite perceptual score with an emphasis on foreground sound representation.
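The FAD component reduces to the Fréchet distance between two Gaussians fitted to the reference and generated embedding sets. The sketch below assumes the PANN-Wavegram-Logmel embeddings have already been extracted into arrays of shape (num_clips, embedding_dim); the extraction step itself is omitted.

```python
# Fréchet Audio Distance between two sets of audio embeddings.
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to reference and generated embeddings."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error

    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```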

Results and Analysis

The challenge evaluated four submitted systems alongside the baseline, revealing notable gaps and correlations. A crucial observation was a 36% performance discrepancy between synthesized system outputs and professionally engineered reference recordings. High inter-rater agreement (Cronbach's alpha = 0.959) underscored the reliability of the subjective assessments, although the small number of submissions limits how broadly these conclusions generalize. While objective FAD scores correlate noticeably with the subjective evaluations, the analysis questions FAD's robustness as a universal perceptual metric given the limited data points.
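For reference, Cronbach's alpha is a standard internal-consistency statistic; the sketch below computes it from a ratings matrix of clips by raters, using random data purely for illustration rather than the actual challenge ratings.

```python
# Cronbach's alpha over a (num_clips, num_raters) ratings matrix.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    k = ratings.shape[1]                      # number of raters
    clip_totals = ratings.sum(axis=1)         # total score per rated clip
    rater_vars = ratings.var(axis=0, ddof=1)  # variance of each rater's scores
    return (k / (k - 1)) * (1.0 - rater_vars.sum() / clip_totals.var(ddof=1))

# Illustrative only: 40 clips rated 1-10 by 5 raters.
rng = np.random.default_rng(0)
demo = rng.integers(1, 11, size=(40, 5)).astype(float)
print(round(cronbach_alpha(demo), 3))
```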

Participation and Future Implications

Participation was notably lower than in preceding years, possibly because the expanded task scope favored teams with sizable pre-existing models and because the evaluation criteria grew more complex. As generative audio broadens in scope and grows more specialized, the task framework must continually adapt.

Conclusion and Future Directions

While the challenge successfully demonstrates current capabilities in sound scene synthesis, the substantial quality gap to reference audio points to clear avenues for development. Recommended areas for future work include supporting more complex scene structures, improving training methodologies, and evolving evaluation mechanisms to capture sound scene coherence and perceptual quality more comprehensively.

The decision not to continue the sound scene synthesis task in 2025, owing to logistical and evaluative challenges, reflects the complexity and evolving nature of the domain. Nonetheless, the frameworks and insights from the DCASE 2024 challenge contribute substantially to the path forward in auditory synthesis research, advocating for advancements that bring synthetic outputs closer to nuanced, human-like sound scene replication.
