
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis (2311.08667v2)

Published 15 Nov 2023 in cs.SD and eess.AS

Abstract: Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain with cascaded phase recovery modules to reconstruct the waveform, which poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, it achieves a Fréchet audio distance (FAD) score similar to that of the top-ranked baseline with only 10 steps and reaches state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern regarding diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/
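
For readers unfamiliar with the FAD metric mentioned in the abstract, the following is a minimal sketch of the Fréchet distance between two sets of embeddings, each modeled as a multivariate Gaussian. It is not the paper's evaluation code. FAD is commonly computed on embeddings from a pretrained audio classifier; here the embeddings are random placeholder arrays, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (num_samples, embedding_dim).
    Assumption: in a real FAD evaluation these would be embeddings of
    reference and generated audio; this sketch accepts any arrays.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder data standing in for audio embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
shifted = reference + 2.0  # same covariance, mean shifted by 2 in each dimension

print(frechet_distance(reference, reference))
print(frechet_distance(reference, shifted))
```

Lower values indicate that the two embedding distributions are closer. When the two sets share a covariance, the distance reduces to the squared distance between their means.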

Authors (4)
  1. Ge Zhu (17 papers)
  2. Yutong Wen (3 papers)
  3. Zhiyao Duan (53 papers)
  4. Marc-André Carbonneau (16 papers)
Citations (5)
