
Latent CLAP Loss for Better Foley Sound Synthesis (2403.12182v1)

Published 18 Mar 2024 in eess.AS

Abstract: Foley sound generation, the art of creating audio for multimedia, has recently seen notable advancements through text-conditioned latent diffusion models. These systems use multimodal text-audio representation models, such as Contrastive Language-Audio Pretraining (CLAP), whose objective is to map corresponding audio and text prompts into a joint embedding space. AudioLDM, a text-to-audio model, was the winner of the DCASE 2023 Task 7 Foley sound synthesis challenge. The winning system fine-tuned the model for specific audio classes and applied a post-filtering method using CLAP similarity scores between output audio and input text at inference time. This requires generating extra samples and therefore reduces generation efficiency. We introduce a new loss term to enhance Foley sound generation in AudioLDM without post-filtering. This loss term uses a new module based on the CLAP model, the Latent CLAP encoder, to align the latent diffusion output with real audio in a shared CLAP embedding space. Our experiments demonstrate that our method effectively reduces the Fréchet Audio Distance (FAD) score of the generated audio and eliminates the need for post-filtering, thus enhancing generation efficiency.
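The contrast the abstract draws, between filtering candidates at inference time and adding an alignment term at training time, can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything here is an assumption made for illustration: the use of cosine distance, the `weight` parameter, and the function names `post_filter`, `alignment_loss`, and `total_loss` are hypothetical. Embeddings are plain lists of floats standing in for CLAP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def post_filter(candidate_embeddings, text_embedding, keep):
    """Inference-time baseline: rank generated candidates by similarity
    to the text embedding and return the indices of the best `keep`.
    Extra candidates must be generated just to be discarded."""
    ranked = sorted(
        range(len(candidate_embeddings)),
        key=lambda i: cosine_similarity(candidate_embeddings[i], text_embedding),
        reverse=True,
    )
    return ranked[:keep]


def alignment_loss(generated_embedding, real_audio_embedding):
    """Assumed form of an alignment term: cosine distance between the
    embedding of the diffusion output and that of the real audio."""
    return 1.0 - cosine_similarity(generated_embedding, real_audio_embedding)


def total_loss(diffusion_loss, generated_embedding, real_audio_embedding, weight=0.1):
    """Hypothetical combined objective: diffusion loss plus a weighted
    alignment term. `weight` is an illustrative hyperparameter."""
    return diffusion_loss + weight * alignment_loss(
        generated_embedding, real_audio_embedding
    )


if __name__ == "__main__":
    candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [1.0, 0.0]
    print(post_filter(candidates, text, keep=2))
    print(total_loss(0.5, [1.0, 0.0], [0.0, 1.0], weight=0.2))
```

In this sketch the post-filter needs three generated candidates to return two, whereas the training-time term adds no cost at inference, which is the efficiency argument the abstract makes.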

Authors (8)
  1. Tornike Karchkhadze (5 papers)
  2. Hassan Salami Kavaki (2 papers)
  3. Mohammad Rasool Izadi (9 papers)
  4. Bryce Irvin (3 papers)
  5. Mikolaj Kegler (9 papers)
  6. Ari Hertz (1 paper)
  7. Shuo Zhang (256 papers)
  8. Marko Stamenovic (9 papers)