
Generating Realistic Images from In-the-wild Sounds (2309.02405v1)

Published 5 Sep 2023 in cs.CV, cs.SD, and eess.AS

Abstract: Representing wild sounds as images is an important but challenging task, both because paired sound-image datasets are scarce and because the two modalities differ significantly in their characteristics. Previous studies have focused on generating images from sound in limited categories or from music. In this paper, we propose a novel approach to generate images from in-the-wild sounds. First, we convert sound into text using audio captioning. Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and visualize the sound. Lastly, we propose a direct sound optimization with CLIPscore and AudioCLIP and generate images with a diffusion-based model. Experiments show that our model generates high-quality images from wild sounds and outperforms baselines in both quantitative and qualitative evaluations on wild audio datasets.
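
The abstract does not give the form of the optimization objective, so the following is only a minimal sketch of one way an image could be scored against both the generated caption and the original sound. It assumes that embeddings for the image, the caption, and the audio are already available in a shared space as plain lists of floats. The function names, the weighted sum, and the `text_weight` parameter are hypothetical and not taken from the paper. The sketch also swaps the paper's direct optimization for a simpler reranking of candidate images.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(image_emb, caption_emb, audio_emb, text_weight=0.5):
    """Hypothetical score mixing an image-text term and an image-audio term.

    Both terms are cosine similarities clipped at zero. The weighting is an
    assumption made for illustration, not the paper's objective.
    """
    text_term = max(cosine(image_emb, caption_emb), 0.0)
    audio_term = max(cosine(image_emb, audio_emb), 0.0)
    return text_weight * text_term + (1.0 - text_weight) * audio_term


def pick_best(candidates, caption_emb, audio_emb, text_weight=0.5):
    """Return the name of the candidate image embedding with the highest combined score.

    This reranks fixed candidates; it does not optimize anything directly.
    """
    return max(
        candidates,
        key=lambda name: combined_score(
            candidates[name], caption_emb, audio_emb, text_weight
        ),
    )


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    candidates = {"image_a": [1.0, 0.0], "image_b": [0.0, 1.0]}
    caption = [1.0, 0.0]
    audio = [1.0, 0.1]
    print(pick_best(candidates, caption, audio))
```

In a real system the embeddings would come from pretrained encoders, and the score would more plausibly act as a loss that guides the generator than as a filter over finished images.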

Authors (4)
  1. Taegyeong Lee (5 papers)
  2. Jeonghun Kang (4 papers)
  3. Hyeonyu Kim (2 papers)
  4. Taehwan Kim (21 papers)
Citations (2)
