SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models (2405.00878v1)

Published 1 May 2024 in cs.CV

Abstract: We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross attention layers which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio conditioned image generation, our method can also be utilized in conjunction with diffusion based editing methods to enable audio conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.

SonicDiffusion: Advancements in Multimodal Image Synthesis via Audio Cues

The paper "SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models" presents an innovative approach in the domain of multimodal image synthesis by employing audio as a guiding modality within diffusion models, particularly leveraging the existing capabilities of Stable Diffusion. In contrast to the prevailing dependence on textual inputs for image generation, this research highlights the efficacy of audio cues, which can offer a more direct and natural integration with visual content. The researchers introduce SonicDiffusion, a novel framework designed to enable sound-guided image generation and editing, showcasing the versatility and enhanced contextual richness of integrating auditory inputs into the visual synthesis process.

At the core of SonicDiffusion are newly added audio-image cross-attention layers that translate audio signals into visual representations. The paper details the architecture of the audio projector module, which converts features extracted from audio clips into tokens compatible with the diffusion model, analogous to text tokens. Only the projector and the new cross-attention layers are trained, while the weights of the original diffusion layers remain frozen; the method therefore adds relatively few trainable parameters and preserves the capabilities of the pretrained model.
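
To make this mechanism concrete, below is a minimal PyTorch sketch of the two pieces described above: an audio projector that maps a clip-level audio embedding to a small set of tokens in the text-token dimension, and a gated audio-image cross-attention layer that would be trained while the original diffusion layers stay frozen. Module names, dimensions, and the zero-initialized gate are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the audio-conditioning idea, not the paper's exact code.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps a clip-level audio embedding to K tokens in the text-token dimension."""
    def __init__(self, audio_dim=1024, token_dim=768, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, token_dim * num_tokens),
            nn.GELU(),
            nn.Linear(token_dim * num_tokens, token_dim * num_tokens),
        )

    def forward(self, audio_feat):               # (B, audio_dim)
        tokens = self.proj(audio_feat)           # (B, K * token_dim)
        return tokens.view(audio_feat.size(0), self.num_tokens, -1)

class AudioCrossAttention(nn.Module):
    """Extra cross-attention layer placed next to each frozen text cross-attention
    block; only these added layers (and the projector) would be trained."""
    def __init__(self, query_dim=320, token_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(query_dim, heads,
                                          kdim=token_dim, vdim=token_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(query_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: starts as identity

    def forward(self, x, audio_tokens):          # x: (B, N, query_dim)
        attn_out, _ = self.attn(self.norm(x), audio_tokens, audio_tokens)
        return x + self.gate.tanh() * attn_out

# Usage: frozen UNet spatial features attend to the projected audio tokens.
proj = AudioProjector()
xattn = AudioCrossAttention()
audio_feat = torch.randn(2, 1024)      # e.g. output of a pretrained audio encoder
unet_feat = torch.randn(2, 4096, 320)  # spatial tokens inside a UNet block
out = xattn(unet_feat, proj(audio_feat))
print(out.shape)                       # torch.Size([2, 4096, 320])
```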

In evaluation, SonicDiffusion synthesizes images that are visually coherent and semantically aligned with the accompanying audio. Experiments on three diverse datasets (Landscape + Into the Wild, Greatest Hits, and RAVDESS) test the model's ability to capture fine details while maintaining high image quality. Favorable scores on semantic-relevance metrics (AIS, AIC, IIS) and image quality (FID) indicate that the generated images reflect the scenes suggested by the sound inputs.
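
As a rough illustration of how a retrieval-style audio-image relevance score can be computed, the sketch below averages the cosine similarity between paired audio and generated-image embeddings from some joint audio-image encoder (for example a Wav2CLIP- or ImageBind-style model). The encoder outputs here are random placeholders, and this is not necessarily how the paper's AIS/IIS metrics are defined.

```python
# Hedged sketch of an audio-image similarity score; encoders are placeholders.
import torch
import torch.nn.functional as F

def audio_image_similarity(audio_emb: torch.Tensor,
                           image_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between paired audio and image embeddings;
    higher suggests the generated images better reflect the sounds."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    return (a * v).sum(dim=-1).mean()

# Toy usage with random tensors standing in for encoder outputs.
audio_emb = torch.randn(16, 512)   # e.g. audio_encoder(waveforms)
image_emb = torch.randn(16, 512)   # e.g. image_encoder(generated_images)
print(float(audio_image_similarity(audio_emb, image_emb)))
```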

Beyond audio-guided generation, SonicDiffusion also supports image editing. By adapting existing diffusion-based feature injection techniques, the framework modifies real images in response to audio cues, demonstrating sound-guided editing in a literature that has predominantly focused on text-driven mechanisms.
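
The editing flow can be outlined at a high level as: invert the real image to a noise latent, then run the denoising process again while conditioning on the audio tokens and re-injecting features saved from the source pass to preserve the original layout. The sketch below is purely conceptual; every function is a hypothetical stand-in, not the paper's implementation or a real library API.

```python
# Conceptual outline of audio-conditioned editing with stub functions.
import torch

def ddim_invert(image_latent, num_steps=50):
    """Stand-in for DDIM inversion of a real image to its noise latent."""
    return torch.randn_like(image_latent)             # placeholder

def denoise_step(latent, t, audio_tokens, injected_feats=None):
    """Stand-in for one UNet denoising step with audio cross-attention;
    `injected_feats` would overwrite selected source features."""
    return latent - 0.01 * torch.randn_like(latent)   # placeholder update

def audio_conditioned_edit(image_latent, audio_tokens, num_steps=50):
    latent = ddim_invert(image_latent, num_steps)
    for t in reversed(range(num_steps)):
        # A real implementation would inject features recorded while
        # reconstructing the source image at step t to keep its structure.
        latent = denoise_step(latent, t, audio_tokens)
    return latent

edited = audio_conditioned_edit(torch.randn(1, 4, 64, 64), torch.randn(1, 8, 768))
print(edited.shape)
```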

The implications of this work are profound both practically and theoretically. Practically, SonicDiffusion offers a streamlined solution for integrating audio with visual data, broadening the scope of applications in multimedia content creation and editing. Theoretically, it challenges the prevailing emphasis on text-centric methods, advocating for a broader exploration of non-textual modalities in computational media representation.

Future research could explore the inclusion of other sensory inputs, opening opportunities for more immersive and contextually rich content creation tools. Refining the integration techniques, such as improving feature alignment or expanding the cross-attention mechanisms, could further extend multimodal model capabilities. By showing how auditory information can be leveraged effectively within diffusion models, SonicDiffusion paves the way for new explorations in AI-driven multimedia synthesis.

Authors (6)
  1. Burak Can Biner (1 paper)
  2. Farrin Marouf Sofian (3 papers)
  3. Umur Berkay Karakaş (1 paper)
  4. Duygu Ceylan (63 papers)
  5. Erkut Erdem (45 papers)
  6. Aykut Erdem (45 papers)
Citations (1)