SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models (2405.00878v1)

Published 1 May 2024 in cs.CV

Abstract: We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross attention layers which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio conditioned image generation, our method can also be utilized in conjunction with diffusion based editing methods to enable audio conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.

SonicDiffusion: Advancements in Multimodal Image Synthesis via Audio Cues

The paper "SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models" presents an innovative approach in the domain of multimodal image synthesis by employing audio as a guiding modality within diffusion models, particularly leveraging the existing capabilities of Stable Diffusion. In contrast to the prevailing dependence on textual inputs for image generation, this research highlights the efficacy of audio cues, which can offer a more direct and natural integration with visual content. The researchers introduce SonicDiffusion, a novel framework designed to enable sound-guided image generation and editing, showcasing the versatility and enhanced contextual richness of integrating auditory inputs into the visual synthesis process.

At the core of SonicDiffusion are newly added audio-image cross-attention layers that translate audio signals into visual representations. The paper details the architecture of the audio projector module, which converts features extracted from audio clips into tokens compatible with the diffusion model, analogous to text tokens. Only the projector and the new cross-attention layers are trained, while the weights of the original diffusion layers remain frozen; the method therefore adds relatively few trainable parameters and preserves the capabilities of the pretrained model.
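
To make this mechanism concrete, below is a minimal PyTorch sketch of the two pieces described above: an audio projector that maps a clip-level audio embedding to a small set of tokens in the text-token dimension, and a gated audio-image cross-attention layer that would be trained while the original diffusion layers stay frozen. Module names, dimensions, and the zero-initialized gate are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the audio-conditioning idea, not the paper's exact code.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps a clip-level audio embedding to K tokens in the text-token dimension."""
    def __init__(self, audio_dim=1024, token_dim=768, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, token_dim * num_tokens),
            nn.GELU(),
            nn.Linear(token_dim * num_tokens, token_dim * num_tokens),
        )

    def forward(self, audio_feat):               # (B, audio_dim)
        tokens = self.proj(audio_feat)           # (B, K * token_dim)
        return tokens.view(audio_feat.size(0), self.num_tokens, -1)

class AudioCrossAttention(nn.Module):
    """Extra cross-attention layer placed next to each frozen text cross-attention
    block; only these added layers (and the projector) would be trained."""
    def __init__(self, query_dim=320, token_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(query_dim, heads,
                                          kdim=token_dim, vdim=token_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(query_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: starts as identity

    def forward(self, x, audio_tokens):          # x: (B, N, query_dim)
        attn_out, _ = self.attn(self.norm(x), audio_tokens, audio_tokens)
        return x + self.gate.tanh() * attn_out

# Usage: frozen UNet spatial features attend to the projected audio tokens.
proj = AudioProjector()
xattn = AudioCrossAttention()
audio_feat = torch.randn(2, 1024)      # e.g. output of a pretrained audio encoder
unet_feat = torch.randn(2, 4096, 320)  # spatial tokens inside a UNet block
out = xattn(unet_feat, proj(audio_feat))
print(out.shape)                       # torch.Size([2, 4096, 320])
```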

In evaluation, SonicDiffusion synthesizes images that are visually coherent and semantically aligned with the accompanying audio. Experiments on three diverse datasets (Landscape + Into the Wild, Greatest Hits, and RAVDESS) test the model's ability to capture fine details while maintaining high image quality. Favorable scores on semantic-relevance metrics (AIS, AIC, IIS) and image quality (FID) indicate that the generated images reflect the scenes suggested by the sound inputs.
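
As a rough illustration of how a retrieval-style audio-image relevance score can be computed, the sketch below averages the cosine similarity between paired audio and generated-image embeddings from some joint audio-image encoder (for example a Wav2CLIP- or ImageBind-style model). The encoder outputs here are random placeholders, and this is not necessarily how the paper's AIS/IIS metrics are defined.

```python
# Hedged sketch of an audio-image similarity score; encoders are placeholders.
import torch
import torch.nn.functional as F

def audio_image_similarity(audio_emb: torch.Tensor,
                           image_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between paired audio and image embeddings;
    higher suggests the generated images better reflect the sounds."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    return (a * v).sum(dim=-1).mean()

# Toy usage with random tensors standing in for encoder outputs.
audio_emb = torch.randn(16, 512)   # e.g. audio_encoder(waveforms)
image_emb = torch.randn(16, 512)   # e.g. image_encoder(generated_images)
print(float(audio_image_similarity(audio_emb, image_emb)))
```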

Beyond audio-guided generation, SonicDiffusion also supports image editing. By adapting existing diffusion-based feature injection techniques, the framework modifies real images in response to audio cues, demonstrating sound-guided editing in a literature that has predominantly focused on text-driven mechanisms.
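
The editing flow can be outlined at a high level as: invert the real image to a noise latent, then run the denoising process again while conditioning on the audio tokens and re-injecting features saved from the source pass to preserve the original layout. The sketch below is purely conceptual; every function is a hypothetical stand-in, not the paper's implementation or a real library API.

```python
# Conceptual outline of audio-conditioned editing with stub functions.
import torch

def ddim_invert(image_latent, num_steps=50):
    """Stand-in for DDIM inversion of a real image to its noise latent."""
    return torch.randn_like(image_latent)             # placeholder

def denoise_step(latent, t, audio_tokens, injected_feats=None):
    """Stand-in for one UNet denoising step with audio cross-attention;
    `injected_feats` would overwrite selected source features."""
    return latent - 0.01 * torch.randn_like(latent)   # placeholder update

def audio_conditioned_edit(image_latent, audio_tokens, num_steps=50):
    latent = ddim_invert(image_latent, num_steps)
    for t in reversed(range(num_steps)):
        # A real implementation would inject features recorded while
        # reconstructing the source image at step t to keep its structure.
        latent = denoise_step(latent, t, audio_tokens)
    return latent

edited = audio_conditioned_edit(torch.randn(1, 4, 64, 64), torch.randn(1, 8, 768))
print(edited.shape)
```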

The implications of this work are profound both practically and theoretically. Practically, SonicDiffusion offers a streamlined solution for integrating audio with visual data, broadening the scope of applications in multimedia content creation and editing. Theoretically, it challenges the prevailing emphasis on text-centric methods, advocating for a broader exploration of non-textual modalities in computational media representation.

Future research could explore the inclusion of other sensory inputs, opening opportunities for more immersive and contextually rich content creation tools. Refining the integration techniques, such as improving feature alignment or expanding the cross-attention mechanisms, could further extend multimodal model capabilities. By showing how auditory information can be leveraged effectively within diffusion models, SonicDiffusion paves the way for new explorations in AI-driven multimedia synthesis.

Authors (6)
  1. Burak Can Biner (1 paper)
  2. Farrin Marouf Sofian (3 papers)
  3. Umur Berkay Karakaş (1 paper)
  4. Duygu Ceylan (63 papers)
  5. Erkut Erdem (45 papers)
  6. Aykut Erdem (45 papers)
Citations (1)