Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners (2402.17723v1)

Published 27 Feb 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Video and audio content creation serves as the core technique for the movie industry and professional users. Recently, existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training the giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/

Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Introduction

The paper addresses the challenge of open-domain visual-audio generation, aiming to create synchronized video and audio content. This task has significant implications for content creation, enhancing multimedia experiences across various domains. The authors navigate the complexities of generating multimodal content by leveraging pre-existing, high-performance, single-modality generation models. They introduce an innovative approach that unifies these models through a shared latent representation space, facilitated by a Multimodality Latent Aligner built upon the ImageBind model. This work stands out by offering a versatile and resource-efficient solution to the joint visual-audio generation problem, showcasing notable improvements over existing methods.

Methods

Problem Formulation

The authors propose an optimization framework that integrates different modalities into a coherent generation process without requiring large-scale training for each new modality combination. The process hinges on a Diffusion Latent Aligner, which uses the shared embedding space of ImageBind to steer generation toward alignment with the input conditions. The aligner acts during the denoising steps of the diffusion process, modifying the latent variables so that the generated video and audio stay compatible, or, more generally, so that any input and target modalities are aligned.
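As the paper notes, this guidance shares its core idea with classifier guidance. A minimal sketch of one guided denoising step, using notation chosen here rather than taken from the paper, is

z_{t-1} = \mu_\theta(z_t, t) - \lambda \, \nabla_{z_t} \, d\big(E(\hat{x}_0(z_t)),\, E(c)\big) + \sigma_t \epsilon,

where \mu_\theta is the denoiser's predicted mean, \hat{x}_0(z_t) the clean-sample estimate decoded from the current latent, E the ImageBind encoder, c the conditioning input from the other modality, d a distance in the shared embedding space, \lambda a guidance scale, and \epsilon standard Gaussian noise.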

Diffusion Latent Aligner

The core of their method, the Diffusion Latent Aligner, operates by injecting alignment information during the generative process. It achieves this by measuring the distance between the generated content and the input condition within the ImageBind embedding space, then using this distance as feedback to adjust the generation trajectory. This approach represents a significant technical innovation, as it directly leverages the multimodal nature of the ImageBind model without additional resource-intensive retraining.
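A minimal, hedged sketch of what such a guidance step could look like in PyTorch is shown below; denoiser, decode, embed_gen, and embed_cond are hypothetical stand-ins for the diffusion model, its decoder, and the ImageBind encoders, not the authors' actual interfaces or hyperparameters.

```python
# Minimal sketch of an embedding-space latent aligner guidance step.
# All callables here are hypothetical placeholders, not the paper's code.
import torch
import torch.nn.functional as F

def aligner_guidance_step(z_t, t, denoiser, decode, embed_gen, embed_cond, cond, scale=1.0):
    """Nudge the noisy latent z_t so that the decoded sample moves closer
    to the conditioning input in a shared (ImageBind-like) embedding space."""
    z_t = z_t.detach().requires_grad_(True)

    # Estimate the clean sample from the current noisy latent, then decode it.
    x0_hat = decode(denoiser(z_t, t))

    # Embed the generated estimate and the condition into the shared space.
    e_gen = F.normalize(embed_gen(x0_hat), dim=-1)
    e_cond = F.normalize(embed_cond(cond), dim=-1)

    # Alignment loss: cosine distance between the two embeddings.
    loss = (1.0 - (e_gen * e_cond).sum(dim=-1)).mean()

    # The gradient of the loss w.r.t. the latent gives the guidance direction.
    grad = torch.autograd.grad(loss, z_t)[0]

    # Classifier-guidance-style update: step against the alignment gradient.
    return (z_t - scale * grad).detach()
```

In an actual sampler this update would be interleaved with the usual denoising step at each timestep, with the guidance scale trading off alignment strength against sample quality.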

Experiments

The authors conduct comprehensive experiments to validate their framework, covering video-to-audio, audio-to-video, joint video-audio, and image-to-audio generation. Across these tasks, the framework produces better-aligned and higher-quality multimodal content than existing methods, with improvements on metrics such as Fréchet Video Distance (FVD), Kernel Video Distance (KVD), and audio-video alignment (AV-align), indicating enhanced fidelity and semantic coherence in the generated content.
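For reference, FVD is the Fréchet distance between feature distributions of real and generated videos, typically computed on activations of a pretrained video classifier such as I3D. A minimal sketch of that distance computation, assuming per-video feature matrices have already been extracted (feature extraction itself is omitted), is:

```python
# Hedged sketch: Frechet distance between two sets of video features,
# the quantity underlying FVD. `feats_real` and `feats_fake` are assumed
# to be (N, D) arrays of per-video embeddings.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake, eps=1e-6):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    # Matrix square root of the covariance product; add jitter if the
    # product is numerically singular.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(cov_r.shape[0]) * eps
        covmean, _ = linalg.sqrtm((cov_r + offset) @ (cov_f + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```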

Discussion and Future Directions

Implications

This research introduces an elegant solution to multimodal content generation, offering tangible improvements in alignment and quality. By reusing existing models instead of training new, large ones, the approach provides a cost-effective and flexible methodology for visual-audio generation tasks.

Limitations and Future Work

While the framework achieves impressive performance, it inherits the limitations of the base generative models it employs, so future improvements in these foundation models should translate directly into better results. Additionally, extending the method to further modalities, or to more constrained and domain-specific settings, could open fruitful research avenues.

Conclusion

This paper presents a novel framework for open-domain visual-audio content generation that bridges the gap between pre-existing single-modality models through a shared, multimodal latent space. The approach demonstrates significant advancements in generating semantically aligned and high-quality multimodal content, marking a notable contribution to the field of AI-driven multimedia creation.

Authors (5)
  1. Yazhou Xing (10 papers)
  2. Yingqing He (23 papers)
  3. Zeyue Tian (12 papers)
  4. Xintao Wang (132 papers)
  5. Qifeng Chen (187 papers)
Citations (32)