MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation (2405.17842v2)

Published 28 May 2024 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module to adjust the scores estimated separately by the base models so that they match the score of the joint distribution over audio and video. We show that this guidance can be computed using the gradient of the optimal discriminator, which distinguishes real audio-video pairs from fake ones generated independently by the base models. Based on this analysis, we construct the joint guidance module by training this discriminator. Additionally, we adopt a loss function to stabilize the discriminator's gradient and make it work as a noise estimator, as in standard diffusion models. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multimodal alignment with relatively few parameters. The code is available at: https://github.com/SonyResearch/MMDisCo.
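
As a rough illustration of the sampling-time idea described above, the sketch below is a minimal, hypothetical PyTorch rendering, not the authors' implementation (see the linked repository for MMDisCo itself). It combines the scores estimated independently by two frozen base models with the gradient of a small joint discriminator's logit, which for a sigmoid head equals log(D/(1-D)). All class and function names (JointDiscriminator, guided_scores, audio_score, video_score) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class JointDiscriminator(nn.Module):
    """Hypothetical stand-in for the lightweight joint guidance module: it maps a
    noisy (audio, video) pair plus the diffusion timestep to one real/fake logit."""

    def __init__(self, audio_dim: int, video_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + video_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a_t: torch.Tensor, v_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        x = torch.cat([a_t.flatten(1), v_t.flatten(1), t[:, None].float()], dim=1)
        return self.net(x).squeeze(1)  # logit; sigmoid(logit) would be D(a_t, v_t, t)


def guided_scores(audio_score, video_score, disc, a_t, v_t, t, scale: float = 1.0):
    """Adjust the independently estimated scores with a discriminator-gradient term.

    Intuition from the abstract: for the optimal discriminator D* separating real
    audio-video pairs from pairs the base models generate independently, the gradient
    of log(D* / (1 - D*)) is the correction that moves the sum of the two
    single-modal scores toward the score of the joint audio-video distribution.
    """
    a_t = a_t.detach().requires_grad_(True)
    v_t = v_t.detach().requires_grad_(True)
    logit = disc(a_t, v_t, t)  # equals log(D / (1 - D)) for a sigmoid head
    grad_a, grad_v = torch.autograd.grad(logit.sum(), (a_t, v_t))
    with torch.no_grad():
        s_a = audio_score(a_t, t) + scale * grad_a  # corrected audio score
        s_v = video_score(v_t, t) + scale * grad_v  # corrected video score
    return s_a, s_v
```

In this sketch the corrected scores would simply replace the single-modal scores inside whatever sampler the base models already use; only the small discriminator is trained, and the base models are assumed frozen, matching the "lightweight joint guidance module" framing in the abstract.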
