CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling (2312.05412v2)

Published 8 Dec 2023 in cs.LG, cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in quality and generation speed through the introduction of our novel cross-modal "easy fusion" architectural block. Furthermore, the incorporation of the contrastive loss yields improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
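The abstract does not spell out the form of the "joint contrastive training loss," but losses of this kind are commonly InfoNCE-style objectives that pull paired video and audio embeddings together while pushing apart mismatched pairs within a batch. A minimal sketch of such a symmetric contrastive loss is shown below; the function name, the `temperature` value, and the assumption of one embedding per clip are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss for paired embeddings.

    video_emb, audio_emb: arrays of shape (batch, dim), where row i of
    each array comes from the same clip. Matched pairs (the diagonal of
    the similarity matrix) are positives; all other rows are negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (batch, batch) similarity matrix
    n = logits.shape[0]

    def cross_entropy_diagonal(l):
        # Softmax cross-entropy with the positive on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits.T))
```

With perfectly aligned pairs the diagonal similarities dominate and the loss is small; deliberately shuffling the audio rows against the video rows raises it, which is the signal that drives the audio-visual synchronization the abstract describes.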

Authors (3)
  1. Ruihan Yang (43 papers)
  2. Hannes Gamper (24 papers)
  3. Sebastian Braun (29 papers)
Citations (2)