MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation (2212.09478v2)

Published 19 Dec 2022 in cs.CV

Abstract: We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two-coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noises. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block bridging over the two subnets, which enables efficient cross-modal alignment, and thus reinforces the audio-video fidelity for each other. Extensive experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.

Overview of MM-Diffusion for Joint Audio-Video Generation

The paper presents MM-Diffusion, a pioneering framework for generating high-quality, realistic joint audio-video content. The work extends diffusion models, which have typically been applied to a single modality, to joint multi-modal generation. The framework employs two coupled denoising autoencoders, enabling the simultaneous generation of semantically consistent audio and video. The central innovation is a sequential multi-modal U-Net architecture that aligns the two modalities throughout the denoising process.

Methodology

Multi-Modal Diffusion Framework

MM-Diffusion leverages a unified model to learn the joint distribution of audio and video. The model incorporates a random-shift based attention mechanism to ensure temporal and semantic alignment between modalities. This novel attention block bridges the audio and video sub-networks, enhancing cross-modal fidelity.
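
To make the joint denoising process concrete, the minimal sketch below shows what a single training step of such a coupled model could look like in PyTorch. The `coupled_unet` call signature, the shared noise schedule `alphas_cumprod`, and the unweighted sum of losses are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def joint_diffusion_training_step(coupled_unet, video, audio, alphas_cumprod):
    """One training step for a joint audio-video diffusion model (sketch).

    `coupled_unet` stands in for the two-subnet U-Net described in the paper
    and is assumed to return noise predictions for both modalities. `video`
    is (B, C, T, H, W), `audio` is (B, 1, L), and `alphas_cumprod` is the
    cumulative product of the noise schedule.
    """
    b, device = video.shape[0], video.device

    # One shared timestep per example couples the two denoising trajectories.
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=device)
    a_bar = alphas_cumprod[t]
    a_v = a_bar.view(b, 1, 1, 1, 1)   # broadcast to the video shape
    a_a = a_bar.view(b, 1, 1)         # broadcast to the audio shape

    # Forward-diffuse both modalities with independent Gaussian noise.
    eps_v, eps_a = torch.randn_like(video), torch.randn_like(audio)
    noisy_video = a_v.sqrt() * video + (1 - a_v).sqrt() * eps_v
    noisy_audio = a_a.sqrt() * audio + (1 - a_a).sqrt() * eps_a

    # The coupled U-Net denoises both modalities jointly.
    pred_v, pred_a = coupled_unet(noisy_video, noisy_audio, t)
    return F.mse_loss(pred_v, eps_v) + F.mse_loss(pred_a, eps_a)
```

Sampling a single timestep shared by both modalities is what ties the two denoising trajectories together; modality-specific schedules would be a straightforward variation.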

Architecture

The coupled U-Net architecture consists of distinct audio and video streams, each tailored to their respective data patterns. The video stream processes spatial-temporal data using a combination of 1D and 2D convolutions, while the audio stream leverages 1D dilated convolutions to handle long-term dependencies. The integration through random-shift multi-modal attention facilitates efficient inter-modality interactions by reducing temporal redundancies.
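
The random-shift attention can be illustrated with a simplified single-head module in which video tokens attend to a randomly shifted window of audio tokens instead of the full audio sequence. The window size, projection layout, and residual update here are assumptions made for clarity and do not reproduce the paper's exact block.

```python
import torch
import torch.nn as nn

class RandomShiftCrossAttention(nn.Module):
    """Simplified single-head sketch of random-shift cross-modal attention.

    Video tokens attend to a randomly shifted window of audio tokens rather
    than the full audio sequence, which keeps cross-modal attention cheap.
    """

    def __init__(self, dim: int, window: int = 8):
        super().__init__()
        self.window = window
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Tv, D); audio_tokens: (B, Ta, D)
        ta = audio_tokens.shape[1]

        # Sample a fresh shift each call so, over training, every video token
        # sees many different audio neighbourhoods.
        shift = torch.randint(0, max(ta - self.window, 1), (1,)).item()
        window = audio_tokens[:, shift:shift + self.window]    # (B, W, D)

        q = self.q(video_tokens)                                # (B, Tv, D)
        k, v = self.k(window), self.v(window)                   # (B, W, D)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return video_tokens + attn @ v                          # residual cross-modal update
```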

Zero-Shot Conditional Generation

Although primarily focused on unconditional generation, MM-Diffusion demonstrates strong performance in zero-shot conditional tasks such as audio-to-video and video-to-audio generation. This is achieved without additional task-specific training, showcasing the robustness and adaptability of the model.
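
A common way to realize such zero-shot conditioning with a pretrained joint diffusion model is a replacement-style sampler: at each reverse step the known modality is re-noised to the current timestep and substituted back in, so only the missing modality is actually synthesized. The sketch below illustrates this for video-to-audio generation; the `coupled_unet` interface and the deterministic DDIM-style update are simplifying assumptions, not necessarily the paper's exact procedure.

```python
import torch

@torch.no_grad()
def video_to_audio_replacement_sampling(coupled_unet, video, alphas_cumprod,
                                        audio_shape):
    """Replacement-style zero-shot video-to-audio sampling (sketch).

    At each reverse step the known modality (video) is re-noised to the
    current noise level and substituted back in, so only the audio branch
    is generated.
    """
    device = video.device
    audio = torch.randn(audio_shape, device=device)   # start audio from noise

    for t in reversed(range(alphas_cumprod.shape[0])):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)

        # Re-noise the ground-truth video to match the current timestep.
        noisy_video = a_bar.sqrt() * video + (1 - a_bar).sqrt() * torch.randn_like(video)

        t_batch = torch.full((audio.shape[0],), t, device=device, dtype=torch.long)
        _, eps_a = coupled_unet(noisy_video, audio, t_batch)

        # Deterministic DDIM-style update (eta = 0) on the audio branch only.
        pred_x0 = (audio - (1 - a_bar).sqrt() * eps_a) / a_bar.sqrt()
        audio = a_bar_prev.sqrt() * pred_x0 + (1 - a_bar_prev).sqrt() * eps_a

    return audio
```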

Evaluation

The model was evaluated on the Landscape and AIST++ dancing datasets against state-of-the-art single-modal generation models, including DIGAN, TATS, and DiffWave. MM-Diffusion achieved superior scores on both visual and audio quality metrics (lower is better for both FVD and FAD), with relative improvements of 25.0% in FVD and 32.9% in FAD on the Landscape dataset, and gains of 56.7% in FVD and 37.7% in FAD on AIST++.

Implications and Future Directions

The development of MM-Diffusion marks a significant advancement in the domain of multi-modal content generation. This model not only demonstrates the capability to generate high-fidelity joint audiovisual content but also lays the groundwork for further explorations into cross-modal generation and editing tasks. Future research may focus on incorporating additional modalities, such as text prompts, to guide the generation process further. The practical applications of such models are vast, spanning entertainment, virtual reality, and automated content creation.

In summary, MM-Diffusion offers a substantial contribution to the field of generative models, addressing the complexities of multi-modal content synthesis with a robust, efficient framework. The successful alignment of audio and video through the proposed methodology sets a foundation for subsequent innovations in multi-modal AI research.

Authors (9)
  1. Ludan Ruan (7 papers)
  2. Yiyang Ma (15 papers)
  3. Huan Yang (306 papers)
  4. Huiguo He (8 papers)
  5. Bei Liu (63 papers)
  6. Jianlong Fu (91 papers)
  7. Nicholas Jing Yuan (22 papers)
  8. Qin Jin (94 papers)
  9. Baining Guo (53 papers)
Citations (127)