CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling (2312.05412v2)

Published 8 Dec 2023 in cs.LG, cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in quality and generation speed through the introduction of our novel cross-modal "easy fusion" architectural block. Furthermore, the incorporation of the contrastive loss yields improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
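The abstract does not spell out the form of the "joint contrastive training loss," but losses of this kind are commonly InfoNCE-style objectives that pull paired video and audio embeddings together while pushing apart mismatched pairs within a batch. A minimal sketch of such a symmetric contrastive loss is shown below; the function name, the `temperature` value, and the assumption of one embedding per clip are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss for paired embeddings.

    video_emb, audio_emb: arrays of shape (batch, dim), where row i of
    each array comes from the same clip. Matched pairs (the diagonal of
    the similarity matrix) are positives; all other rows are negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (batch, batch) similarity matrix
    n = logits.shape[0]

    def cross_entropy_diagonal(l):
        # Softmax cross-entropy with the positive on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits.T))
```

With perfectly aligned pairs the diagonal similarities dominate and the loss is small; deliberately shuffling the audio rows against the video rows raises it, which is the signal that drives the audio-visual synchronization the abstract describes.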

Authors (3)
  1. Ruihan Yang (43 papers)
  2. Hannes Gamper (24 papers)
  3. Sebastian Braun (29 papers)
Citations (2)