Papers
Topics
Authors
Recent
Search
2000 character limit reached

Video-to-Audio Generation with Fine-grained Temporal Semantics

Published 23 Sep 2024 in eess.AS and cs.SD | (2409.14709v1)

Abstract: With recent advances of AIGC, video generation have gained a surge of research interest in both academia and industry (e.g., Sora). However, it remains a challenge to produce temporally aligned audio to synchronize the generated video, considering the complicated semantic information included in the latter. In this work, inspired by the recent success of text-to-audio (TTA) generation, we first investigate the video-to-audio (VTA) generation framework based on latent diffusion model (LDM). Similar to latest pioneering exploration in VTA, our preliminary results also show great potentials of LDM in VTA task, but it still suffers from sub-optimal temporal alignment. To this end, we propose to enhance the temporal alignment of VTA with frame-level semantic information. With the recently popular grounding segment anything model (Grounding SAM), we can extract the fine-grained semantics in video frames to enable VTA to produce better-aligned audio signal. Extensive experiments demonstrate the effectiveness of our system on both objective and subjective evaluation metrics, which shows both better audio quality and fine-grained temporal alignment.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (5)

Collections

Sign up for free to add this paper to one or more collections.