Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound (2408.11915v2)
Abstract: Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor alignment and controllability, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an intuitive condition with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope closely related to audio semantics, acts as a temporal event feature to guide audio generation from video. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas such as RMS discretization and RMS-ControlNet built on a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Source code, model weights, and demos are available on our companion website. (https://jnwnlee.github.io/video-foley-demo)
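To make the abstract's central temporal feature concrete, the sketch below computes a frame-level RMS envelope from a waveform and quantizes it into discrete bins. It is a minimal illustration under assumed parameters (frame length, hop length, bin count, uniform quantization); the paper's actual RMS discretization scheme may differ.

```python
import numpy as np

def frame_rms(audio: np.ndarray, frame_length: int = 1024, hop_length: int = 512) -> np.ndarray:
    """Compute the frame-level RMS (intensity envelope) of a mono waveform."""
    n_frames = 1 + max(0, len(audio) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop_length : i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

def discretize_rms(rms: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Map continuous RMS values to integer bins (illustrative; bin count and scaling are assumptions)."""
    # Normalize to [0, 1], then quantize uniformly into n_bins levels.
    norm = (rms - rms.min()) / (rms.max() - rms.min() + 1e-8)
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

# Example: a 1-second sine burst at 16 kHz yields a discretized intensity curve
# that could serve as a frame-level temporal event condition.
audio = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000)) * np.hanning(16000)
print(discretize_rms(frame_rms(audio))[:10])
```

In the two-stage framework described above, Video2RMS would predict such a discretized envelope from video frames, and RMS2Sound would condition audio generation on it together with a semantic timbre prompt.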