SonicVisionLM: Playing Sound with Vision Language Models (2401.04394v3)

Published 9 Jan 2024 in cs.MM, cs.SD, and eess.AS

Abstract: There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present SonicVisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into more well-studied sub-problems of aligning image-to-text and text-to-audio through the popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals, and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/

SonicVisionLM: Advancing Video-Sound Generation with Vision-Language Models

The research paper introduces SonicVisionLM, a sophisticated framework designed to generate sound effects for silent videos using vision-language models (VLMs). The primary motivation lies in overcoming the challenges faced by existing methods in aligning visual and audio representations, which is critical for practical applications in video post-production. The proposed framework transforms this alignment challenge by breaking it down into the more manageable tasks of image-to-text and text-to-audio alignment, harnessing the capabilities of LLMs and diffusion models.

Framework Overview

SonicVisionLM distinguishes itself by integrating three pivotal components: a video-to-text converter, a text-based interaction module, and a text-to-audio generation mechanism. Rather than generating audio directly from visual data, which often leads to synchronization issues and semantically irrelevant sound, the system first parses a silent video with a VLM to identify events and suggest plausible accompanying sounds. This recasts the complex task of video-audio alignment as more tractable subtasks that play to the strengths of existing language and diffusion models.
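
To make the three-stage split concrete, the sketch below shows one way the stages could be wired together. This is not the authors' code: the concrete models (VLM captioner, timestamp detector, latent diffusion generator) are not reproduced here and are passed in as callables, so every name in the signature is an illustrative assumption.

```python
from typing import Callable, List, Tuple

# A minimal orchestration sketch of the three-stage decomposition described
# above. The stage implementations are supplied as callables because the
# actual models are not reproduced here; all names are illustrative.

TextTimePair = Tuple[str, Tuple[float, float]]  # (sound description, (start_s, end_s))

def sonic_pipeline(
    video_path: str,
    describe_events: Callable[[str], List[TextTimePair]],              # stage 1: VLM + timestamps
    refine_pairs: Callable[[List[TextTimePair]], List[TextTimePair]],  # stage 2: user interaction
    render_audio: Callable[[str, Tuple[float, float]], list],          # stage 3: text-to-audio LDM
) -> list:
    """Return one generated audio clip per detected (or user-added) sound event."""
    pairs = describe_events(video_path)   # silent video -> (text, timestamp) proposals
    pairs = refine_pairs(pairs)           # optional user refinement or new pairs
    return [render_audio(text, span) for text, span in pairs]
```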

The key innovation in this framework is the introduction of a time-controlled audio adapter, which facilitates precise synchronization between audio and video events. Additionally, the authors have developed a dataset, CondPromptBank, featuring over ten thousand entries across 23 categories, mapping text descriptions to specific sound effects to enhance the training and performance of their model.
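
As a rough illustration of what a CondPromptBank-style entry might contain, the dataclass below sketches one plausible schema. The field names, the example category, and the file path are assumptions made for illustration, not the dataset's released format.

```python
from dataclasses import dataclass

# Illustrative sketch of a CondPromptBank-style entry. The paper describes
# over ten thousand text-to-sound-effect mappings across 23 categories; the
# schema and example values below are assumptions, not the released format.

@dataclass
class SoundPromptEntry:
    category: str      # one of the 23 sound-effect categories
    description: str   # text prompt describing the sound event
    audio_path: str    # path to the matching sound-effect clip
    onset_s: float     # onset of the effect within the clip, in seconds
    duration_s: float  # length of the effect, in seconds

example = SoundPromptEntry(
    category="footsteps",                                 # hypothetical category name
    description="heavy boots walking on a wooden floor",  # hypothetical prompt
    audio_path="sfx/footsteps_wood_003.wav",              # hypothetical path
    onset_s=0.2,
    duration_s=1.8,
)
```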

Methodology and Components

  1. Video-to-Text Transformer: This component uses VLMs to produce text descriptions of on-screen events from video inputs. By doing so, it transitions the problem from generating audio directly from video to one of generating text from video.
  2. Timestamp Detection: Trained on a curated dataset, this module employs an R(2+1)D-18 network (a ResNet-18 with (2+1)D spatiotemporal convolutions) to predict time-specific information, which is crucial for guiding the audio generation step and ensuring temporal accuracy.
  3. Text-Based Interaction: This module allows for user input to refine or introduce new text-timestamp pairs, effectively adding a level of customizability and adaptability to the sound design process.
  4. Text-to-Audio Generation: Using a Latent Diffusion Model (LDM), this component translates text and timestamps into synchronized, diverse audio outputs. The novel time-controlled adapter is pivotal in accurately aligning the generated audio with the video events (see the timestamp-mask sketch after this list).
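
One plausible way to feed timestamps into the generator, assuming the time-controlled adapter consumes a frame-level activity mask aligned with the mel-spectrogram, is sketched below. The frame rate and mask format are assumptions; the paper's adapter may encode timing differently.

```python
import numpy as np

# Sketch: convert (start, end) timestamps into a per-frame binary conditioning
# signal, under the assumption that timing is represented as an activity mask
# aligned with mel-spectrogram frames. Frame rate and format are assumptions.

def timestamp_mask(events, clip_len_s, frames_per_s=100.0):
    """Return a {0,1} vector with one entry per spectrogram frame.

    events: list of (start_s, end_s) pairs during which a sound should occur.
    """
    n_frames = int(round(clip_len_s * frames_per_s))
    mask = np.zeros(n_frames, dtype=np.float32)
    for start_s, end_s in events:
        lo = max(0, int(start_s * frames_per_s))
        hi = min(n_frames, int(np.ceil(end_s * frames_per_s)))
        mask[lo:hi] = 1.0
    return mask

# e.g. a 10 s clip with a door slam around 2.0-2.3 s and footsteps from 5.0-7.5 s
mask = timestamp_mask([(2.0, 2.3), (5.0, 7.5)], clip_len_s=10.0)
```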

Experimental Results

SonicVisionLM is evaluated against state-of-the-art methods and shows superior performance on both conditional and unconditional sound generation tasks. It improves synchronization and semantic accuracy, as reflected in metrics such as CLAP score and IoU, with particularly large gains on conditional generation (e.g., IoU rising from 22.4 to 39.7). These results underscore the model's ability not only to produce audio that matches the targeted visual elements but also to maintain precise timing, which is crucial for real-world applications.
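
For reference, the IoU reported above measures temporal overlap between predicted and ground-truth sound activity. A minimal sketch of such a frame-level IoU is given below, assuming both tracks are represented as binary per-frame activity masks; the paper's exact evaluation protocol may differ.

```python
import numpy as np

# Frame-level temporal IoU between predicted and ground-truth sound activity,
# assuming both are binary per-frame masks (an illustrative assumption).

def temporal_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two binary per-frame activity masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both tracks silent everywhere: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```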

Implications and Future Directions

SonicVisionLM highlights the potential of utilizing VLMs in multimedia applications, particularly in automating labor-intensive tasks of video post-production such as sound effect integration. From a practical standpoint, this research offers sound designers a tool for efficiently producing high-quality, synchronized audio, reducing manual effort significantly.

Theoretically, the introduction of time-controlled features into text-audio models opens new avenues for exploring temporal dynamics in multimodal machine learning tasks. Future research could build on this framework by expanding the complexity of the soundscapes it can generate, perhaps by learning from user edits over time to further improve its recommendations.

In summary, SonicVisionLM represents a notable advancement in the domain of video-sound generation, leveraging the capabilities of LLMs to simplify and improve the workflow significantly. The research sets a robust foundation for future explorations in automated audio synthesis from visual content, suggesting exciting possibilities for enhanced multimedia production processes.

Authors (4)
  1. Zhifeng Xie (11 papers)
  2. Shengye Yu (1 paper)
  3. Mengtian Li (31 papers)
  4. Qile He (2 papers)