Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model (2505.13062v3)
Abstract: Humans can intuitively infer sounds from silent videos, but whether multimodal LLMs can perform such modal-mismatch reasoning without access to the target modality remains relatively unexplored. Current text-assisted video-to-audio (VT2A) methods excel at video Foley tasks but struggle to obtain audio descriptions at inference time. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate the capabilities of vision-language models (VLMs) on this task. To further enhance VLMs' reasoning capacity for SVAD, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and downstream VT2A tasks demonstrate our method's effectiveness in two key respects: it significantly improves VLMs' modal-mismatch reasoning for SVAD and effectively addresses the challenge of acquiring audio descriptions during VT2A inference.
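As a rough illustration of the kind of Chain-of-Thought supervision the abstract describes, a CoT-style SFT example for SVAD might pair sampled video frames with an intermediate reasoning trace and a final audio caption. The sketch below is a minimal, hypothetical data format: the class name, field names, and prompt wording are assumptions for illustration, not the released CoT-AudioCaps schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of a Chain-of-Thought SFT example for SVAD.
# Field names, prompt wording, and structure are illustrative assumptions,
# not the paper's released CoT-AudioCaps format.

@dataclass
class CoTSVADExample:
    video_id: str            # identifier of the silent video clip
    frame_paths: List[str]   # sampled frames shown to the VLM
    reasoning: str           # intermediate visual-to-audio reasoning trace
    audio_description: str   # target audio caption (supervision signal)

    def to_prompt_and_target(self) -> Tuple[str, str]:
        """Render the example as (prompt, target) text for supervised fine-tuning."""
        prompt = (
            "You are given frames from a silent video.\n"
            "Step 1: Identify the visible sound sources and actions.\n"
            "Step 2: Infer the sounds they would produce.\n"
            "Step 3: Write a concise audio description."
        )
        target = (
            f"Reasoning: {self.reasoning}\n"
            f"Audio description: {self.audio_description}"
        )
        return prompt, target


if __name__ == "__main__":
    ex = CoTSVADExample(
        video_id="clip_0001",
        frame_paths=["clip_0001/frame_00.jpg", "clip_0001/frame_08.jpg"],
        reasoning="A dog runs across gravel toward its owner, who claps twice.",
        audio_description="Paws scraping on gravel, followed by two sharp hand claps.",
    )
    prompt, target = ex.to_prompt_and_target()
    print(prompt, "\n---\n", target)
```

At inference time, the same prompt could be issued to a fine-tuned VLM on a silent clip, and the generated audio description passed as the text condition to a VT2A model, which is the downstream use the abstract describes.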