FoleyBench: Benchmark for Video-to-Audio Models

Updated 24 November 2025
  • FoleyBench is a comprehensive benchmark comprising 5,000 short clips plus an extended long-form set, curated for precise temporal alignment and broad coverage of Foley-style sound categories.
  • It utilizes an automated multi-stage pipeline with scene detection, YAMNet filtering, and Gemini grounding to ensure robust audio-visual correspondence.
  • Standardized metrics (FAD, IS, KLD, IB, CLAP, and De-Sync) expose how models handle discrete impacts, multi-source scenes, and long-form consistency in V2A generation.

FoleyBench is a large-scale benchmark specifically designed for the evaluation of video-to-audio (V2A) generation models in Foley-style scenarios, i.e., modeling sound effects causally and temporally synchronized with visible, non-speech, non-music events in video. The dataset and its protocols address longstanding deficiencies in existing V2A benchmarks, which are dominated by speech/music and frequently lack clear audio-visual correspondence relevant to post-production, AR/VR, and sound design applications involving visible Foley effects (Dixit et al., 17 Nov 2025).

1. Construction and Design of the FoleyBench Dataset

FoleyBench leverages a multi-stage, automated pipeline to create a dataset containing 5,000 (video, ground-truth audio, text caption) triplets, each 8–10 seconds in length, along with an extended long-form set ("FoleyBench-Long") of 650 videos each 30 seconds in length.

  • Data collection uses Creative-Commons licensed videos sourced from YouTube (FineVideo, LVBench) and Vimeo (V3C1).
  • Scene detection is performed automatically, discarding segments shorter than 8 seconds to ensure sufficient temporal context.
  • Audio filtering with YAMNet removes clips dominated by speech or music, rejecting frames with a speech/music score ≥ 0.6; this stage eliminates 97.7% of the original candidate pool, and manual spot-checks indicate a post-filtering precision of 47% for true Foley clips (a minimal sketch of this filtering stage follows the list).
  • Audio-visual grounding (using Gemini 2.5 Pro) accepts only clips where sounds are causally and temporally tied to visible on-screen actions, evidenced by a 72% precision on validation data.
  • Yield: 5,000 clips for the core set (~12 hours), and an additional 650 longer clips for long-form evaluation (~5.4 hours). Each example preserves visible sound sources and strict temporal alignment, e.g., hammer impacts coinciding with visible contact.
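
The YAMNet filtering stage can be sketched as follows, assuming the TensorFlow Hub release of YAMNet. The 0.6 threshold comes from the description above; the acceptance policy and helper names are assumptions, not the authors' released pipeline code.

```python
# Illustrative sketch of the speech/music rejection stage, assuming the
# TensorFlow Hub release of YAMNet. The 0.6 threshold follows the text;
# the acceptance policy and function names are hypothetical.
import csv

import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def load_class_names(model) -> list:
    """Read YAMNet's class map (index, mid, display_name) shipped with the model."""
    path = model.class_map_path().numpy().decode("utf-8")
    with open(path) as f:
        return [row["display_name"] for row in csv.DictReader(f)]

CLASS_NAMES = load_class_names(yamnet)
SPEECH_MUSIC_IDX = [i for i, name in enumerate(CLASS_NAMES) if name in ("Speech", "Music")]

def is_foley_candidate(waveform: np.ndarray, threshold: float = 0.6) -> bool:
    """Reject a clip if any YAMNet frame scores >= threshold for speech or music.

    `waveform` is mono float32 audio at 16 kHz, as YAMNet expects.
    """
    scores, _, _ = yamnet(waveform)                      # scores: [frames, 521]
    frame_max = scores.numpy()[:, SPEECH_MUSIC_IDX].max()
    return frame_max < threshold
```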

In contrast to previous datasets (e.g., VGGSound), FoleyBench ensures robust coverage of Foley-relevant categories: only 13.4% of Universal Category System (UCS) categories have ≤ 3 clips (versus 24.3% in filtered VGGSound), and the Shannon entropy over UCS categories reaches 5.35 (vs. 4.73 for filtered VGGSound), demonstrating greater diversity relevant to Foley modeling.

| Dataset Segment | Number of Clips | Duration per Clip | Total Duration |
|---|---|---|---|
| FoleyBench (core) | 5,000 | 8–10 s | ~12 hours |
| FoleyBench-Long | 650 | 30 s | ~5.4 hours |
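
The Shannon-entropy diversity figure quoted above can be reproduced from per-category clip counts; a minimal sketch follows (the counts shown are placeholders, not FoleyBench statistics).

```python
# Shannon entropy (in bits) over the empirical UCS category distribution;
# the example counts are placeholders, not actual FoleyBench statistics.
import numpy as np

def category_entropy(counts: dict) -> float:
    p = np.asarray(list(counts.values()), dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

print(category_entropy({"DOORS": 120, "WATER": 95, "FOOTSTEPS": 310}))  # ≈ 1.38 bits
```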

2. Foley-Specific Taxonomy and Metadata

FoleyBench employs a comprehensive taxonomy and rich metadata schema to enable granular analysis of both dataset content and V2A model performance:

  • Sound category labeling leverages the UCS and AudioSet taxonomies at the top-class level.
  • Envelope classification distinguishes between "Discrete" (isolated impacts, e.g., door slam) and "Rest"/continuous sounds (e.g., running water).
  • Source complexity is annotated as "Single-source" or "Multi-source."
  • Acoustic focus is labeled as "Background," "Action," or "Combined."
  • Additional metadata includes a Gemini-generated one-line caption, a YAMNet coarse label for prompting, and detailed JSON-formatted labels containing all above attributes with rationale where applicable.

This approach accommodates both high-level category analysis and fine-grained breakdown of model strengths and failure modes, especially in challenging multi-source or continuous ambience contexts.
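
To make the schema concrete, a hypothetical per-clip record is sketched below; the field names and types are illustrative and may differ from the released metadata files.

```python
# Hypothetical per-clip metadata record mirroring the attributes listed above;
# field names are illustrative, not the released FoleyBench schema.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class FoleyClipMetadata:
    clip_id: str
    ucs_category: str                                            # Universal Category System label
    audioset_class: str                                          # AudioSet top-level class
    envelope: Literal["Discrete", "Rest"]                        # isolated impact vs. continuous sound
    source_complexity: Literal["Single-source", "Multi-source"]
    acoustic_focus: Literal["Background", "Action", "Combined"]
    caption: str                                                 # Gemini-generated one-line caption
    yamnet_coarse_label: str                                     # coarse label used for prompting
    rationale: Optional[str] = None                              # justification, where applicable

example = FoleyClipMetadata(
    clip_id="clip_000123",
    ucs_category="DOORS",
    audioset_class="Domestic sounds, home sounds",
    envelope="Discrete",
    source_complexity="Single-source",
    acoustic_focus="Action",
    caption="A wooden door slams shut.",
    yamnet_coarse_label="Door",
)
```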

3. Evaluation Metrics and Protocols

Evaluation of V2A models on FoleyBench is standardized along audio quality and cross-modal alignment axes, utilizing six principal metrics (all PANN-embedding-based unless stated otherwise):

  • Audio Quality:
  1. Fréchet Audio Distance (FAD): Quantifies the distributional distance between generated and real audio; lower is better.
  2. Inception Score (IS): Assesses the diversity and sharpness of audio events; higher indicates superior generation.
  3. Kullback–Leibler Divergence (KLD): Measures class-distribution match; lower values reflect improved alignment with ground-truth classes.
  • Cross-Modal Alignment:
  1. ImageBind Score (IB): Cosine similarity between video and audio embeddings; higher reflects superior semantic correspondence.
  2. CLAP Score: Cosine similarity between text caption and audio embeddings; reported for text-conditioned models only.
  3. De-Sync: Average absolute difference in event timing between audio and video; lower indicates tighter temporal synchronization.

In contrast to prior V2A benchmarks, which are largely insensitive to discrete impacts and fine alignment due to speech/music bias, FoleyBench's metrics and category coverage highlight discrete and ambient Foley effects—critical for application realism.

| Metric | Measures | Optimal Value |
|---|---|---|
| FAD | Audio distributional similarity | Lower |
| IS | Audio event diversity/sharpness | Higher |
| KLD | Class-distribution match | Lower |
| IB | Video–audio semantic alignment | Higher |
| CLAP | Text–audio semantic alignment | Higher |
| De-Sync | Temporal alignment (s) | Lower |
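
The quality and alignment metrics above can be approximated with short reference implementations. The sketch below is illustrative only: embedding extraction (PANN, ImageBind, CLAP) is assumed to happen upstream, and the benchmark's De-Sync metric relies on a dedicated audio-visual synchronization estimator rather than the naive event-time comparison shown here.

```python
# Illustrative metric computations; embeddings from PANN / ImageBind / CLAP
# are assumed to be extracted elsewhere. Not the benchmark's reference code.
import numpy as np
from scipy import linalg

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)); lower is better."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean.real))

def cosine_alignment(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """IB / CLAP style score: cosine similarity between two embeddings; higher is better."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def naive_desync(audio_event_times: np.ndarray, video_event_times: np.ndarray) -> float:
    """Mean absolute offset (seconds) between matched audio and video events; lower is better."""
    return float(np.mean(np.abs(audio_event_times - video_event_times)))
```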

4. Benchmarking and Analysis of V2A Models

Twelve state-of-the-art V2A models spanning autoregressive, diffusion, flow, and adapter-based architectures have been evaluated on the FoleyBench 5,000-clip test split:

  • MMAudio (flow-matching, joint video–text–audio model) achieves the best FAD (8.76), IS (11.2), and KLD (2.43), along with a second-best IB (0.306) and a De-Sync of 0.447 s.
  • Seeing & Hearing attains the highest IB (0.371) via masked spectrograms with test-time ImageBind optimization.
  • CAFA (ControlNet adapter) accomplishes strong CLAP (0.270) and leads among adapters for FAD (15.5).
  • V-AURA (autoregressive) achieves the strongest temporal synchronization among the autoregressive models (De-Sync = 0.716 s).

Fine-grained ablations on MMAudio indicate:

  • For "Discrete" impacts, De-Sync improves (0.458 → 0.390 s), but FAD worsens (9.02 → 16.35) and IS declines (10.3 → 8.8).
  • "Background" ambiences see degraded sync (0.405 → 0.636 s) and IS (11.98 → 4.61), but improved KLD (2.54 → 1.98).
  • Multi-source scenes improve semantic IB (0.296 → 0.324), while FAD (9.84 → 11.34) and De-Sync (0.436 → 0.467 s) deteriorate.
  • Introduction of text conditioning (SpecMaskFoley ablation) uniformly improves all metrics (e.g., FAD 23.18 → 19.60).

Long-form generation (FoleyBench-Long) is supported by LOVA, VTA-LDM, and MMAudio. Notably, MMAudio's FAD degrades on long-form content (8.76 → 27.5), but it retains the best semantic alignment and synchronization (IB = 0.239, De-Sync = 0.638 s). LOVA yields the best long-form FAD (26.2) and IS (5.02) at the cost of alignment quality (Dixit et al., 17 Nov 2025).
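
Such per-attribute breakdowns can be reproduced by slicing per-clip results along the metadata fields from Section 2. A minimal sketch with hypothetical file and column names follows; distribution-level metrics such as FAD would be recomputed per subset rather than averaged per clip.

```python
# Hypothetical per-attribute slicing of evaluation results; the file and
# column names are assumptions. Only per-clip metrics are averaged here.
import pandas as pd

results = pd.read_csv("foleybench_per_clip_results.csv")  # assumed export: one row per clip

breakdown = (
    results
    .groupby(["envelope", "source_complexity"])[["desync_s", "ib_score"]]
    .mean()
    .round(3)
)
print(breakdown)
```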

5. Implications and Future Research Directions

Major insights from FoleyBench include:

  • Temporal alignment is within reach for state-of-the-art models, yet accurate rendering of "what" the sound should be—especially for discrete impacts and complex ambiences—remains challenging.
  • Multi-source scenes often yield audio "mash-ups" that, while semantically plausible, lack the auditory fidelity characteristic of true Foley.
  • Ambient/background environments and long-form audio coherence constitute open challenges.
  • Text conditioning functions as a semantic prior, signaling that purely visual V2A approaches are often under-constrained.
  • Standard datasets hide these nuanced failure modes due to speech/music dominance and sparse Foley-relevant coverage.

Suggested directions for advancing the field include:

  • Development of visual encoders with explicit modeling of object materials and interaction forces.
  • Enhanced ambient synthesis modules leveraging environmental sound libraries.
  • Improved disentanglement and mixing strategies for multi-source scenarios.
  • Joint pretraining of video, text, and audio modalities at scale to capture semantic context more robustly.
  • Architecture innovations targeting long-form consistency, such as memory-augmented diffusion or autoregressive designs.
  • Ongoing expansion and refinement of FoleyBench, with more sound categories, extended durations, and validation involving human experts (Dixit et al., 17 Nov 2025).

By establishing a rigorously curated, taxonomically diverse, and causally-grounded benchmark with precise evaluation protocols, FoleyBench provides a foundation for advancing Foley-style V2A research and for diagnosing failure modes absent from prior datasets.
