FoleyBench: Benchmark for Video-to-Audio Models
- FoleyBench is a comprehensive benchmark comprising 5,000 short clips plus an extended long-form set, curated for precise temporal alignment and broad coverage of Foley-style categories.
- It utilizes an automated multi-stage pipeline with scene detection, YAMNet filtering, and Gemini grounding to ensure robust audio-visual correspondence.
- Evaluation with metrics such as FAD, IS, and CLAP exposes model behavior on discrete impacts, multi-source scenes, and long-form consistency in V2A generation.
FoleyBench is a large-scale benchmark specifically designed for the evaluation of video-to-audio (V2A) generation models in Foley-style scenarios, i.e., modeling sound effects causally and temporally synchronized with visible, non-speech, non-music events in video. The dataset and its protocols address longstanding deficiencies in existing V2A benchmarks, which are dominated by speech/music and frequently lack clear audio-visual correspondence relevant to post-production, AR/VR, and sound design applications involving visible Foley effects (Dixit et al., 17 Nov 2025).
1. Construction and Design of the FoleyBench Dataset
FoleyBench leverages a multi-stage, automated pipeline to create a dataset containing 5,000 (video, ground-truth audio, text caption) triplets, each 8–10 seconds in length, along with an extended long-form set ("FoleyBench-Long") of 650 videos each 30 seconds in length.
- Data collection uses Creative-Commons licensed videos sourced from YouTube (FineVideo, LVBench) and Vimeo (V3C1).
- Scene detection is performed automatically, discarding segments shorter than 8 seconds to ensure sufficient temporal context.
- Audio filtering with YAMNet removes clips dominated by speech or music, rejecting clips whose frames score speech/music ≥ 0.6; this stage eliminates 97.7% of the original candidate pool. Manual spot-checks indicate a post-filtering precision of 47% for true Foley clips (see the filtering sketch after this list).
- Audio-visual grounding (using Gemini 2.5 Pro) accepts only clips where sounds are causally and temporally tied to visible on-screen actions, evidenced by a 72% precision on validation data.
- Yield: 5,000 clips for the core set (~12 hours), and an additional 650 longer clips for long-form evaluation (~5.4 hours). Each example preserves visible sound sources and strict temporal alignment, e.g., hammer impacts coinciding with visible contact.
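To make the first two filtering stages concrete, the sketch below combines scene-length filtering with YAMNet-based speech/music rejection. It assumes PySceneDetect for cut detection and the public TF-Hub YAMNet checkpoint with 16 kHz mono input; the benchmark's exact tooling, label set, and score aggregation are not specified beyond the description above, so treat those choices as illustrative.

```python
# Illustrative sketch of the scene-length and speech/music filtering stages.
# Assumptions (not confirmed by the paper): PySceneDetect for cut detection,
# the public TF-Hub YAMNet checkpoint, 16 kHz mono waveforms, and rejection
# of any clip whose per-frame "Speech" or "Music" score reaches 0.6.
import csv

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from scenedetect import ContentDetector, detect

# Load the public YAMNet checkpoint (521 AudioSet classes, 16 kHz mono input).
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")


def yamnet_class_names(model) -> list[str]:
    """Read the 521 AudioSet display names shipped with the YAMNet model."""
    path = model.class_map_path().numpy().decode("utf-8")
    with tf.io.gfile.GFile(path) as f:
        return [row["display_name"] for row in csv.DictReader(f)]


CLASS_NAMES = yamnet_class_names(yamnet)
# Assumed reject set; the benchmark may use a broader list of speech/music labels.
REJECT_IDX = [i for i, name in enumerate(CLASS_NAMES) if name in {"Speech", "Music"}]


def long_enough_scenes(video_path: str, min_len_s: float = 8.0):
    """Detect scene cuts and keep only scenes lasting at least min_len_s seconds."""
    scenes = detect(video_path, ContentDetector())
    return [
        (start, end)
        for start, end in scenes
        if end.get_seconds() - start.get_seconds() >= min_len_s
    ]


def passes_speech_music_filter(waveform_16k: np.ndarray, threshold: float = 0.6) -> bool:
    """Reject a clip if any YAMNet frame scores Speech or Music at or above threshold."""
    scores, _, _ = yamnet(waveform_16k.astype(np.float32))  # scores: (frames, 521)
    worst = tf.reduce_max(tf.gather(scores, REJECT_IDX, axis=1))
    return bool(worst < threshold)
```

The Gemini 2.5 Pro grounding stage then operates only on clips that survive this filter; only the thresholds above (minimum 8 s scenes, speech/music score ≥ 0.6) are taken from the benchmark description.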
In contrast to previous datasets (e.g., VGGSound), FoleyBench provides broader coverage of Foley-relevant categories: only 13.4% of Universal Category System (UCS) categories have ≤ 3 clips (versus 24.3% in filtered VGGSound), and the Shannon entropy over UCS categories reaches 5.35 (vs. 4.73 for filtered VGGSound), indicating greater diversity of Foley-relevant material.
| Dataset Segment | Number of Clips | Duration per Clip | Total Duration |
|---|---|---|---|
| FoleyBench (core) | 5,000 | 8–10 s | ~12 hours |
| FoleyBench-Long | 650 | 30 s | ~5.4 hours |
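The diversity figure quoted above is the Shannon entropy of the empirical UCS category distribution; a minimal computation sketch follows, assuming a base-2 logarithm (the base is not stated in the source).

```python
# Minimal sketch: Shannon entropy of a dataset's category distribution,
# assuming a base-2 logarithm (the benchmark's choice of base is not stated).
from collections import Counter

import numpy as np


def category_entropy(labels: list[str]) -> float:
    """H = -sum_c p(c) * log2 p(c) over the empirical category distribution."""
    counts = np.array(list(Counter(labels).values()), dtype=np.float64)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


# A flatter distribution over UCS classes yields higher entropy, which is the
# sense in which FoleyBench (5.35) exceeds filtered VGGSound (4.73).
print(category_entropy(["IMPACTS", "WATER", "WATER", "DOORS"]))  # 1.5 bits
```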
2. Foley-Specific Taxonomy and Metadata
FoleyBench employs a comprehensive taxonomy and rich metadata schema to enable granular analysis of both dataset content and V2A model performance:
- Sound category labeling leverages the UCS and AudioSet taxonomies at the top-class level.
- Envelope classification distinguishes "Discrete" sounds (isolated impacts, e.g., a door slam) from continuous ("Rest") sounds (e.g., running water).
- Source complexity is annotated as "Single-source" or "Multi-source."
- Acoustic focus is labeled as "Background," "Action," or "Combined."
- Additional metadata includes a Gemini-generated one-line caption, a coarse YAMNet label used for prompting, and detailed JSON-formatted labels containing all of the above attributes, with a rationale where applicable.
This approach accommodates both high-level category analysis and fine-grained breakdown of model strengths and failure modes, especially in challenging multi-source or continuous ambience contexts.
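The metadata schema described above can be pictured with the following illustrative record, written as a Python dict mirroring the JSON labels; all field names and values here are hypothetical and may differ from the released annotations.

```python
# Hypothetical example of one clip's metadata record; field names and values
# are illustrative and may not match the released FoleyBench annotations.
example_record = {
    "clip_id": "foleybench_000123",                       # hypothetical identifier
    "caption": "A hammer strikes a wooden plank twice.",  # Gemini one-line caption
    "yamnet_label": "Wood",                               # coarse YAMNet label used for prompting
    "ucs_category": "IMPACTS",                            # top-level UCS class
    "audioset_category": "Sounds of things",              # top-level AudioSet class
    "envelope": "Discrete",                               # "Discrete" vs. continuous ("Rest")
    "source_complexity": "Single-source",                 # or "Multi-source"
    "acoustic_focus": "Action",                           # "Background", "Action", or "Combined"
    "rationale": "Both impacts coincide with visible hammer contact.",
}
```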
3. Evaluation Metrics and Protocols
Evaluation of V2A models on FoleyBench is standardized along audio quality and cross-modal alignment axes, utilizing six principal metrics (all PANN-embedding-based unless stated otherwise):
- Audio Quality:
- Fréchet Audio Distance (FAD): Quantifies the distributional distance between generated and real audio; lower is better.
- Inception Score (IS): Assesses the diversity and sharpness of audio events; higher indicates superior generation.
- Kullback–Leibler Divergence (KLD): Measures class-distribution match; lower values reflect improved alignment with ground-truth classes.
- Cross-Modal Alignment:
- ImageBind Score (IB): Cosine similarity between video and audio embeddings; higher reflects superior semantic correspondence.
- CLAP Score: Cosine similarity between text caption and audio embeddings; reported for text-conditioned models only.
- De-Sync: Average absolute difference in event timing between audio and video; lower indicates tighter temporal synchronization.
In contrast to prior V2A benchmarks, whose speech/music bias makes them largely insensitive to discrete impacts and fine-grained alignment, FoleyBench's metrics and category coverage foreground discrete and ambient Foley effects, which are critical for realism in downstream applications.
| Metric | Measures | Optimal Value |
|---|---|---|
| FAD | Audio distributional similarity | Lower |
| IS | Audio event diversity/sharpness | Higher |
| KLD | Class-distribution match | Lower |
| IB | Video-audio semantic alignment | Higher |
| CLAP | Text-audio semantic alignment | Higher |
| De-Sync | Temporal alignment (sec) | Lower |
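For concreteness, the sketch below shows the two core computations behind these metric families: FAD over pre-extracted audio embeddings (e.g., PANN clip embeddings) and the cosine similarity underlying the IB and CLAP scores. Official toolkits handle embedding extraction and numerical details, so this is a simplified illustration only.

```python
# Simplified illustration of the metric computations, assuming embeddings have
# already been extracted: ref_emb/gen_emb are (N, D) audio embeddings (e.g., PANN),
# and a/b are paired embeddings from ImageBind (video/audio) or CLAP (text/audio).
import numpy as np
from scipy.linalg import sqrtm


def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)); lower is better."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    cov_sqrt = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(cov_sqrt):      # drop tiny imaginary parts introduced by sqrtm
        cov_sqrt = cov_sqrt.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * cov_sqrt))


def cosine_alignment(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings; higher is better for IB/CLAP scores."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```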
4. Benchmarking and Analysis of V2A Models
Twelve state-of-the-art V2A models spanning autoregressive, diffusion, flow-matching, and adapter-based architectures have been evaluated on the FoleyBench 5,000-clip test split:
- MMAudio (a flow-matching video–text–audio joint model) achieves the best FAD (8.76), IS (11.2), and KLD (2.43), together with second-best IB (0.306) and strong temporal synchronization (De-Sync = 0.447 s).
- Seeing & Hearing attains the highest IB (0.371) via masked spectrograms with test-time ImageBind optimization.
- CAFA (ControlNet adapter) attains a strong CLAP score (0.270) and leads adapter-based models on FAD (15.5).
- V-AURA achieves the tightest temporal synchronization among autoregressive models (De-Sync = 0.716 s).
Fine-grained ablations on MMAudio indicate:
- For "Discrete" impacts, De-Sync improves (0.458 → 0.390 s), but FAD worsens (9.02 → 16.35) and IS declines (10.3 → 8.8).
- "Background" ambiences see degraded sync (0.405 → 0.636 s) and IS (11.98 → 4.61), but improved KLD (2.54 → 1.98).
- Multi-source scenes improve semantic IB (0.296 → 0.324), while FAD (9.84 → 11.34) and De-Sync (0.436 → 0.467 s) deteriorate.
- Introduction of text conditioning (SpecMaskFoley ablation) uniformly improves all metrics (e.g., FAD 23.18 → 19.60).
Long-form generation (FoleyBench-Long) is feasible for LOVA, VTA-LDM, and MMAudio. Notably, MMAudio's FAD degrades from 8.76 to 27.5 on long-form clips, but it maintains the best semantic alignment (IB = 0.239) and synchronization (De-Sync = 0.638 s). LOVA yields the best long-form FAD (26.2) and IS (5.02), trading off alignment quality (Dixit et al., 17 Nov 2025).
5. Implications and Future Research Directions
Major insights from FoleyBench include:
- Temporal alignment is within reach for state-of-the-art models, yet accurate rendering of "what" the sound should be—especially for discrete impacts and complex ambiences—remains challenging.
- Multi-source scenes often yield audio "mash-ups" that, while semantically plausible, lack auditory fidelity characteristic of true Foley.
- Ambient/background environments and long-form audio coherence constitute open challenges.
- Text conditioning functions as a semantic prior, signaling that purely visual V2A approaches are often under-constrained.
- Standard datasets hide these nuanced failure modes due to speech/music dominance and sparse Foley-relevant coverage.
Suggested directions for advancing the field include:
- Development of visual encoders with explicit modeling of object materials and interaction forces.
- Enhanced ambient synthesis modules leveraging environmental sound libraries.
- Improved disentanglement and mixing strategies for multi-source scenarios.
- Joint pretraining of video, text, and audio modalities at scale to capture semantic context more robustly.
- Architecture innovations targeting long-form consistency, such as memory-augmented diffusion or autoregressive designs.
- Ongoing expansion and refinement of FoleyBench, with more sound categories, extended durations, and validation involving human experts (Dixit et al., 17 Nov 2025).
By establishing a rigorously curated, taxonomically diverse, and causally-grounded benchmark with precise evaluation protocols, FoleyBench provides a foundation for advancing Foley-style V2A research and for diagnosing failure modes absent from prior datasets.