HunyuanVideo-Foley: End-to-End Audio Synthesis
- HunyuanVideo-Foley is a novel framework for synchronized Foley audio generation that integrates scalable data pipelines, self-supervised alignment, and dual-stream transformers.
- It employs latent diffusion training with a representation alignment (REPA) loss on self-supervised audio features to stabilize training and enhance semantic fidelity across modalities.
- Comprehensive benchmarks show state-of-the-art performance in audio quality, temporal synchronization, and multimodal semantic alignment.
HunyuanVideo-Foley is an end-to-end framework for synchronized Foley audio generation from videos and text, designed to address core challenges in multimodal sound synthesis: data scarcity, modality imbalance, and limited audio quality. The system leverages a scalable curation pipeline for acquiring 100,000 hours of text–video–audio pairs, introduces a self-supervised representation alignment strategy for stable and high-fidelity latent diffusion training, and utilizes a novel multimodal diffusion transformer for joint audio–video fusion and textual semantic injection. Comprehensive benchmarks demonstrate state-of-the-art performance in audio quality, semantic alignment, and temporal synchronization (Shan et al., 23 Aug 2025).
1. Scalable Data Pipeline for Multimodal Annotation
The HunyuanVideo-Foley pipeline performs multi-stage automated curation:
- Input Curation: Raw video datasets are processed, discarding videos without audio.
- Scene Detection/Segmentation: Long videos are segmented into 8-second clips via scene detection. Segments whose silence ratio exceeds 80% are filtered out, ensuring sufficient audio activity.
- Quality Assurance: Bandwidth analysis enforces an effective sampling rate of at least 32 kHz. Signal-to-noise ratio (SNR) screening and the AudioBox aesthetics toolkit are applied for further filtering; a sketch of representative clip-level filters appears below.
- Automated Annotation: Audio classification and speech–music detection models tag segments for supervised balancing. Audio captions are generated via GenAU, allowing for subsequent text conditioning and semantic preservation.
This curation strategy enables diverse and robust training, overcoming previously limited or subjectively annotated multimodal datasets.
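As a concrete illustration, the silence-ratio and bandwidth gates described above could be implemented roughly as follows. The function names, thresholds, and the energy-based silence heuristic are assumptions for illustration, not the released pipeline, which additionally applies SNR and AudioBox aesthetic filtering.

```python
# Illustrative sketch of clip-level curation gates; names and thresholds are assumptions.
import numpy as np

SILENCE_RATIO_MAX = 0.8       # drop clips that are >80% silent
MIN_EFFECTIVE_SR_HZ = 32_000  # minimum effective bandwidth, expressed as a sampling rate

def silence_ratio(waveform: np.ndarray, sr: int,
                  frame_ms: float = 20.0, db_floor: float = -50.0) -> float:
    """Fraction of short frames whose RMS energy falls below a silence floor."""
    frame = max(1, int(sr * frame_ms / 1000.0))
    n = len(waveform) // frame
    if n == 0:
        return 1.0
    frames = waveform[: n * frame].reshape(n, frame)
    rms_db = 20.0 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-10)
    return float(np.mean(rms_db < db_floor))

def effective_sample_rate(waveform: np.ndarray, sr: int,
                          energy_cutoff: float = 0.99) -> float:
    """Estimate bandwidth as the frequency below which 99% of spectral energy
    lies, doubled to express it as an equivalent sampling rate."""
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sr)
    cdf = np.cumsum(spectrum) / (np.sum(spectrum) + 1e-10)
    idx = min(np.searchsorted(cdf, energy_cutoff), len(freqs) - 1)
    return 2.0 * float(freqs[idx])

def keep_clip(waveform: np.ndarray, sr: int) -> bool:
    """Apply the silence-ratio and bandwidth gates from the curation pipeline."""
    if silence_ratio(waveform, sr) > SILENCE_RATIO_MAX:
        return False
    if effective_sample_rate(waveform, sr) < MIN_EFFECTIVE_SR_HZ:
        return False
    return True  # SNR and AudioBox aesthetic scoring would follow here
```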
2. Representation Alignment via Self-Supervised Audio Features
High-fidelity audio generation is achieved by aligning diffusion model representations with self-supervised audio features:
- Feature Extraction: Frame-level semantic and acoustic features are derived from a pretrained ATST-Frame encoder.
- Latent Alignment: During diffusion training, intermediate latent representations from the DiT blocks are projected via an MLP $h_\phi$ and compared with the frame-level ATST features using the REPA loss:

$$\mathcal{L}_{\mathrm{REPA}} = -\,\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N} \cos\!\big(h_\phi(z^{(n)}),\, a^{(n)}\big)\right],$$

where $z^{(n)}$ denotes the intermediate latent at frame token $n$, $a^{(n)}$ the corresponding ATST feature, and $N$ the sequence length.
- Significance: This maximizes cosine similarity, stabilizing training, enhancing audio fidelity, and ensuring the latent encoding reflects semantic and acoustic nuances present in authentic audio data.
A plausible implication is that such explicit representation alignment helps close the semantic gap between the target Foley sound and its conditioning modalities; a minimal sketch of the loss follows.
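The sketch below implements a REPA-style alignment loss in PyTorch, assuming a small MLP projector and pre-extracted, frozen ATST-Frame features; tensor shapes and the projector architecture are illustrative assumptions.

```python
# Hedged sketch of a REPA-style loss: negative mean cosine similarity between
# MLP-projected DiT hidden states and frozen ATST-Frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class REPAProjector(nn.Module):
    """Small MLP mapping DiT hidden states into the ATST feature space (assumed shape)."""
    def __init__(self, dit_dim: int, ssl_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dit_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, ssl_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def repa_loss(dit_hidden: torch.Tensor,     # (B, N, dit_dim) intermediate DiT states
              atst_features: torch.Tensor,  # (B, N, ssl_dim) frozen ATST-Frame features
              projector: REPAProjector) -> torch.Tensor:
    """Negative cosine similarity, averaged over frames and batch."""
    projected = projector(dit_hidden)                             # (B, N, ssl_dim)
    cos = F.cosine_similarity(projected, atst_features, dim=-1)   # (B, N)
    return -cos.mean()

# Usage: total_loss = diffusion_loss + lambda_repa * repa_loss(h, a, projector)
```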
3. Multimodal Diffusion Transformer Architecture
The framework’s key architectural advancement is the multimodal diffusion transformer, designed to fuse visual, audio, and textual signals with fine-grained temporal and semantic precision:
- Hybrid Transformer Design:
- Multimodal diffusion transformer (MMDiT) blocks perform dual-stream fusion of audio and video tokens.
- Unimodal audio DiT blocks further refine the audio latents.
- Feature Encoders:
- Visual Encoding: SigLIP-2 encodes video frames into visual features.
- Text Encoding: CLAP embeddings deliver global textual semantics.
- Audio Encoding: DAC-VAE compresses audio waveforms into latent codes.
- Temporal Fusion: Joint sequences are formed by interleaving audio and visual tokens in time, using rotary position embeddings (RoPE). For time-aligned audio tokens $a_{1:T}$ and visual tokens $v_{1:T}$, the interleaved sequence is $[a_1, v_1, a_2, v_2, \ldots, a_T, v_T]$, so that co-occurring tokens sit at adjacent rotary positions (see the sketch after this list).
- Attention Mechanisms:
- Self-attention in early layers integrates local temporal information across audio and visual tokens.
- Cross-attention in later stages injects global textual semantics.
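A minimal sketch of the temporal interleaving step, assuming audio and visual tokens have already been projected to a shared width and resampled to a common temporal grid (both assumptions of this illustration):

```python
# Minimal sketch of temporal token interleaving for joint audio-visual attention.
import torch

def interleave_av(audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
    """Interleave per-timestep tokens: (B, T, D) + (B, T, D) -> (B, 2T, D).

    The result orders tokens as [a_1, v_1, a_2, v_2, ...], so rotary position
    embeddings applied over the joint sequence keep co-occurring audio and
    visual tokens at adjacent positions.
    """
    B, T, D = audio.shape
    assert video.shape == (B, T, D), "expects time-aligned token streams"
    stacked = torch.stack((audio, video), dim=2)   # (B, T, 2, D)
    return stacked.reshape(B, 2 * T, D)

# Example: an 8-second clip with T=64 aligned steps and model width D=1024 (assumed values)
audio_tokens = torch.randn(2, 64, 1024)
video_tokens = torch.randn(2, 64, 1024)
joint = interleave_av(audio_tokens, video_tokens)   # (2, 128, 1024)
```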
This dual-phase strategy resolves modal competition—where text might otherwise overpower temporally detailed audio-visual cues—by separately handling frame-level fusion and semantic injection.
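The paper applies joint self-attention in early layers and text cross-attention in later stages; the sketch below compresses both phases into a single illustrative block to show how frame-level fusion and semantic injection can be kept separate. Layer sizes and block composition are assumptions, not the published architecture.

```python
# Hedged sketch of the dual-phase attention pattern: joint self-attention over
# the interleaved audio-visual sequence, then cross-attention to text tokens.
import torch
import torch.nn as nn

class DualPhaseBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, av_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Phase 1: frame-level audio-visual fusion via joint self-attention.
        h = self.norm1(av_tokens)
        av_tokens = av_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # Phase 2: global semantic injection via cross-attention to text.
        h = self.norm2(av_tokens)
        av_tokens = av_tokens + self.cross_attn(h, text_tokens, text_tokens,
                                                need_weights=False)[0]
        return av_tokens + self.mlp(self.norm3(av_tokens))

# Usage with the interleaved sequence from the previous sketch:
# block = DualPhaseBlock(); out = block(joint, torch.randn(2, 1, 1024))
```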
4. Synchronization and Feature Modulation
HunyuanVideo-Foley incorporates additional conditioning and modulation strategies:
- Synchronization Features: Timestep alignment and synchronization features from Synchformer are introduced.
- Dynamic Modulation: Layer outputs are adaptively gated and normalized using modulation parameters computed from the conditioning signal:

$$(\gamma,\ \beta,\ g) = W\,\mathrm{SiLU}(c), \qquad h' = g \odot \big(\gamma \odot \mathrm{LayerNorm}(h) + \beta\big),$$

where $c$ is the flow feature input, $W$ denotes the learned projection matrices, and $\mathrm{SiLU}$ is the Sigmoid-weighted Linear Unit (a code sketch follows this list).
- Effect: These modulations maintain strict temporal alignment between modalities across attention blocks, enhancing synchronization quality.
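A sketch of this adaptive gated modulation in the adaLN style, assuming the scale, shift, and gate parameters are regressed from the flow/conditioning feature with a SiLU projection; the exact parameterization in HunyuanVideo-Foley may differ.

```python
# Illustrative adaLN-style gated modulation conditioned on a flow feature.
import torch
import torch.nn as nn

class AdaptiveModulation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # One projection producing scale (gamma), shift (beta), and gate (g).
        self.to_params = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, N, dim) layer output, c: (B, dim) flow/conditioning feature
        gamma, beta, gate = self.to_params(c).chunk(3, dim=-1)        # each (B, dim)
        gamma, beta, gate = (x.unsqueeze(1) for x in (gamma, beta, gate))
        return gate * (gamma * self.norm(h) + beta)                   # (B, N, dim)
```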
5. Empirical Evaluation and Benchmarks
Performance is rigorously evaluated across Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench, with the following key metrics:
| Category | Metric(s) | Purpose |
|---|---|---|
| Distribution Matching | Fréchet Distance (FD), KL Divergence | Similarity of generated and real audio distributions |
| Audio Quality | Inception Score (IS), PQ/PC | Fidelity, complexity, aesthetics |
| Visual-Semantic Alignment | ImageBind Cosine Similarity | Audio–video semantic coherence |
| Temporal Alignment | DeSync (Synchformer) | Event-level synchronization |
| Text-Semantic Consistency | CLAP Score | Fidelity to the text prompt |
HunyuanVideo-Foley consistently sets new state-of-the-art values across these dimensions, outperforming prior V2A and TV2A models both in overall audio quality and modality alignment.
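For reference, the cosine-similarity style alignment scores (e.g., ImageBind audio–video similarity, CLAP text–audio score) reduce to a mean pairwise cosine similarity over precomputed embeddings. The sketch below assumes the embeddings have already been extracted and says nothing about the specific encoders.

```python
# Mean pairwise cosine similarity between matched embedding rows (hypothetical data).
import numpy as np

def alignment_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Average cosine similarity between row i of emb_a and row i of emb_b."""
    a = emb_a / (np.linalg.norm(emb_a, axis=-1, keepdims=True) + 1e-10)
    b = emb_b / (np.linalg.norm(emb_b, axis=-1, keepdims=True) + 1e-10)
    return float(np.mean(np.sum(a * b, axis=-1)))

# e.g. audio vs. video embeddings for the same clips (shapes are assumptions)
audio_emb = np.random.randn(100, 512)
video_emb = np.random.randn(100, 512)
print(alignment_score(audio_emb, video_emb))
```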
6. Comparison, Impact, and Extensions
Relative to prior methods such as AutoFoley (Ghose et al., 2020), FoleyGAN (Ghose et al., 2021), Diff-Foley (Luo et al., 2023), Video-Foley (Lee et al., 21 Aug 2024), CAFA (Benita et al., 9 Apr 2025), and FoleySpace (Zhao et al., 18 Aug 2025), HunyuanVideo-Foley offers:
- Scalable multimodal data coverage via automated annotation and curation.
- Integrated representation alignment for robust training and quality improvements (using REPA loss).
- Explicit dual-stage transformer attention for resolving modality imbalance in TV2A synthesis.
- Comprehensive evaluation protocols demonstrating best-in-class temporal and semantic alignment.
These innovations indicate that multimodal representation alignment and dual-stream attention are highly effective in large-scale Foley audio generation contexts. Potential extensions include expanding the modality space (e.g., spatial audio, physical simulation inputs) and exploring real-time streaming applications.
7. Future Research Directions
Future work highlighted includes:
- Further dataset scaling and diversification to cover new genres and environments.
- Advanced fusion strategies for multi-source and spatial audio, possibly incorporating region-level video conditioning or external semantic controls.
- Model optimization for lower latency, enabling interactive or real-time media editing contexts.
- Exploration of human-in-the-loop paradigms, leveraging annotation/refinement by Foley artists for collaborative production pipelines.
This suggests that HunyuanVideo-Foley could substantially influence post-production workflows across film, gaming, and immersive media, establishing benchmarks for high-fidelity, contextually integrated machine Foley synthesis.