FoleyGRAM: Aligned Audio-Visual Generation
- FoleyGRAM is a multimodal generative audio model that synthesizes synchronized Foley effects by aligning video, text, and audio in a unified latent space.
- It employs diffusion-based audio synthesis with a ControlNet-inspired module for precise temporal control, ensuring sound events match visual timings.
- The model introduces the Gramian Representation Alignment Measure (GRAM) to robustly align semantic embeddings, outperforming traditional pairwise methods on benchmark evaluations.
FoleyGRAM is a multimodal generative audio model designed to synthesize Foley sound effects that are semantically and temporally synchronized with visual events. Developed to advance the state of the art in video-to-audio generation, FoleyGRAM integrates diffusion-based audio synthesis conditioned on video, text, and audio modalities using unified alignment via the Gramian Representation Alignment Measure (GRAM). The model architecture couples domain-specific encoders with semantic and temporal control modules to produce audio that closely aligns with input video semantics and timing, addressing the substantial limitations of prior pairwise alignment frameworks and contributing new evaluation benchmarks on standardized datasets.
1. Multimodal Semantic Alignment with GRAM
FoleyGRAM employs three distinct encoders (EVAClip-ViT-G for video, BERT-B for text, and BEATs for audio), jointly trained to map their respective inputs into a unified latent space. The Gramian Representation Alignment Measure (GRAM) forms the core of this alignment strategy. Rather than aligning modalities with traditional pairwise cosine similarity, GRAM models the joint embedding space as a high-dimensional parallelotope whose Gram matrix $G$ collects the inner products of the modality embedding vectors. The volume of this parallelotope, $\mathrm{Vol} = \sqrt{\det G}$, quantifies semantic alignment: smaller volumes indicate tighter alignment. Contrastive losses minimize this volume for matched triplets and maximize it for mismatched ones, giving FoleyGRAM robust semantic control over conditioning that prior approaches could not achieve with disjoint or anchor-based latent spaces (Gramaccioni et al., 7 Oct 2025).
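The following minimal sketch illustrates the volume-based objective described above. Tensor shapes, the hinge-style margin formulation, and all function names are illustrative assumptions, not the paper's released training code.

```python
import torch

def gram_volume(embeddings: torch.Tensor) -> torch.Tensor:
    """Volume of the parallelotope spanned by k modality embeddings.

    embeddings: (batch, k, d) tensor of L2-normalized vectors,
    e.g. k = 3 for (video, text, audio). Smaller volume => better alignment.
    """
    # Gram matrix of pairwise inner products, shape (batch, k, k)
    gram = embeddings @ embeddings.transpose(1, 2)
    # Volume = sqrt(det(G)); clamp for numerical stability
    return torch.sqrt(torch.clamp(torch.det(gram), min=1e-8))

def gram_contrastive_loss(matched: torch.Tensor,
                          mismatched: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Hinge-style contrastive term: shrink the volume of matched
    (video, text, audio) triplets while growing it for mismatched ones.
    The exact loss used by FoleyGRAM may differ in form."""
    return torch.relu(margin + gram_volume(matched) - gram_volume(mismatched)).mean()
```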
2. Diffusion-Based Audio Synthesis and Temporal Conditioning
The generative backbone of FoleyGRAM is a latent diffusion model (based on Stable Audio) adapted for high-fidelity, 44.1 kHz stereo synthesis. Semantic conditioning is executed via cross-attention on the GRAM-aligned latent space. For temporal control, FoleyGRAM computes a waveform envelope signal from the ground-truth audio using a frame-wise RMS window:

$$e[m] = \sqrt{\frac{1}{W}\sum_{n=0}^{W-1} x[mH + n]^2},$$

where $W$ denotes the window size and $H$ is the hop size. This control signal is processed through a ControlNet-inspired module and guides the diffusion process, ensuring proper alignment between generated sound events (such as impacts or movements) and their video timing. The full reverse diffusion step is expressed as:
$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\!\left(z_t, t, c_{\mathrm{sem}}, c_{\mathrm{temp}}\right)\right) + \sigma_t\,\epsilon,$$

where $c_{\mathrm{sem}}$ aggregates the multimodal semantic embeddings and $c_{\mathrm{temp}}$ is the encoded temporal control vector. This dual conditioning mechanism achieves both semantic richness and temporal fidelity in the output audio (Gramaccioni et al., 7 Oct 2025).
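As a concrete illustration of the RMS envelope above, the sketch below computes a frame-wise envelope from a mono waveform. The window and hop sizes are illustrative defaults, not the settings reported for FoleyGRAM.

```python
import numpy as np

def rms_envelope(x: np.ndarray, window: int = 2048, hop: int = 512) -> np.ndarray:
    """Frame-wise RMS envelope e[m] of a mono waveform x, usable as a
    temporal control signal. window = W, hop = H in the equation above."""
    n_frames = 1 + max(0, (len(x) - window) // hop)
    env = np.empty(n_frames)
    for m in range(n_frames):
        frame = x[m * hop : m * hop + window]
        env[m] = np.sqrt(np.mean(frame ** 2))  # RMS over the current window
    return env
```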
3. Evaluation Paradigms and Benchmarking
FoleyGRAM is evaluated on the Greatest Hits dataset, a benchmark comprising videos of physical surface interactions (striking, rubbing) with associated text metadata. The framework uses three key objective metrics:
| Metric | Evaluator / Basis | Purpose |
|---|---|---|
| Fréchet Audio Distance (FAD) | LAION-CLAP embeddings | Statistical similarity to ground-truth audio |
| CLAP-score | Cosine similarity of CLAP embeddings | Semantic fidelity |
| Fréchet Audio-Visual Distance (FAVD) | Encoded audio-visual embeddings | Audio-visual synchronization |
FoleyGRAM outperforms prior baselines across all metrics, and ablation studies confirm that unified AVT (audio-video-text) conditioning yields superior results over mono- or bimodal settings. The ControlNet-based temporal envelope contributes measurably to synchronization accuracy by controlling dynamic aspects such as onset and loudness (Gramaccioni et al., 7 Oct 2025).
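Both FAD and FAVD reduce to a Fréchet distance between Gaussian statistics fitted to two sets of embeddings (generated vs. reference). A compact sketch of that core computation is shown below; the upstream embedding extraction (CLAP for FAD, audio-visual encoders for FAVD) is assumed to have happened already, and the function name is hypothetical.

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_gen: np.ndarray, emb_ref: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (N, d),
    as used in FAD/FAVD-style metrics."""
    mu_g, mu_r = emb_gen.mean(axis=0), emb_ref.mean(axis=0)
    cov_g = np.cov(emb_gen, rowvar=False)
    cov_r = np.cov(emb_ref, rowvar=False)
    # Matrix square root of the covariance product
    covmean, _ = linalg.sqrtm(cov_g @ cov_r, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```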
4. Associated Methodologies and Comparative Approaches
The domain landscape includes complementary and competing methodologies. Latent diffusion models with CLAP-based text–audio conditioning (Yuan et al., 2023), embedding tuning layers for semantic specificity, and more recent Latent CLAP Loss approaches (Karchkhadze et al., 18 Mar 2024), which directly align diffusion outputs to CLAP's perceptual space and eliminate inference-time post-filtering, demonstrate significant improvements in Fréchet Audio Distance (e.g., reductions of more than 1.6 FAD points). SpecMaskFoley leverages ControlNet to inject video timing cues, with a frequency-aware temporal feature aligner bridging deep video features into a time–frequency latent space, achieving rapid inference and strong temporal–semantic alignment on VGGSound benchmarks (Zhong et al., 22 May 2025). FoleyGRAM advances these paradigms by integrating unified multimodal GRAM alignment and control mechanisms, empirically surpassing both ControlNet-based and from-scratch models in semantic and temporal performance.
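For intuition, a loss in the spirit of the Latent CLAP alignment idea can be sketched as a cosine-similarity term between a projection of the diffusion output and the CLAP embedding of the target audio. This is purely illustrative; the actual formulation in Karchkhadze et al. (2024) may differ, and the projection head mapping latents into CLAP space is an assumed component.

```python
import torch
import torch.nn.functional as F

def latent_clap_alignment_loss(pred_in_clap_space: torch.Tensor,
                               clap_target_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative alignment term: pull the (projected) diffusion output
    toward the CLAP embedding of the reference audio. Both: (batch, d)."""
    return (1.0 - F.cosine_similarity(pred_in_clap_space, clap_target_emb, dim=-1)).mean()
```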
5. Dataset Design, Evaluation Protocols, and Community Initiatives
Standardization in dataset design and evaluation is critical. The Foley Synthesis Challenge (Choi et al., 2022) defines a progressive multi-level framework: categorical generation, synthesis from sequential text descriptions, direct video-to-audio generation, and ultimately multi-channel (e.g., 5.1) output. Evaluations include objective metrics such as Fréchet Inception Distance (FID), Inception Score (IS), and memorization-aware FID, as well as subjective mean opinion scores (MOS) for fidelity, artifact presence, and relevance. High-quality, professionally curated datasets (e.g., Gaudio Lab in-house recordings) are specified, with minimum category sizes for robust training. This initiative benchmarks progress and enables reproducible advances in generative Foley synthesis.
6. Practical Applications and Prospective Directions
FoleyGRAM’s unified approach supports practical deployment in film, gaming, broadcast, and multimedia art by automating the generation of context-relevant, synchronized sound effects. The system can be adapted for accessibility, enabling dynamic audio descriptions tightly coupled to visual content. Efficiency gains from integrated loss-based semantic alignment (as in Latent CLAP Loss) and removal of inference post-filtering suggest applicability in real-time and scalable pipelines (Karchkhadze et al., 18 Mar 2024). Extensions may include additional modalities, finer-grained human-control over semantic or temporal aspects, and computational optimization for broader accessibility.
7. Implications for Multimodal Generative Research
FoleyGRAM establishes a precedent for multimodal alignment strategies in generative models—advancing not only Foley synthesis but also broader text–video–audio creation tasks. The GRAM framework’s alignment of heterogeneous modalities in a shared latent space enables robust semantic conditioning that can be generalized to other synthesis domains, such as text-based music and ambient sound generation. Future explorations may focus on the transferability of GRAM alignment, the integration of novel control signals, and the development of scalable, human-interpretable multimodal interaction paradigms.
A plausible implication is that FoleyGRAM’s diffusion-based multimodal alignment and temporal control mechanisms will inform best practices and architectural benchmarks across a spectrum of video-driven generative tasks, fostering reproducibility and innovation in audiovisual synthesis research.