DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos

Published 3 Apr 2026 in cs.SD | (2604.02781v1)

Abstract: Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.


Authors (5)

Summary

The paper introduces a conditional diffusion framework that fuses 3D scene reconstruction with FOA synthesis for dynamic, acoustically complex 360° videos.
It employs multi-modal supervision and physics-informed conditioning on geometric, depth, and material priors to reduce DOA errors and spectral artifacts.
The approach is validated on the new M2G-360 dataset, with reported improvements of 33.3% to 46.7% in spatial and acoustic metrics on its subsets compared to prior methods.

DynFOA: Conditional Diffusion-Based FOA Generation Leveraging Dynamic 3D Scene Understanding in 360-Degree Videos

Introduction

First-order ambisonics (FOA) spatial audio is essential for fully immersive 360-degree video and VR experiences, yet most 360-degree content is distributed with mono or stereo soundtracks because capturing authentic spatial audio is technically difficult. Synthesizing FOA from 360-degree video is a long-standing challenge, primarily because of the gap between what is visible in a scene and how sound actually propagates through it. DynFOA addresses this gap with a generative framework that combines detailed 3D scene reconstruction and physics-informed conditioning in a diffusion model for FOA synthesis, and it reports consistent gains over existing methods under complex acoustic conditions (2604.02781).

Architecture and Methodology

DynFOA realizes scene-aware FOA generation through a three-stage architecture:

Video Encoder: Extracts a rich set of geometric and material priors from monocular 360-degree video. It localizes sound-emitting objects, estimates dense per-pixel depth, executes semantic segmentation to classify acoustic surfaces, and reconstructs an explicit 3D scene using 3D Gaussian Splatting (3DGS). This pipeline yields explicit features relevant to acoustic interactions, including occlusion maps, reflection paths, and frequency-dependent reverberation times.
FOA Latent Encoder: During training, this module encodes ground-truth FOA channels into compact, geometry- and material-aware latent representations, extracting spectral, directional, and propagation cues via CNNs, harmonic transforms, and attention mechanisms fused with visual saliency maps.
Conditional Diffusion Generator: The core synthesizer employs a multi-condition encoder and cross-modal fusion to condition a U-Net denoiser on the complex set of geometric, material, and dynamic cues. At inference, the FOA Latent Encoder is dropped; diffusion operates purely on video-conditioned features to generate high-fidelity FOA signals, decoded by a pretrained VAE.
Figure 1: Architecture of DynFOA highlighting the flow from 360-degree video input through scene analysis and conditional diffusion generation, culminating in FOA output.
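
To make the role of the conditioning signal in the diffusion generator concrete, the following is a minimal sketch of conditional ancestral sampling in the standard DDPM formulation. It is not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the conditioned U-Net, the linear beta schedule and the latent shape are assumptions, and the VAE decoding step is omitted.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with its cumulative alpha products (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, cond):
    """Hypothetical stand-in for the conditioned U-Net noise predictor.

    A real model would be a trained network; this only mixes the latent
    with the conditioning vector so the example runs end to end.
    """
    return 0.1 * x_t + 0.01 * cond.mean()

def reverse_step(x_t, t, cond, betas, alphas, alpha_bars, rng):
    """One DDPM ancestral sampling step from x_t to x_{t-1}."""
    eps_hat = toy_denoiser(x_t, t, cond)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * noise

def sample(cond, latent_shape=(4, 16), num_steps=50, seed=0):
    """Start from Gaussian noise and denoise under a fixed condition."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(latent_shape)
    for t in reversed(range(num_steps)):
        x = reverse_step(x, t, cond, betas, alphas, alpha_bars, rng)
    return x

# Usage: an 8-dimensional placeholder for the video-derived features.
latent = sample(np.ones(8))
```

In this sketch the condition enters only through the noise predictor, which is the usual way a conditioning signal steers the denoising trajectory; how DynFOA fuses its geometric, material, and dynamic cues inside the U-Net is not specified at this level of detail in the summary.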

The synthesis process is strictly physics-informed, explicitly constraining the denoising trajectory: material-dependent absorption, geometry-driven occlusion, and variable reverberation times are injected as conditioning signals into the generative process. The approach is designed for dynamic, multi-source, and highly reverberant scenes, with efficient head tracking and HRTF-based binaural rendering for practical VR deployment.
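
As an illustration of how material-dependent absorption can translate into a frequency-dependent reverberation time, the sketch below applies Sabine's formula, RT60 = 0.161 V / A, per frequency band. This is a textbook approximation chosen for illustration; the summary does not state which acoustic model DynFOA uses. The band centers and the absorption coefficients in `ABSORPTION` are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Band centers in Hz (assumed for illustration).
BANDS_HZ = (125, 500, 2000)

# Illustrative absorption coefficients per band; NOT taken from the paper.
ABSORPTION = {
    "concrete": (0.01, 0.02, 0.02),
    "carpet": (0.05, 0.25, 0.60),
    "curtain": (0.10, 0.50, 0.70),
}

def sabine_rt60(volume_m3, surfaces):
    """Estimate RT60 in seconds per band with Sabine's formula.

    surfaces is a list of (material_name, area_m2) pairs. The total
    absorption per band is the area-weighted sum of the coefficients.
    """
    total_absorption = np.zeros(len(BANDS_HZ))
    for material, area_m2 in surfaces:
        total_absorption += area_m2 * np.asarray(ABSORPTION[material])
    return 0.161 * volume_m3 / total_absorption

# Usage: a 5 m x 4 m x 3 m room with a carpeted floor and concrete elsewhere.
room_surfaces = [("carpet", 20.0), ("concrete", 74.0)]
rt60_per_band = sabine_rt60(60.0, room_surfaces)
```

Because the carpet absorbs far more at high frequencies than at low ones, the estimated reverberation time is longer in the 125 Hz band than in the 2 kHz band, which is the kind of frequency dependence a material prior is meant to capture.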

Dataset Contribution: M2G-360

Existing benchmarks such as Sphere360 and YT-360 are either acoustically simplistic or dominated by static source placement, failing to probe the limits of FOA generation under realistic scene complexity. The authors construct M2G-360, a curated dataset of 600 360-degree video clips with matched 4-channel FOA tracks, systematically partitioned into subsets targeting dynamic occlusion ("MoveSources"), overlapping sources ("Multi-Source"), and diverse geometry/material configurations ("Geometry"). M2G-360 enables rigorous, multidimensional evaluation of spatial robustness, frequency-dependent reverberation, and disentangling of colocated sources in FOA synthesis.
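
For readers unfamiliar with what a 4-channel FOA track contains, here is a minimal sketch that encodes a mono source at a given direction into the $W$, $X$, $Y$, $Z$ channels. It assumes SN3D-style gains with the unscaled signal in $W$, azimuth measured counterclockwise from the front, and elevation measured upward from the horizontal plane; the channel order and normalization actually used in M2G-360 are not stated in this summary.

```python
import numpy as np

def encode_foa(mono, azimuth_rad, elevation_rad):
    """Encode a mono signal as FOA with channels ordered (W, X, Y, Z).

    Assumed convention: X points to the front, Y to the left, Z up.
    """
    mono = np.asarray(mono, dtype=float)
    w = mono
    x = mono * np.cos(azimuth_rad) * np.cos(elevation_rad)
    y = mono * np.sin(azimuth_rad) * np.cos(elevation_rad)
    z = mono * np.sin(elevation_rad)
    return np.stack([w, x, y, z])

# Usage: a 440 Hz tone placed directly to the listener's left.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 440.0 * t)
foa_left = encode_foa(tone, azimuth_rad=np.pi / 2, elevation_rad=0.0)
```

With the source on the left, the signal appears in $W$ and $Y$ while $X$ and $Z$ are numerically zero, which shows how direction is carried by the relative gains of the three directional channels.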

Experimental Results

DynFOA is systematically benchmarked against leading approaches, including OmniAudio, ViSAGe, Diff-SAGe, and MMAudio+SP. Evaluations span both conventional and newly proposed datasets, with a broad set of quantitative metrics:

Spatial Accuracy: Angular error of the estimated direction of arrival (DOA).
Acoustic Fidelity: Signal-to-Noise Ratio (SNR) and Early Decay Time (EDT) for reverberation analysis.
Distribution Matching: Fréchet Distance (FD), Kullback-Leibler (KL) divergence, Short-Time Fourier Transform (STFT) error, and Scale-Invariant Signal-to-Distortion Ratio (SI-SDR).
Human Perception: Mean Opinion Scores (MOS) for spatial quality (MOS-SQ) and audiovisual alignment (MOS-AF), rated in head-tracked VR setups.
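
The spatial accuracy metric can be illustrated with a common way of estimating DOA from FOA: the time-averaged product of the omnidirectional channel with each directional channel points toward the dominant source. The sketch below is one plausible reading of an angular DOA error, not the paper's evaluation code; it assumes the (W, X, Y, Z) channel order with SN3D-style gains and a single dominant source.

```python
import numpy as np

def direction_vector(azimuth_rad, elevation_rad):
    """Unit vector for a direction (X front, Y left, Z up)."""
    return np.array([
        np.cos(azimuth_rad) * np.cos(elevation_rad),
        np.sin(azimuth_rad) * np.cos(elevation_rad),
        np.sin(elevation_rad),
    ])

def estimate_doa(foa):
    """Estimate a unit DOA vector from FOA of shape (4, num_samples)."""
    w, x, y, z = np.asarray(foa, dtype=float)
    intensity = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    norm = np.linalg.norm(intensity)
    if norm == 0.0:
        raise ValueError("silent or fully diffuse input: DOA is undefined")
    return intensity / norm

def angular_error_deg(u, v):
    """Angle in degrees between two direction vectors."""
    u = np.asarray(u, dtype=float) / np.linalg.norm(u)
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    cosine = np.clip(np.dot(u, v), -1.0, 1.0)
    return np.degrees(np.arccos(cosine))

# Usage: a synthetic source at 45 degrees azimuth, 10 degrees elevation.
true_dir = direction_vector(np.radians(45.0), np.radians(10.0))
signal = np.sin(2.0 * np.pi * 5.0 * np.arange(1000) / 1000.0)
foa = np.stack([signal, *(signal * g for g in true_dir)])
error_deg = angular_error_deg(estimate_doa(foa), true_dir)
```

For this clean single-source case the estimated direction coincides with the true one up to floating-point precision; with overlapping sources or strong reverberation the intensity vector is pulled away from any single source, which is why DOA error is a useful stress test.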

On Sphere360, DynFOA reduces DOA error by 26.3% and EDT by 33.3% relative to OmniAudio. KL divergence and STFT error improve by over 32%, indicating closer agreement with the ground-truth distribution and fewer perceptual artifacts. MOS reaches 4.35±0.22 for spatial quality (MOS-SQ) and 4.12±0.25 for audiovisual alignment (MOS-AF), consistently outperforming the baselines.
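
Two of the signal-level metrics mentioned above can be sketched in a few lines. The following shows one common definition of an STFT magnitude error (mean absolute difference of magnitude spectrograms) and the usual SI-SDR formula. The frame length, hop size, window, and the exact error definition are assumptions; the paper's settings are not given in this summary.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window.

    The signal must contain at least frame_len samples.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

def stft_error(reference, estimate):
    """Mean absolute difference between magnitude spectrograms."""
    return float(np.mean(np.abs(stft_magnitude(reference) - stft_magnitude(estimate))))

def si_sdr_db(reference, estimate, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))

# Usage on one channel: a clean tone versus two noisy copies.
rng = np.random.default_rng(0)
clean = np.sin(2.0 * np.pi * 440.0 * np.arange(4096) / 16000.0)
slightly_noisy = clean + 0.01 * rng.standard_normal(clean.shape)
very_noisy = clean + 0.5 * rng.standard_normal(clean.shape)
```

For 4-channel FOA these functions would typically be applied per channel and averaged. Lower STFT error and higher SI-SDR both indicate a closer match to the reference.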

Notably, across M2G-360's most challenging subsets, DynFOA posts:

46.7% lower DOA error in MoveSources, maintaining spatial coherence under heavy occlusion and source movement.
33.3% reduction in STFT error and FD in Multi-Source, demonstrating robust disentangling of overlapping energy fields.
40.0% lower EDT in Geometry, validating precise modeling of long-tail reverberation and material effects.
Figure 2: Mel-spectrogram visualizations for FOA channels ($W$, $X$, $Y$, $Z$) in a highly reverberant piano scene: only DynFOA recovers high-frequency energy and stable spatial correlation, closely tracking ground truth.
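
Since EDT is the metric used for the reverberation results, a short sketch of how it is commonly obtained may help. The code below computes the Schroeder energy decay curve by backward integration of a squared impulse response and takes EDT as six times the time needed to decay by 10 dB. This is a simplified variant that uses the first -10 dB crossing rather than a line fit over the 0 to -10 dB range, and it assumes an impulse response is available; how the paper extracts EDT from generated audio is not described in this summary.

```python
import numpy as np

def schroeder_decay_db(impulse_response):
    """Energy decay curve in dB via backward integration (Schroeder)."""
    energy = np.asarray(impulse_response, dtype=float) ** 2
    remaining = np.cumsum(energy[::-1])[::-1]
    # Floor the ratio to avoid log of zero at the very end of the tail.
    return 10.0 * np.log10(np.maximum(remaining / remaining[0], 1e-12))

def early_decay_time(impulse_response, sample_rate):
    """EDT in seconds: six times the time to the first -10 dB crossing."""
    decay_db = schroeder_decay_db(impulse_response)
    below = np.nonzero(decay_db <= -10.0)[0]
    if below.size == 0:
        raise ValueError("decay curve never reaches -10 dB")
    return 6.0 * below[0] / sample_rate

def synthetic_impulse_response(rt60_s, sample_rate, duration_s=2.0):
    """Idealized exponential decay whose energy drops 60 dB in rt60_s."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return 10.0 ** (-3.0 * t / rt60_s)

# Usage: for an ideal exponential decay, EDT is close to RT60.
ir = synthetic_impulse_response(rt60_s=0.6, sample_rate=8000)
edt_s = early_decay_time(ir, sample_rate=8000)
```

For a purely exponential decay EDT and RT60 agree, while in real rooms EDT is dominated by early reflections, which makes it sensitive to geometry and material effects.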

The ablation studies show that adding geometric, depth, and material priors step by step moves all core metrics closer to the upper performance bound, whereas audio-only or purely visual baselines plateau at significantly lower accuracy.

Key Claims and Analysis

Explicit structural and material conditioning enables physically plausible FOA generation in highly dynamic scenes, in contrast to prevailing models that rely purely on global 2D visual features and static context assumptions.
Conditional diffusion, grounded in 3DGS reconstructions and frequency-dependent material priors, reduces the acoustic hallucination and spurious spatial drift common in unconstrained generative models.
M2G-360 provides an indispensable, rigorously filtered benchmark exposing FOA models to previously unexplored extremes of real-world acoustic complexity.

The empirical advantage is largest in scenes with heavy occlusion, dynamic source motion, and prominent late reverberation, where simpler models fail to maintain either localization or physical coherence.

Theoretical and Practical Implications

The methodological advances in DynFOA highlight a decisive shift from generic cross-modal alignment towards physics-informed, structure-driven generation of spatial audio. This reorientation directly addresses a longstanding challenge: bridging the multimodal gap between video-centric object localization and variable, multi-path acoustic propagation. Integrating 3D Gaussian Splatting, material lookup tables, explicit path analysis for occlusion/reflection, and conditional latent diffusion sets a new technical precedent for the field.

Practically, the pipeline paves the way for high-fidelity FOA synthesis from consumer-grade 360-degree video. DynFOA's capacity for real-time, head-tracked binaural rendering with accurate handling of reverberation and source movement suits not only VR/AR but also cinematic content creation, robotics, and interactive media requiring spatially coherent soundscapes.
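
Head tracking with FOA is typically implemented by rotating the sound field before binaural decoding. The sketch below shows the yaw case under the assumed (W, X, Y, Z) channel order with X to the front and Y to the left; pitch and roll, and the HRTF-based decoding itself, are omitted. It illustrates the general technique and is not taken from the DynFOA pipeline.

```python
import numpy as np

def rotate_foa_yaw(foa, yaw_rad):
    """Rotate the sound field counterclockwise about the vertical axis.

    W and Z are invariant under yaw; X and Y mix like 2D coordinates.
    """
    w, x, y, z = np.asarray(foa, dtype=float)
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.stack([w, c * x - s * y, s * x + c * y, z])

def compensate_head_yaw(foa, head_yaw_rad):
    """Counter-rotate the field so sources stay fixed in the world.

    If the listener turns left by head_yaw_rad, every source appears
    shifted to the right by the same angle relative to the head.
    """
    return rotate_foa_yaw(foa, -head_yaw_rad)

# Usage: a source at 30 degrees azimuth, listener turns 30 degrees left.
signal = np.ones(8)
az = np.radians(30.0)
foa = np.stack([signal, signal * np.cos(az), signal * np.sin(az), 0.0 * signal])
foa_head = compensate_head_yaw(foa, np.radians(30.0))
```

After compensation the source sits straight ahead of the listener, so all directional energy is in $X$. Because the rotation is a small matrix applied per sample, it is cheap enough to update at head-tracker rate.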

Outlook and Future Directions

Remaining limitations include the approximate estimation of material properties via semantic segmentation, which restricts the model's ability to capture nuanced, frequency-dependent acoustic behavior in complex architectural or outdoor scenes. Direct integration of acoustic sensor data, refinement of 3DGS with hybrid LiDAR/photogrammetry, and self-supervised estimation of material absorption spectra are promising future enhancements. Additionally, moving towards end-to-end differentiable pipelines may improve generalization across diverse real-world environments and lower the practical complexity of training and deployment.

Conclusion

DynFOA approaches FOA generation through dynamic 3D scene reconstruction and conditional latent diffusion, unifying vision-derived geometric priors with physically realistic audio generation. The numerical improvements across both established and newly constructed benchmarks, particularly in DOA, EDT, and MOS metrics, support the case that explicit acoustic conditioning is needed for robust FOA synthesis. M2G-360 provides a solid foundation for further advances in this domain, and the DynFOA approach offers a robust framework for the next generation of immersive spatial audio rendering systems.
