OmniAudio: Spatial Audio from 360° Video
- OmniAudio is a framework that synthesizes First-Order Ambisonics audio from 360° videos by leveraging dual-branch video encoding and spatial VAE components.
- It utilizes the large-scale Sphere360 dataset and a self-supervised flow-matching strategy to significantly enhance spatial realism and audio fidelity.
- The approach is applicable in VR, immersive film, gaming, and telepresence, though challenges remain in handling complex acoustic environments.
OmniAudio is a framework for generating spatial First-Order Ambisonics (FOA) audio from 360-degree (panoramic) video, devised to advance the spatial realism and directional accuracy of automatically generated audio for immersive audiovisual environments. Centered around the 360V2SA (360-Degree-Video-to-Spatial-Audio) task, OmniAudio introduces both a large-scale paired dataset (Sphere360) and a dual-branch, self-supervised model architecture that collectively achieve state-of-the-art performance on spatial audio synthesis benchmarks.
1. Task Definition: 360V2SA and FOA Representation
The 360V2SA task formalizes the problem of synthesizing spatially accurate audio from omnidirectional video:
- Input: A panoramic 360° video sequence, denoted $V$.
- Output: A multi-channel first-order ambisonics (FOA) audio signal $a = (W, X, Y, Z)$.

FOA audio encodes 3D directionality via first-order real spherical harmonics:

$$W(t) = \tfrac{1}{\sqrt{2}}\, s(t), \quad X(t) = \cos\phi \cos\theta\, s(t), \quad Y(t) = \cos\phi \sin\theta\, s(t), \quad Z(t) = \sin\phi\, s(t),$$

where $s(t)$ is the source signal, $\phi$ is elevation, and $\theta$ is azimuth. Using this encoding, FOA supplies per-sample full-sphere source localization, supporting downstream interactive spatial rendering (e.g., with head-tracked VR systems).
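As a concrete illustration of this encoding, the following NumPy sketch pans a mono source to a given direction on the sphere. The function name is illustrative, and the $1/\sqrt{2}$ normalization of the omnidirectional channel follows the convention written above; this is not the paper's code.

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Encode a mono source into 4-channel FOA (W, X, Y, Z).

    First-order real spherical harmonics with a 1/sqrt(2) normalization
    on the omnidirectional channel; angles are in radians.
    """
    w = mono / np.sqrt(2.0)                          # omnidirectional component
    x = mono * np.cos(elevation) * np.cos(azimuth)   # front-back component
    y = mono * np.cos(elevation) * np.sin(azimuth)   # left-right component
    z = mono * np.sin(elevation)                     # up-down component
    return np.stack([w, x, y, z], axis=0)            # shape (4, T)

# Example: a 1 kHz tone arriving from 90 degrees to the left, at ear height.
sr = 16_000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth=np.pi / 2, elevation=0.0)
print(foa.shape)  # (4, 16000)
```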
2. Sphere360 Dataset: Construction Methodology and Content
The Sphere360 dataset is specifically constructed for 360V2SA:
- Scale and Content: 103,000 video–audio pairs of 10 seconds each, totaling approximately 288 hours and spanning 288 distinct semantic event classes.
- Semi-Automated Collection Pipeline (a code sketch of two of these filters follows this subsection):
  - Frame Stationarity: Clips in which over 85% of frames are near-identical, as measured by MSE, are removed.
  - Silent Audio Detection: Segment-based loudness analysis discards clips with more than 90% of segments below –35 dBFS.
  - Speech-Heavy Clips: A speech detector (SenseVoice) excludes clips with more than 5 detected spoken words.
  - Audio–Video Mismatch: Clips whose cross-modal similarity (ImageBind embeddings) falls below a threshold are excluded.
  - Manual Inspection: Ensures absence of artifacts and spurious pairs.
Following the above, from an initial 166,500 candidates, 103,000 high-quality pairs remain in the final dataset.
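A minimal NumPy sketch of two of the automatic filters described above (frame stationarity and silence detection). The helper names, the per-frame MSE threshold, and the one-second segment length are illustrative assumptions; the 85%, 90%, and –35 dBFS values come from the pipeline description.

```python
import numpy as np

def is_stationary(frames: np.ndarray, mse_thresh: float = 1e-3, ratio: float = 0.85) -> bool:
    """Flag clips whose consecutive frames are near-identical for more than
    `ratio` of the clip. `frames` is a float array in [0, 1] with shape
    (N, H, W, C); `mse_thresh` is a placeholder value, not from the paper."""
    diffs = np.mean((frames[1:] - frames[:-1]) ** 2, axis=(1, 2, 3))
    return bool(np.mean(diffs < mse_thresh) > ratio)

def is_mostly_silent(audio: np.ndarray, sr: int, seg_sec: float = 1.0,
                     dbfs_thresh: float = -35.0, ratio: float = 0.9) -> bool:
    """Flag clips in which more than `ratio` of fixed-length segments fall
    below `dbfs_thresh` dBFS (RMS level relative to full scale = 1.0)."""
    seg_len = int(seg_sec * sr)
    n_segs = len(audio) // seg_len
    segs = audio[: n_segs * seg_len].reshape(n_segs, seg_len)
    rms = np.sqrt(np.mean(segs ** 2, axis=1) + 1e-12)
    dbfs = 20 * np.log10(rms + 1e-12)
    return bool(np.mean(dbfs < dbfs_thresh) > ratio)
```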
3. OmniAudio Model Architecture
OmniAudio encompasses three principal computational components, each contributing to spatially and semantically coherent audio generation.
A. Dual-Branch Video Encoding
- Global (Panoramic) Branch: Processes the raw equirectangular (ERP) panorama into temporal feature vectors using a MetaCLIP-Huge backbone.
- Local (FoV) Branch: Converts central perspective (field-of-view) crops of the panorama into parallel feature vectors.
- Fusion: Both representations are integrated via cross-attention within a Diffusion Transformer (DiT), preserving complementary global and local, perspective-invariant spatial cues (a simplified fusion sketch follows below).
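The exact conditioning interface inside the DiT is not reproduced here; the following is a minimal PyTorch sketch of cross-attention fusion between the two branches, assuming same-dimensional per-frame features from each branch. The `DualBranchFusion` module and its layer sizes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Fuse panoramic (ERP) and field-of-view (FoV) video features with
    cross-attention. A simplified stand-in for the DiT conditioning path."""

    def __init__(self, dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        # f_global, f_local: (batch, frames, dim) features from the two branches.
        # Global features query the local (FoV) features, so fine-grained
        # perspective cues are injected into the panoramic context.
        fused, _ = self.attn(query=f_global, key=f_local, value=f_local)
        return self.norm(f_global + fused)

fusion = DualBranchFusion()
f_g = torch.randn(2, 32, 768)   # per-frame features from the panoramic branch
f_l = torch.randn(2, 32, 768)   # per-frame features from the FoV branch
print(fusion(f_g, f_l).shape)   # torch.Size([2, 32, 768])
```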
B. FOA Audio Variational Autoencoder (Spatial VAE)
- Encoder: Maps the 4-channel FOA waveform $a \in \mathbb{R}^{4 \times T}$ to a latent sequence $z$.
- Decoder: Reconstructs the FOA waveform $\hat{a} = \mathcal{D}(z)$ from $z$.
- Objective:

$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}}(\hat{a}, a) + \beta\, D_{\mathrm{KL}}\!\left(q(z \mid a)\,\|\,p(z)\right),$$

enforcing accurate multi-channel reconstruction and a regularized latent structure.
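A minimal training-step sketch of this objective, assuming hypothetical `encoder`/`decoder` networks that return latent statistics and a reconstructed waveform. The L1 reconstruction term and the KL weight `beta` are illustrative stand-ins for the paper's actual reconstruction losses.

```python
import torch
import torch.nn.functional as F

def foa_vae_loss(waveform, encoder, decoder, beta: float = 1e-4):
    """VAE objective on (batch, 4, T) FOA waveforms: multi-channel
    reconstruction plus a KL term toward a standard normal prior.
    `encoder`/`decoder` are placeholders for the spatial VAE networks."""
    mu, logvar = encoder(waveform)                              # latent statistics
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization
    recon = decoder(z)                                          # reconstructed FOA waveform
    rec_loss = F.l1_loss(recon, waveform)                       # multi-channel reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_loss + beta * kl
```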
C. Self-Supervised Flow-Matching Pre-training
- Latent Path Interpolation: Between a noisy prior sample $z_0 \sim \mathcal{N}(0, I)$ and the ground-truth latent $z_1$,

$$z_t = (1 - t)\, z_0 + t\, z_1, \qquad t \in [0, 1],$$

with target velocity $v = z_1 - z_0$.
- Flow-Matching Loss:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\, z_0,\, z_1}\!\left[\big\| u_\theta(z_t, t, c) - (z_1 - z_0) \big\|^2\right],$$

applied conditionally on the video context $c$ (a training-step sketch follows this list).
- Coarse-to-Fine Curriculum:
- Coarse: Train by masking random spans in latents of large-scale non-spatial audio, capturing global audio priors.
- Fine: Specialize on unmasked FOA latents, refining spatial cue sensitivity.
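A minimal sketch of one flow-matching training step under the interpolation above. Here `model` stands in for the DiT velocity predictor $u_\theta$, `cond` for the fused video context, and the optional `mask` gestures at the span masking used in the coarse stage; all names are illustrative.

```python
import torch

def flow_matching_loss(model, z1, cond, mask=None):
    """Interpolate between Gaussian noise z0 and the ground-truth latent z1,
    then regress the velocity (z1 - z0). `z1` has shape (batch, channels, frames)."""
    z0 = torch.randn_like(z1)                            # noisy prior sample
    t = torch.rand(z1.size(0), 1, 1, device=z1.device)   # per-sample time in [0, 1]
    zt = (1 - t) * z0 + t * z1                           # straight-line interpolation
    target_v = z1 - z0                                   # target velocity field
    pred_v = model(zt, t.squeeze(), cond)                # predicted velocity
    loss = (pred_v - target_v) ** 2
    if mask is not None:                                 # coarse stage: score only masked spans
        loss = loss * mask
    return loss.mean()
```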
D. Spatial-Aware Supervised Fine-Tuning and Inference
- Supervised Conditioning: The flow-matching loss is conditioned on the fused features of both video branches and applied to the FOA audio latent $z$.
- Inference: The conditional ODE

$$\frac{dz_t}{dt} = u_\theta(z_t, t, c)$$

is solved from noise ($t = 0$) to the data end point ($t = 1$), with the final waveform reconstructed by the VAE decoder, $\hat{a} = \mathcal{D}(z_1)$.
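A minimal sampling sketch using a fixed-step Euler solver for this ODE; the actual system may use a different solver and step schedule, and `model`, `decoder`, and the step count are assumptions.

```python
import torch

@torch.no_grad()
def generate_foa(model, decoder, cond, latent_shape, n_steps: int = 25):
    """Integrate dz/dt = u_theta(z, t, cond) from noise (t=0) to data (t=1)
    with Euler steps, then decode the latent to a 4-channel FOA waveform.
    `model` and `decoder` are placeholders for the trained DiT and VAE decoder."""
    z = torch.randn(latent_shape)                  # start from the noisy prior
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((latent_shape[0],), i * dt)
        z = z + dt * model(z, t, cond)             # Euler update along the flow
    return decoder(z)                              # (batch, 4, samples) FOA waveform
```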
4. Empirical Evaluation: Metrics and Results
A. Evaluation Metrics
- Non-Spatial Quality: Fréchet Distance (FD) over OpenL3 embeddings and KL divergence on AudioSet-derived tags.
- Spatial Accuracy: Spatial indices are computed from the generated and reference FOA channels, and angular error measures are derived from them; the benchmarks below report the overall angular error $\Delta_{\text{ang}}$ (a sketch of one such computation follows this list).
- Subjective Evaluation: MOS-SQ (subjective quality) and MOS-AF (audio fidelity), both reported on a mean opinion score scale with 95% confidence intervals.
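The paper's exact spatial indices are not reproduced above. As one plausible instance, the sketch below estimates a dominant direction of arrival from the FOA pseudo-intensity vector and measures the angle between generated and reference directions; the helper names and the use of a single time-averaged direction are assumptions.

```python
import numpy as np

def doa_from_foa(foa: np.ndarray) -> np.ndarray:
    """Estimate a dominant direction of arrival from a (4, T) FOA signal via
    the time-averaged pseudo-intensity vector W * [X, Y, Z]."""
    w, xyz = foa[0], foa[1:4]
    intensity = (w * xyz).mean(axis=1)                  # average pseudo-intensity
    return intensity / (np.linalg.norm(intensity) + 1e-12)

def angular_error(foa_gen: np.ndarray, foa_ref: np.ndarray) -> float:
    """Angle (radians) between the estimated directions of the generated
    and reference FOA clips."""
    d_gen, d_ref = doa_from_foa(foa_gen), doa_from_foa(foa_ref)
    cos = np.clip(np.dot(d_gen, d_ref), -1.0, 1.0)
    return float(np.arccos(cos))
```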
B. Performance Benchmarks
| Model | FD ↓ | KL ↓ | Δ_ang ↓ | MOS-SQ ↑ | MOS-AF ↑ |
|---|---|---|---|---|---|
| Diff-Foley + AS | 331.1 | 3.56 | – | 69.9±0.8 | 71.1±1.4 |
| MMAudio + AS | 271.2 | 2.39 | – | 75.3±1.0 | 77.6±1.2 |
| ViSAGe (FOV) | 210.9 | 2.90 | 1.49 | 73.5±1.4 | 74.9±1.7 |
| ViSAGe (360) | 219.7 | 2.96 | 1.51 | 74.1±1.2 | 75.3±1.0 |
| OmniAudio (ours) | 88.3 | 1.58 | 1.28 | 84.7±1.1 | 87.2±1.0 |
OmniAudio achieves substantial improvements across Fréchet Distance, KL divergence, angular spatial error, and both MOS metrics, demonstrating both enhanced spatial plausibility and overall audio quality on Sphere360-Bench.
C. Ablation Analyses
| Variant | FD ↓ | KL ↓ | Δ_ang ↓ |
|---|---|---|---|
| no pre-training | 104.6 | 1.83 | 1.32 |
| coarse only | 97.3 | 1.78 | 1.30 |
| fine only | 97.6 | 1.82 | 1.28 |
| coarse-to-fine | 88.3 | 1.58 | 1.28 |
The full coarse-to-fine pre-training yields the most consistent gains. For the video encoder ablation:
- FoV-only: FD=88.8, KL=1.87, Δ_ang=1.33
- ERP-only: FD=97.8, KL=1.87, Δ_ang=1.28
- Combined (dual-branch): FD=88.3, KL=1.58, Δ_ang=1.28
This suggests that combining panoramic (ERP) and perspective (FoV) cues enhances spatial inference.
5. Applications and Limitations
Applications
- Virtual Reality / 360° Video Playback: Enables dynamic, spatially accurate sound fields mapped to immersive visuals, supporting interactive head-tracking.
- Immersive Film and Gaming: Simplifies production by automating the creation of spatial soundtracks, reducing the need for manual FOA mixing.
- Telepresence and Remote Collaboration: Provides realistic 3D sound reproduction from wide-angle acquisition, improving user experience and source localization in remote interactions.
Limitations
- Complex Acoustic Environments: In scenes with numerous simultaneous sources, the model may be less effective at disambiguating spatial cues.
- Dataset Diversity: Current Sphere360 coverage (288 classes, ~100k clips) provides diverse but not exhaustive real-world variation. Generalization to novel or rare environments may be constrained.
6. Prospects and Research Directions
Future directions center on extending the utility and expressiveness of the OmniAudio framework:
- Dataset Expansion: Ongoing semi-automated acquisition aims to augment coverage and semantic variation within Sphere360.
- Higher-Order Spatial Encoding: Adapting models for higher-order ambisonics would enable finer angular resolution in audio generation.
- Explicit Multisource Disentanglement: Investigating object-sound correspondence could allow for source separation, better handling of dense acoustic scenes, and facilitating downstream applications in audio-visual scene analysis.
A plausible implication is that incorporating explicit visual-audio correspondence modeling and further scaling up both data and model capacity may broaden applicability across more challenging real-world 360° scenes.