OmniAudio: Spatial Audio from 360° Video

Updated 15 November 2025
  • OmniAudio is a framework that synthesizes First-Order Ambisonics audio from 360° videos by leveraging dual-branch video encoding and spatial VAE components.
  • It utilizes the large-scale Sphere360 dataset and a self-supervised flow-matching strategy to significantly enhance spatial realism and audio fidelity.
  • The approach is applicable in VR, immersive film, gaming, and telepresence, though challenges remain in handling complex acoustic environments.

OmniAudio is a framework for generating spatial First-Order Ambisonics (FOA) audio from 360-degree (panoramic) video, devised to advance the spatial realism and directional accuracy of automatically generated audio for immersive audiovisual environments. Centered around the 360V2SA (360-Degree-Video-to-Spatial-Audio) task, OmniAudio introduces both a large-scale paired dataset (Sphere360) and a dual-branch, self-supervised model architecture that collectively achieve state-of-the-art performance on spatial audio synthesis benchmarks.

1. Task Definition: 360V2SA and FOA Representation

The 360V2SA task formalizes the problem of synthesizing spatially accurate audio from omnidirectional video:

  • Input: A panoramic 360° video sequence, denoted $V_{360}$.
  • Output: A multi-channel (FOA) audio signal $\mathbf{a}(t) = [W(t), X(t), Y(t), Z(t)]^\top$.

FOA audio encodes 3D directionality via first-order real spherical harmonics:

$$
\begin{aligned}
W(\theta,\phi) &\propto Y^0_0(\theta,\phi) = \tfrac{1}{2\sqrt{\pi}}, \\
X(\theta,\phi) &\propto Y^{-1}_1(\theta,\phi) = \sqrt{\tfrac{3}{4\pi}}\,\sin\theta\cos\phi, \\
Y(\theta,\phi) &\propto Y^{0}_1(\theta,\phi) = \sqrt{\tfrac{3}{4\pi}}\,\sin\theta\sin\phi, \\
Z(\theta,\phi) &\propto Y^{1}_1(\theta,\phi) = \sqrt{\tfrac{3}{4\pi}}\,\cos\theta.
\end{aligned}
$$

where $\theta$ is elevation and $\phi$ is azimuth. Using this encoding, FOA supplies per-sample full-sphere source localization, supporting downstream interactive spatial rendering (e.g., with head-tracked VR systems).
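As a concrete illustration, the Python sketch below encodes a mono signal into the four FOA channels using the proportionalities above. It is a minimal example rather than the paper's code; the angle convention and the exact normalization of the gains are assumptions.

```python
import numpy as np

def foa_gains(theta, phi):
    """First-order gains for a source at angles (theta, phi).

    Illustrative only: follows the W/X/Y/Z proportionalities listed above;
    the paper's exact normalization convention may differ.
    """
    w = 1.0 / (2.0 * np.sqrt(np.pi))
    c = np.sqrt(3.0 / (4.0 * np.pi))
    x = c * np.sin(theta) * np.cos(phi)
    y = c * np.sin(theta) * np.sin(phi)
    z = c * np.cos(theta)
    return np.array([w, x, y, z])

# Encode 1 s of a mono signal at 48 kHz from a fixed direction:
s = np.random.randn(48000)
a = foa_gains(np.pi / 3, np.pi / 4)[:, None] * s   # shape (4, samples)
```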

2. Sphere360 Dataset: Construction Methodology and Content

The Sphere360 dataset is specifically constructed for 360V2SA:

  • Scale and Content: 103,000 video-audio pairs, each 10 seconds long, totaling approximately 288 hours and covering 288 distinct semantic event classes.
  • Semi-Automated Collection Pipeline:
    • Frame Stationarity: Clips in which more than 85% of consecutive frames are near-identical, as measured by

      $$\text{MSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(I_t(i,j) - I_{t+1}(i,j)\right)^2,$$

      are removed.
    • Silent Audio Detection: Per-segment loudness is computed as $\text{dBFS} = 20\log_{10}(p/p_{\max})$; clips with more than 90% of segments below –35 dBFS are discarded.
    • Speech-Heavy Clips: A speech detector (SenseVoice) is applied; clips with more than 5 detected spoken words are excluded.
    • Audio–Video Mismatch: Clips whose ImageBind cross-modal similarity falls below 1 are excluded.
    • Manual Inspection: A final manual pass ensures the absence of artifacts and spurious pairs.

Applying these filters to an initial pool of 166,500 candidates yields 103,000 high-quality pairs in the final dataset.
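For concreteness, the sketch below implements two of the automated filters (frame stationarity by MSE, and silence by per-segment dBFS) in Python. The segment length, the MSE near-identity threshold, and the use of segment RMS relative to the clip peak for $p/p_{\max}$ are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def frame_stationarity_ratio(frames, mse_thresh=1e-3):
    """Fraction of consecutive-frame pairs that are near-identical by MSE.

    frames: float array of shape (T, H, W); mse_thresh is a hypothetical
    stand-in for the paper's near-identity criterion.
    """
    mse = np.mean((frames[1:] - frames[:-1]) ** 2, axis=(1, 2))
    return np.mean(mse < mse_thresh)

def silent_ratio(audio, frame_len=4800, floor_dbfs=-35.0):
    """Fraction of fixed-length segments whose dBFS falls below the floor."""
    peak = np.max(np.abs(audio)) + 1e-12              # reference level p_max
    segs = audio[: len(audio) // frame_len * frame_len].reshape(-1, frame_len)
    rms = np.sqrt(np.mean(segs ** 2, axis=1)) + 1e-12  # segment level p
    dbfs = 20.0 * np.log10(rms / peak)
    return np.mean(dbfs < floor_dbfs)

# A clip would be dropped if frame_stationarity_ratio(...) > 0.85
# or silent_ratio(...) > 0.90, mirroring the filters above.
```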

3. OmniAudio Model Architecture

OmniAudio encompasses three principal computational components, each contributing to spatially and semantically coherent audio generation.

A. Dual-Branch Video Encoding

  • Global (Panoramic) Branch: Processes the raw equirectangular panorama $V_{360}$ into temporal feature vectors $f_g \in \mathbb{R}^{T \times D}$ using a MetaCLIP-Huge backbone.

  • Local (FoV) Branch: Converts central perspective crops ($V_{\mathrm{FOV}}$) into parallel feature vectors $f_\ell \in \mathbb{R}^{T \times D}$.

  • Fusion: Both representations are integrated via cross-attention within a Diffusion Transformer (DiT), preserving complementary global (panoramic) and local (perspective) spatial cues.
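A minimal PyTorch sketch of this dual-branch fusion is shown below, assuming equal feature dimensions for both branches. The placement of cross-attention inside the DiT, the head count, and the normalization are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Illustrative cross-attention fusion of panoramic and FoV features."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_g, f_l):
        # f_g, f_l: (batch, T, D) features from the global and local branches.
        # Global features query the local (FoV) features, so panoramic context
        # is enriched with fine-grained perspective cues.
        fused, _ = self.attn(query=f_g, key=f_l, value=f_l)
        return self.norm(f_g + fused)

# Example: T=32 frames, D=1024-dim features, batch of 2 clips.
f_g = torch.randn(2, 32, 1024)
f_l = torch.randn(2, 32, 1024)
print(DualBranchFusion()(f_g, f_l).shape)  # torch.Size([2, 32, 1024])
```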

B. FOA Audio Variational Autoencoder (Spatial VAE)

  • Encoder: Maps the 4-channel FOA waveform $\mathbf{a}(t)$ to a latent sequence $x \in \mathbb{R}^{L \times d}$.

  • Decoder: Reconstructs the FOA waveform from $x$.

  • Objective:

$$\mathcal{L}_\mathrm{VAE} = \|\hat{\mathbf{a}} - \mathbf{a}\|^2 + \beta\,\mathrm{KL}\left(q(z \mid \mathbf{a}) \,\|\, \mathcal{N}(0, I)\right),$$

enforcing accurate multi-channel reconstruction and regularized latent structure.
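A compact sketch of this objective, assuming a diagonal-Gaussian posterior $q(z \mid \mathbf{a})$ parameterized by mean and log-variance and a hypothetical $\beta$, could look as follows (illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def spatial_vae_loss(a_hat, a, mu, logvar, beta=1e-4):
    """Beta-VAE objective for the FOA autoencoder (illustrative).

    a_hat, a: (batch, 4, samples) FOA waveforms; mu, logvar: (batch, L, d)
    parameters of the assumed diagonal-Gaussian posterior.
    """
    recon = F.mse_loss(a_hat, a)                                   # multi-channel reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q || N(0, I))
    return recon + beta * kl
```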

C. Self-Supervised Flow-Matching Pre-training

  • Latent Path Interpolation: Between the noise prior $x_0 \sim \mathcal{N}(0, I)$ and the ground-truth latent $x_1 = E(\mathbf{a})$,

$$x_t = t\, x_1 + (1-t)\, x_0,$$

with target velocity $u(x_t \mid x_0, x_1) = x_1 - x_0$.

  • Flow-Matching Loss:

$$\mathcal{L}_\mathrm{FM} = \mathbb{E}_{t,\, q(x_0),\, q(x_1, C)} \left\| v_\theta(t, C, x_t) - (x_1 - x_0) \right\|^2,$$

applied conditionally on the video context $C$.

  • Coarse-to-Fine Curriculum:
    • Coarse: Train by masking random spans in latents from large-scale non-spatial audio, learning global audio priors.
    • Fine: Specialize on unmasked FOA latents, refining sensitivity to spatial cues.
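The pre-training objective reduces to a simple training step: sample noise, interpolate along the straight path, and regress the velocity. The sketch below assumes a velocity network v_theta(t, cond, x_t) and batched latents of shape (batch, L, d); both names and shapes are illustrative.

```python
import torch

def flow_matching_step(v_theta, x1, cond):
    """One flow-matching training step (illustrative).

    x1: (batch, L, d) VAE latents; cond: (possibly masked) conditioning
    features; v_theta returns a velocity with the same shape as x_t.
    """
    x0 = torch.randn_like(x1)            # noise prior x_0 ~ N(0, I)
    t = torch.rand(x1.size(0), 1, 1)     # one timestep per sample in [0, 1)
    xt = t * x1 + (1 - t) * x0           # linear interpolation path x_t
    target = x1 - x0                     # target velocity u = x_1 - x_0
    pred = v_theta(t, cond, xt)
    return ((pred - target) ** 2).mean() # L_FM
```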

D. Spatial-Aware Supervised Fine-Tuning and Inference

  • Supervised Conditioning: The flow-matching loss is conditioned on both video branches $(f_g, f_\ell)$ and the audio latent $x_t$.
  • Inference: The conditional ODE

$$\dot{x}_t = v_\theta(t, f_g, f_\ell, x_t)$$

is solved from the noise prior at $t=0$ to the data endpoint at $t=1$, after which the final waveform is reconstructed as $\hat{\mathbf{a}} = D(x_1)$.
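A basic Euler solver is enough to illustrate inference; the actual solver, number of steps, and any guidance scheme used by OmniAudio may differ.

```python
import torch

@torch.no_grad()
def sample_foa_latent(v_theta, f_g, f_l, shape, steps=50):
    """Euler integration of the conditional ODE from noise (t=0) to data (t=1).

    Illustrative sketch; v_theta, f_g, f_l follow the shapes assumed in the
    training sketch above.
    """
    x = torch.randn(shape)                       # start from the noise prior
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1, 1), i * dt)
        x = x + dt * v_theta(t, f_g, f_l, x)     # x_{t+dt} = x_t + dt * v_theta(...)
    return x                                     # decode with the spatial VAE: a_hat = D(x)
```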

4. Empirical Evaluation: Metrics and Results

A. Evaluation Metrics

  • Non-Spatial Quality: Fréchet Distance (FD) over OpenL3 embeddings and KL divergence on AudioSet-derived tags.
  • Spatial Accuracy:

    • Spatial indices:

    $$I_x = \mathbb{E}[W X],\quad I_y = \mathbb{E}[W Y],\quad I_z = \mathbb{E}[W Z],$$

    with angular error measures

    $$\theta = \operatorname{arctan2}(I_y, I_x),\quad \phi = \operatorname{arctan2}\!\left(I_z, \sqrt{I_x^2 + I_y^2}\right),$$

    and reporting $\Delta_{\mathrm{abs}}\theta$, $\Delta_{\mathrm{abs}}\phi$, and $\Delta_{\mathrm{Angular}}$.

  • Subjective Evaluation: MOS-SQ (subjective quality) and MOS-AF (audio fidelity), both reported on a mean opinion score scale with 95% confidence intervals.
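The spatial indices and angles above can be computed directly from an FOA clip, as in the following sketch; the benchmark's exact windowing and averaging are assumptions.

```python
import numpy as np

def spatial_direction(foa):
    """Estimate a dominant source direction from an FOA clip (illustrative).

    foa: array of shape (4, samples) holding the W, X, Y, Z channels.
    Returns the two angles defined by the intensity indices above, in radians.
    """
    W, X, Y, Z = foa
    Ix, Iy, Iz = np.mean(W * X), np.mean(W * Y), np.mean(W * Z)
    theta = np.arctan2(Iy, Ix)
    phi = np.arctan2(Iz, np.hypot(Ix, Iy))
    return theta, phi
```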

B. Performance Benchmarks

| Model | FD ↓ | KL ↓ | Δ_ang ↓ | MOS-SQ ↑ | MOS-AF ↑ |
|---|---|---|---|---|---|
| Diff-Foley + AS | 331.1 | 3.56 | – | 69.9±0.8 | 71.1±1.4 |
| MMAudio + AS | 271.2 | 2.39 | – | 75.3±1.0 | 77.6±1.2 |
| ViSAGe (FOV) | 210.9 | 2.90 | 1.49 | 73.5±1.4 | 74.9±1.7 |
| ViSAGe (360) | 219.7 | 2.96 | 1.51 | 74.1±1.2 | 75.3±1.0 |
| OmniAudio (ours) | 88.3 | 1.58 | 1.28 | 84.7±1.1 | 87.2±1.0 |

OmniAudio achieves substantial improvements across Fréchet Distance, KL divergence, angular spatial error, and both MOS metrics, demonstrating both enhanced spatial plausibility and overall audio quality on Sphere360-Bench.

C. Ablation Analyses

| Variant | FD ↓ | KL ↓ | Δ_ang ↓ |
|---|---|---|---|
| no pre-training | 104.6 | 1.83 | 1.32 |
| coarse only | 97.3 | 1.78 | 1.30 |
| fine only | 97.6 | 1.82 | 1.28 |
| coarse-to-fine | 88.3 | 1.58 | 1.28 |

The full coarse-to-fine pre-training yields the most consistent gains. For the video encoder:

  • FOV-only: FD=88.8, KL=1.87, Δ_ang=1.33
  • ERP-only: FD=97.8, KL=1.87, Δ_ang=1.28
  • Combined (dual-branch): FD=88.3, KL=1.58, Δ_ang=1.28

This suggests information from both panoramic and perspective cues enhances spatial inference.

5. Applications and Limitations

Applications

  • Virtual Reality / 360° Video Playback: Enables dynamic, spatially accurate sound fields mapped to immersive visuals, supporting interactive head-tracking.
  • Immersive Film and Gaming: Simplifies production by automating the creation of spatial soundtracks, reducing the need for manual FOA mixing.
  • Telepresence and Remote Collaboration: Provides realistic 3D sound reproduction from wide-angle acquisition, improving user experience and source localization in remote interactions.

Limitations

  • Complex Acoustic Environments: In scenes with numerous simultaneous sources, the model may be less effective at disambiguating spatial cues.
  • Dataset Diversity: Current Sphere360 coverage (288 classes, ~100k clips) provides diverse but not exhaustive real-world variation. Generalization to novel or rare environments may be constrained.

6. Prospects and Research Directions

Future directions center on extending the utility and expressiveness of the OmniAudio framework:

  • Dataset Expansion: Ongoing semi-automated acquisition aims to augment coverage and semantic variation within Sphere360.
  • Higher-Order Spatial Encoding: Adapting models for higher-order ambisonics would enable finer angular resolution in audio generation.
  • Explicit Multisource Disentanglement: Investigating object–sound correspondence could enable source separation, better handling of dense acoustic scenes, and downstream applications in audio-visual scene analysis.

A plausible implication is that incorporating explicit visual-audio correspondence modeling and further scaling up both data and model capacity may broaden applicability across more challenging real-world 360° scenes.
