OmniAudio: Spatial Audio from 360° Video

Updated 15 November 2025
  • OmniAudio is a framework that synthesizes First-Order Ambisonics audio from 360° videos by leveraging dual-branch video encoding and spatial VAE components.
  • It utilizes the large-scale Sphere360 dataset and a self-supervised flow-matching strategy to significantly enhance spatial realism and audio fidelity.
  • The approach is applicable in VR, immersive film, gaming, and telepresence, though challenges remain in handling complex acoustic environments.

OmniAudio is a framework for generating spatial First-Order Ambisonics (FOA) audio from 360-degree (panoramic) video, devised to advance the spatial realism and directional accuracy of automatically generated audio for immersive audiovisual environments. Centered around the 360V2SA (360-Degree-Video-to-Spatial-Audio) task, OmniAudio introduces both a large-scale paired dataset (Sphere360) and a dual-branch, self-supervised model architecture that collectively achieve state-of-the-art performance on spatial audio synthesis benchmarks.

1. Task Definition: 360V2SA and FOA Representation

The 360V2SA task formalizes the problem of synthesizing spatially accurate audio from omnidirectional video:

  • Input: A panoramic 360° video sequence, denoted $V_{360}$.
  • Output: A multi-channel (FOA) audio signal $\mathbf{a}(t) = [W(t), X(t), Y(t), Z(t)]^\top$.

FOA audio encodes 3D directionality via first-order real spherical harmonics:

$$
\begin{aligned}
W(\theta,\phi) &\propto Y^0_0(\theta,\phi) = \tfrac{1}{2\sqrt{\pi}}, \\
X(\theta,\phi) &\propto Y^{-1}_1(\theta,\phi) = \sqrt{\tfrac{3}{4\pi}}\,\sin\theta\cos\phi, \\
Y(\theta,\phi) &\propto Y^{0}_1(\theta,\phi) = \sqrt{\tfrac{3}{4\pi}}\,\sin\theta\sin\phi, \\
Z(\theta,\phi) &\propto Y^{1}_1(\theta,\phi) = \sqrt{\tfrac{3}{4\pi}}\,\cos\theta.
\end{aligned}
$$

where $\theta$ is elevation and $\phi$ is azimuth. Using this encoding, FOA supplies per-sample full-sphere source localization, supporting downstream interactive spatial rendering (e.g., with head-tracked VR systems).
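As a concrete illustration, the Python sketch below encodes a mono signal into the four FOA channels using the proportionalities above. It is a minimal example rather than the paper's code; the angle convention and the exact normalization of the gains are assumptions.

```python
import numpy as np

def foa_gains(theta, phi):
    """First-order gains for a source at angles (theta, phi).

    Illustrative only: follows the W/X/Y/Z proportionalities listed above;
    the paper's exact normalization convention may differ.
    """
    w = 1.0 / (2.0 * np.sqrt(np.pi))
    c = np.sqrt(3.0 / (4.0 * np.pi))
    x = c * np.sin(theta) * np.cos(phi)
    y = c * np.sin(theta) * np.sin(phi)
    z = c * np.cos(theta)
    return np.array([w, x, y, z])

# Encode 1 s of a mono signal at 48 kHz from a fixed direction:
s = np.random.randn(48000)
a = foa_gains(np.pi / 3, np.pi / 4)[:, None] * s   # shape (4, samples)
```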

2. Sphere360 Dataset: Construction Methodology and Content

The Sphere360 dataset is specifically constructed for 360V2SA:

  • Scale and Content: 103,000 video-audio pairs, each 10 seconds long, totaling approximately 288 hours and covering 288 distinct semantic event classes.
  • Semi-Automated Collection Pipeline:
    • Frame Stationarity: Clips in which more than 85% of consecutive frames are near-identical, as measured by

      $$\text{MSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(I_t(i,j) - I_{t+1}(i,j)\right)^2,$$

      are removed.
    • Silent Audio Detection: Per-segment loudness is computed as $\text{dBFS} = 20\log_{10}(p/p_{\max})$; clips with more than 90% of segments below –35 dBFS are discarded.
    • Speech-Heavy Clips: A speech detector (SenseVoice) is applied; clips with more than 5 detected spoken words are excluded.
    • Audio–Video Mismatch: Clips whose ImageBind cross-modal similarity falls below 1 are excluded.
    • Manual Inspection: A final manual pass ensures the absence of artifacts and spurious pairs.

Applying these filters to an initial pool of 166,500 candidates yields 103,000 high-quality pairs in the final dataset.
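For concreteness, the sketch below implements two of the automated filters (frame stationarity by MSE, and silence by per-segment dBFS) in Python. The segment length, the MSE near-identity threshold, and the use of segment RMS relative to the clip peak for $p/p_{\max}$ are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def frame_stationarity_ratio(frames, mse_thresh=1e-3):
    """Fraction of consecutive-frame pairs that are near-identical by MSE.

    frames: float array of shape (T, H, W); mse_thresh is a hypothetical
    stand-in for the paper's near-identity criterion.
    """
    mse = np.mean((frames[1:] - frames[:-1]) ** 2, axis=(1, 2))
    return np.mean(mse < mse_thresh)

def silent_ratio(audio, frame_len=4800, floor_dbfs=-35.0):
    """Fraction of fixed-length segments whose dBFS falls below the floor."""
    peak = np.max(np.abs(audio)) + 1e-12              # reference level p_max
    segs = audio[: len(audio) // frame_len * frame_len].reshape(-1, frame_len)
    rms = np.sqrt(np.mean(segs ** 2, axis=1)) + 1e-12  # segment level p
    dbfs = 20.0 * np.log10(rms / peak)
    return np.mean(dbfs < floor_dbfs)

# A clip would be dropped if frame_stationarity_ratio(...) > 0.85
# or silent_ratio(...) > 0.90, mirroring the filters above.
```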

3. OmniAudio Model Architecture

OmniAudio encompasses three principal computational components, each contributing to spatially and semantically coherent audio generation.

A. Dual-Branch Video Encoding

  • Global (Panoramic) Branch: Processes the raw equirectangular panorama $V_{360}$ into temporal feature vectors $f_g \in \mathbb{R}^{T \times D}$ using a MetaCLIP-Huge backbone.

  • Local (FoV) Branch: Converts central perspective crops ($V_{\mathrm{FOV}}$) into parallel feature vectors $f_\ell \in \mathbb{R}^{T \times D}$.

  • Fusion: Both representations are integrated via cross-attention within a Diffusion Transformer (DiT), preserving complementary global (panoramic) and local (perspective) spatial cues.
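A minimal PyTorch sketch of this dual-branch fusion is shown below, assuming equal feature dimensions for both branches. The placement of cross-attention inside the DiT, the head count, and the normalization are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Illustrative cross-attention fusion of panoramic and FoV features."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_g, f_l):
        # f_g, f_l: (batch, T, D) features from the global and local branches.
        # Global features query the local (FoV) features, so panoramic context
        # is enriched with fine-grained perspective cues.
        fused, _ = self.attn(query=f_g, key=f_l, value=f_l)
        return self.norm(f_g + fused)

# Example: T=32 frames, D=1024-dim features, batch of 2 clips.
f_g = torch.randn(2, 32, 1024)
f_l = torch.randn(2, 32, 1024)
print(DualBranchFusion()(f_g, f_l).shape)  # torch.Size([2, 32, 1024])
```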

B. FOA Audio Variational Autoencoder (Spatial VAE)

  • Encoder: Maps the 4-channel FOA waveform $\mathbf{a}(t)$ to a latent sequence $x \in \mathbb{R}^{L \times d}$.

  • Decoder: Reconstructs the FOA waveform from $x$.

  • Objective:

$$\mathcal{L}_\mathrm{VAE} = \|\hat{\mathbf{a}} - \mathbf{a}\|^2 + \beta\,\mathrm{KL}\left(q(z \mid \mathbf{a}) \,\|\, \mathcal{N}(0, I)\right),$$

enforcing accurate multi-channel reconstruction and regularized latent structure.
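A compact sketch of this objective, assuming a diagonal-Gaussian posterior $q(z \mid \mathbf{a})$ parameterized by mean and log-variance and a hypothetical $\beta$, could look as follows (illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def spatial_vae_loss(a_hat, a, mu, logvar, beta=1e-4):
    """Beta-VAE objective for the FOA autoencoder (illustrative).

    a_hat, a: (batch, 4, samples) FOA waveforms; mu, logvar: (batch, L, d)
    parameters of the assumed diagonal-Gaussian posterior.
    """
    recon = F.mse_loss(a_hat, a)                                   # multi-channel reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q || N(0, I))
    return recon + beta * kl
```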

C. Self-Supervised Flow-Matching Pre-training

  • Latent Path Interpolation: Between the noise prior $x_0 \sim \mathcal{N}(0, I)$ and the ground-truth latent $x_1 = E(\mathbf{a})$,

$$x_t = t\, x_1 + (1-t)\, x_0,$$

with target velocity $u(x_t \mid x_0, x_1) = x_1 - x_0$.

  • Flow-Matching Loss:

$$\mathcal{L}_\mathrm{FM} = \mathbb{E}_{t,\, q(x_0),\, q(x_1, C)} \left\| v_\theta(t, C, x_t) - (x_1 - x_0) \right\|^2,$$

applied conditionally on the video context $C$.

  • Coarse-to-Fine Curriculum:
    • Coarse: Train by masking random spans in latents from large-scale non-spatial audio, learning global audio priors.
    • Fine: Specialize on unmasked FOA latents, refining sensitivity to spatial cues.
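The pre-training objective reduces to a simple training step: sample noise, interpolate along the straight path, and regress the velocity. The sketch below assumes a velocity network v_theta(t, cond, x_t) and batched latents of shape (batch, L, d); both names and shapes are illustrative.

```python
import torch

def flow_matching_step(v_theta, x1, cond):
    """One flow-matching training step (illustrative).

    x1: (batch, L, d) VAE latents; cond: (possibly masked) conditioning
    features; v_theta returns a velocity with the same shape as x_t.
    """
    x0 = torch.randn_like(x1)            # noise prior x_0 ~ N(0, I)
    t = torch.rand(x1.size(0), 1, 1)     # one timestep per sample in [0, 1)
    xt = t * x1 + (1 - t) * x0           # linear interpolation path x_t
    target = x1 - x0                     # target velocity u = x_1 - x_0
    pred = v_theta(t, cond, xt)
    return ((pred - target) ** 2).mean() # L_FM
```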

D. Spatial-Aware Supervised Fine-Tuning and Inference

  • Supervised Conditioning: The flow-matching loss is conditioned on both video branches $(f_g, f_\ell)$ and the audio latent $x_t$.
  • Inference: The conditional ODE

$$\dot{x}_t = v_\theta(t, f_g, f_\ell, x_t)$$

is solved from the noise prior at $t=0$ to the data endpoint at $t=1$, after which the final waveform is reconstructed as $\hat{\mathbf{a}} = D(x_1)$.
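A basic Euler solver is enough to illustrate inference; the actual solver, number of steps, and any guidance scheme used by OmniAudio may differ.

```python
import torch

@torch.no_grad()
def sample_foa_latent(v_theta, f_g, f_l, shape, steps=50):
    """Euler integration of the conditional ODE from noise (t=0) to data (t=1).

    Illustrative sketch; v_theta, f_g, f_l follow the shapes assumed in the
    training sketch above.
    """
    x = torch.randn(shape)                       # start from the noise prior
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1, 1), i * dt)
        x = x + dt * v_theta(t, f_g, f_l, x)     # x_{t+dt} = x_t + dt * v_theta(...)
    return x                                     # decode with the spatial VAE: a_hat = D(x)
```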

4. Empirical Evaluation: Metrics and Results

A. Evaluation Metrics

  • Non-Spatial Quality: Fréchet Distance (FD) over OpenL3 embeddings and KL divergence on AudioSet-derived tags.
  • Spatial Accuracy:

    • Spatial indices:

    $$I_x = \mathbb{E}[W X],\quad I_y = \mathbb{E}[W Y],\quad I_z = \mathbb{E}[W Z],$$

    with angular error measures

    $$\theta = \operatorname{arctan2}(I_y, I_x),\quad \phi = \operatorname{arctan2}\!\left(I_z, \sqrt{I_x^2 + I_y^2}\right),$$

    and reporting $\Delta_{\mathrm{abs}}\theta$, $\Delta_{\mathrm{abs}}\phi$, and $\Delta_{\mathrm{Angular}}$.

  • Subjective Evaluation: MOS-SQ (subjective quality) and MOS-AF (audio fidelity), both reported on a mean opinion score scale with 95% confidence intervals.
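The spatial indices and angles above can be computed directly from an FOA clip, as in the following sketch; the benchmark's exact windowing and averaging are assumptions.

```python
import numpy as np

def spatial_direction(foa):
    """Estimate a dominant source direction from an FOA clip (illustrative).

    foa: array of shape (4, samples) holding the W, X, Y, Z channels.
    Returns the two angles defined by the intensity indices above, in radians.
    """
    W, X, Y, Z = foa
    Ix, Iy, Iz = np.mean(W * X), np.mean(W * Y), np.mean(W * Z)
    theta = np.arctan2(Iy, Ix)
    phi = np.arctan2(Iz, np.hypot(Ix, Iy))
    return theta, phi
```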

B. Performance Benchmarks

| Model | FD ↓ | KL ↓ | Δ_ang ↓ | MOS-SQ ↑ | MOS-AF ↑ |
|---|---|---|---|---|---|
| Diff-Foley + AS | 331.1 | 3.56 | – | 69.9±0.8 | 71.1±1.4 |
| MMAudio + AS | 271.2 | 2.39 | – | 75.3±1.0 | 77.6±1.2 |
| ViSAGe (FOV) | 210.9 | 2.90 | 1.49 | 73.5±1.4 | 74.9±1.7 |
| ViSAGe (360) | 219.7 | 2.96 | 1.51 | 74.1±1.2 | 75.3±1.0 |
| OmniAudio (ours) | 88.3 | 1.58 | 1.28 | 84.7±1.1 | 87.2±1.0 |

OmniAudio achieves substantial improvements across Fréchet Distance, KL divergence, angular spatial error, and both MOS metrics, demonstrating both enhanced spatial plausibility and overall audio quality on Sphere360-Bench.

C. Ablation Analyses

| Variant | FD ↓ | KL ↓ | Δ_ang ↓ |
|---|---|---|---|
| no pre-training | 104.6 | 1.83 | 1.32 |
| coarse only | 97.3 | 1.78 | 1.30 |
| fine only | 97.6 | 1.82 | 1.28 |
| coarse-to-fine | 88.3 | 1.58 | 1.28 |

The full coarse-to-fine pre-training yields the most consistent gains. For the video encoder:

  • FOV-only: FD=88.8, KL=1.87, Δ_ang=1.33
  • ERP-only: FD=97.8, KL=1.87, Δ_ang=1.28
  • Combined (dual-branch): FD=88.3, KL=1.58, Δ_ang=1.28

This suggests information from both panoramic and perspective cues enhances spatial inference.

5. Applications and Limitations

Applications

  • Virtual Reality / 360° Video Playback: Enables dynamic, spatially accurate sound fields mapped to immersive visuals, supporting interactive head-tracking.
  • Immersive Film and Gaming: Simplifies production by automating the creation of spatial soundtracks, reducing the need for manual FOA mixing.
  • Telepresence and Remote Collaboration: Provides realistic 3D sound reproduction from wide-angle acquisition, improving user experience and source localization in remote interactions.

Limitations

  • Complex Acoustic Environments: In scenes with numerous simultaneous sources, the model may be less effective at disambiguating spatial cues.
  • Dataset Diversity: Current Sphere360 coverage (288 classes, ~100k clips) provides diverse but not exhaustive real-world variation. Generalization to novel or rare environments may be constrained.

6. Prospects and Research Directions

Future directions center on extending the utility and expressiveness of the OmniAudio framework:

  • Dataset Expansion: Ongoing semi-automated acquisition aims to augment coverage and semantic variation within Sphere360.
  • Higher-Order Spatial Encoding: Adapting models for higher-order ambisonics would enable finer angular resolution in audio generation.
  • Explicit Multisource Disentanglement: Investigating object–sound correspondence could enable source separation, better handling of dense acoustic scenes, and downstream applications in audio-visual scene analysis.

A plausible implication is that incorporating explicit visual-audio correspondence modeling and further scaling up both data and model capacity may broaden applicability across more challenging real-world 360° scenes.
