
Audio-Visual World Models (AVWM)

Updated 7 December 2025
  • Audio-Visual World Models (AVWM) are computational frameworks that bridge auditory and visual data to accurately model the dynamics and structure of complex environments.
  • They employ modality-specific encoders and fusion methods, enabling improved spatial reasoning and agent navigation through realistic simulation scenarios.
  • AVWM training leverages staged curricula and specialized datasets to boost simulation fidelity and extend applications from synthetic to real-world settings.

Audio-Visual World Models (AVWM) are a class of computational frameworks that jointly model the dynamics, structure, and semantics of environments using both audio and visual sensing. By integrating synchronized audio and vision modalities, AVWMs enable agents and systems to perform spatial reasoning, environmental simulation, semantic understanding, and interactive planning with a fidelity unattainable by visual-only or audio-only approaches. Recent research has established formal definitions, benchmarked architectures, and introduced new datasets that collectively advance AVWM as an essential paradigm for multisensory machine perception (Wang et al., 30 Nov 2025, Sagare et al., 21 Jul 2024, Liang et al., 2023, Purushwalkam et al., 2020).

1. Formalization and Core Problem Statement

The formalization of AVWM is anchored in the framework of partially observable Markov decision processes (POMDPs) that are extended to the multisensory regime (Wang et al., 30 Nov 2025). An AVWM models a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, p, r)$:

  • $\mathcal{S}$: latent (unobserved) states, e.g., the true configuration of the environment.
  • $\mathcal{A}$: agent action space, often including translation and rotation.
  • $\mathcal{O} = \mathcal{O}_v \times \mathcal{O}_a$: observation space comprising synchronized visual ($o^v_t$) and audio ($o^a_t$) signals.
  • $p(s_{t+1} \mid s_t, a_t)$: transition dynamics.
  • $r(s_t, a_t)$: task-driven reward, e.g., based on proximity to an audio-emitting source.

At each step, the AVWM aims to predict future synchronized audio-visual observations and rewards conditioned on a recent history and a planned action sequence:

$(\hat o_{t+\Delta t}, \hat r_{t+\Delta t}) \sim p_{\theta}(o_{t+\Delta t}, r_{t+\Delta t} \mid o_{t-m+1:t}, a_{t \rightarrow t+\Delta t}, \Delta t)$

This formulation generalizes classical visual world models to embrace the multimodal nature of real-world perception, leveraging spatial and temporal cues from audio that are critical for occlusion reasoning, scene completion, and agent policy learning (Wang et al., 30 Nov 2025, Purushwalkam et al., 2020).
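
To make this predictive contract concrete, the following is a minimal Python sketch of the interface implied by the formulation above. The class and method names (AVObservation, AVWorldModel, predict) and the persistence baseline are illustrative assumptions, not components of any cited system.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple
import numpy as np


@dataclass
class AVObservation:
    """One synchronized audio-visual observation o_t = (o^v_t, o^a_t)."""
    rgb: np.ndarray     # visual frame, e.g. (H, W, 3)
    audio: np.ndarray   # binaural waveform chunk, e.g. (2, num_samples)


class AVWorldModel:
    """Minimal predictive interface:
    p_theta(o_{t+dt}, r_{t+dt} | o_{t-m+1:t}, a_{t->t+dt}, dt)."""

    def predict(
        self,
        history: Sequence[AVObservation],   # o_{t-m+1:t}
        actions: Sequence[int],             # planned actions a_{t -> t+dt}
    ) -> Tuple[AVObservation, float]:
        """Return a sampled future observation and predicted reward
        at horizon dt = len(actions)."""
        raise NotImplementedError


class PersistenceBaseline(AVWorldModel):
    """Trivial baseline: the last observation repeats and reward is zero."""

    def predict(self, history, actions):
        return history[-1], 0.0


if __name__ == "__main__":
    obs = [AVObservation(rgb=np.zeros((64, 64, 3)), audio=np.zeros((2, 16000)))
           for _ in range(4)]
    future_obs, future_r = PersistenceBaseline().predict(obs, actions=[0, 1, 1])
    print(future_obs.rgb.shape, future_r)
```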

2. Architectural Paradigms for AVWM

A range of AVWM architectures has been introduced to address simulation, grounding, and inference:

  • Modality-specific encoders: Separate branches process the audio and visual streams, typically using transformer-based (or CNN-based) backbones for each modality, for instance Whisper for audio and sigLIP or Vision Transformers for vision (Sagare et al., 21 Jul 2024, Wang et al., 30 Nov 2025).
  • Latent space projection and alignment: Modality-specific features are projected to a shared latent space using MLPs, facilitating joint modeling within a unified Transformer or diffusion framework (Sagare et al., 21 Jul 2024, Wang et al., 30 Nov 2025).
  • Fusion mechanisms: Common strategies include concatenation of modality tokens (allowing joint self-attention), explicit cross-modal attention, and modality-expert feedforward layers that preserve modality-specific inductive biases while enabling multimodal integration (Wang et al., 30 Nov 2025); see the sketch after Table 1.
  • Generative modeling: Recent work introduces diffusion-based transformers (e.g., AV-CDiT) that operate autoregressively in latent token space for sight, sound, and reward simulation under action control; spatialized audio generation is achieved by learning acoustic fields consistent with the 3D geometry and listener pose (Liang et al., 2023, Wang et al., 30 Nov 2025).

Table 1: Representative AVWM Architectural Components

Model/System        | Audio Encoder       | Visual Encoder    | Fusion/Backbone
AV-CDiT             | AudioEnc + Adapter  | VAEEnc + Adapter  | Diffusion Transformer
Video-Text LLM AVWM | Whisper             | sigLIP            | φ-2 LLM (autoregressive)
AV-NeRF             | A-NeRF (MLPs)       | V-NeRF (MLP)      | Joint NeRF geometry
AV-Map              | Sound-event (1D NN) | ResNet-18         | ConvSelfAttention-UNet
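
The fusion strategy described in the list above can be illustrated with a minimal PyTorch sketch: joint self-attention over concatenated audio and visual tokens, followed by per-modality "expert" feedforward layers. Dimensions, module names, and the residual layout are assumptions for illustration, not the architecture of AV-CDiT.

```python
import torch
import torch.nn as nn


class ModalityExpertBlock(nn.Module):
    """Joint self-attention over concatenated audio/visual tokens,
    followed by separate ('expert') feedforward layers per modality."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn_audio = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_visual = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor):
        # Concatenate along the token axis so attention is computed jointly.
        x = torch.cat([audio_tokens, visual_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        a, v = x[:, : audio_tokens.shape[1]], x[:, audio_tokens.shape[1]:]
        # Modality-expert feedforward preserves per-modality inductive biases.
        a = a + self.ffn_audio(self.norm2(a))
        v = v + self.ffn_visual(self.norm2(v))
        return a, v


if __name__ == "__main__":
    a = torch.randn(2, 16, 256)   # audio tokens (B, Na, D)
    v = torch.randn(2, 64, 256)   # visual tokens (B, Nv, D)
    a_out, v_out = ModalityExpertBlock()(a, v)
    print(a_out.shape, v_out.shape)
```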

3. Training Strategies and Datasets

AVWM training necessitates synchronized, high-quality audio-visual data, explicit reward signals (for agent-centric tasks), and specialized objectives reflecting the multimodal nature of prediction.

  • AVW-4k dataset (Wang et al., 30 Nov 2025): 30 hours of agent-centered navigation within 76 synthetic indoor scenes, offering binaural audio at 16 kHz, low-res RGB frames, fine-grained navigation actions, and per-step reward (geodesic decrease to audio source).
  • VideoInstruct100K dataset (Sagare et al., 21 Jul 2024): 100,000 video-QA pairs (video, audio, text), enabling instruction-tuning of video-text LLMs with audio.
  • RWAVS dataset (Liang et al., 2023): Real-world AV trajectories (office, house, apartment, outdoor), with synchronized binaural audio, camera pose, and mono/stereo signals.
  • Matterport/SoundSpaces, AV-Map (Purushwalkam et al., 2020): Real and synthetic multichannel audio-visual data for floorplan inference.

Progressive training: a stage-wise curriculum stabilizes multimodal learning. Visual-only pretraining, followed by audio/reward-specific adaptation and then joint fine-tuning, accelerates convergence and mitigates catastrophic forgetting (Wang et al., 30 Nov 2025). Losses include standard cross-entropy (for language modeling in LLM-based AVWM), $\ell_2$ prediction over latent tokens in diffusion-based models, and photometric/spectrogram/semantic pixelwise losses for NeRF/UNet-based approaches (Sagare et al., 21 Jul 2024, Wang et al., 30 Nov 2025, Liang et al., 2023, Purushwalkam et al., 2020).
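
A minimal sketch of such a staged curriculum is given below, assuming a model object with hypothetical visual, audio, and reward submodules and a loss(batch) method. The schedule illustrates the freeze/unfreeze pattern rather than any released training code.

```python
import torch


def progressive_training(model, loaders):
    """Illustrative three-stage curriculum for an audio-visual world model.

    Stage names, parameter groups, and the loss attribute are hypothetical;
    the point is the freeze/unfreeze schedule, not a specific codebase.
    """
    stages = [
        # (name, parameter groups trained, data loader, epochs)
        ("visual_pretrain", ["visual"],                   loaders["visual_only"], 10),
        ("audio_reward",    ["audio", "reward"],          loaders["audio_visual"], 5),
        ("joint_finetune",  ["visual", "audio", "reward"], loaders["audio_visual"], 5),
    ]
    for name, groups, loader, epochs in stages:
        # Freeze everything, then unfreeze only the groups for this stage.
        for p in model.parameters():
            p.requires_grad_(False)
        params = []
        for g in groups:
            for p in getattr(model, g).parameters():
                p.requires_grad_(True)
                params.append(p)
        optim = torch.optim.AdamW(params, lr=1e-4)
        for _ in range(epochs):
            for batch in loader:
                loss = model.loss(batch)   # e.g. l2 over latent tokens + reward loss
                optim.zero_grad()
                loss.backward()
                optim.step()
        print(f"finished stage: {name}")
```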

4. Domains of Application and Evaluation

4.1. Audio-Visual Navigation and Simulation

Precise simulation of both visual dynamics and spatial audio is essential for navigation agents to predict unseen states (sight and sound) under hypothetical action sequences. AV-CDiT demonstrates high-fidelity multimodal imagination, reaching an LPIPS of 0.38 (visual), a log-spectral distance of 1.31 (audio), and an MSE of 0.75 (reward, 16-step rollout) on AVW-4k, where lower is better for all three (Wang et al., 30 Nov 2025). Integrating the AVWM into a lookahead beam-search planner improves navigation performance, raising SPL (success weighted by path length) and reducing the average step count relative to audio-agnostic or visual-only baselines.
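
A schematic version of such a lookahead planner is sketched below: candidate action sequences are scored by the cumulative reward the world model imagines for them, and only the first action of the best sequence is executed. The beam width, horizon, and world-model interface (the predict method from the sketch in Section 1) are illustrative assumptions.

```python
def lookahead_plan(world_model, history, action_space, horizon=4, beam_width=8):
    """Score short action sequences by the reward the world model imagines,
    keeping the best `beam_width` prefixes at each depth (beam search)."""
    beams = [((), 0.0)]   # (action prefix, cumulative imagined reward)
    for _ in range(horizon):
        candidates = []
        for prefix, score in beams:
            for a in action_space:
                seq = prefix + (a,)
                # Imagine the outcome of the extended sequence; only the
                # predicted reward is needed for scoring here.
                _, r_hat = world_model.predict(history, list(seq))
                candidates.append((seq, score + r_hat))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq, best_score = beams[0]
    # Execute only the first action, then replan (receding-horizon control).
    return best_seq[0], best_seq, best_score
```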

4.2. Audio-Visual Grounding and Video-Text Comprehension

Instructed Audio-Visual World Models for video-text LLMs enable finer-grained grounding by leveraging audio cues in response generation. On human-annotated audio-visual QA benchmarks, adding audio improves “correctness” (2.77 vs. 2.34), “contextual” (3.04 vs. 2.75), and “temporal” (2.40 vs. 2.17) ratings (1–5 scale), outperforming vision-only and prior audio-visual models. Explicit audio-visual data exposure, even without specialized alignment loss, yields substantial gains in detail, context, and temporal reasoning (Sagare et al., 21 Jul 2024).
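
The sketch below shows one plausible way such a model could feed audio into the LLM: frozen Whisper and sigLIP features are projected into the LLM embedding space and prepended to the text tokens. The projector shapes, default dimensions, and the ordering of the multimodal prefix are assumptions for illustration, not the published recipe.

```python
import torch
import torch.nn as nn


class AVProjector(nn.Module):
    """Project frozen audio (e.g. Whisper) and visual (e.g. sigLIP) features
    into the LLM embedding space so they can be prepended to text tokens."""

    def __init__(self, audio_dim=1280, visual_dim=1152, llm_dim=2560):
        super().__init__()
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.visual_proj = nn.Sequential(
            nn.Linear(visual_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats: (B, Ta, audio_dim), visual_feats: (B, Tv, visual_dim),
        # text_embeds: (B, Tt, llm_dim) from the LLM's own embedding table.
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        # Multimodal prefix followed by the instruction/question tokens.
        return torch.cat([v, a, text_embeds], dim=1)
```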

4.3. Scene Synthesis, Mapping, and Completion

AVWMs extend to simulating plausible multi-sensory experiences along new agent trajectories and to inferring environment structure. AV-NeRF achieves joint audio-visual scene synthesis with matching binaural audio at novel poses, outperforming prior methods on both magnitude and envelope spectrogram metrics (MAG 1.50, ENV 0.145 on RWAVS; lower is better) (Liang et al., 2023). AV-Map fuses egocentric vision and ambient audio to infer large-scale 2D floorplans and semantic room labels, surpassing vision-only baselines by +8–13 AP in area and room-type recovery (Purushwalkam et al., 2020).
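
A simplified version of the pose-based registration underlying such mapping is sketched below: per-step fused features are scattered into a global top-down grid using the agent's pose. The single-feature-vector-per-step simplification, grid size, and cell resolution are assumptions for illustration, not AV-Map's actual pipeline.

```python
import numpy as np


def register_to_topdown(features, poses, grid_size=128, cell_m=0.5):
    """Accumulate per-step feature vectors into a global top-down grid.

    features: (T, D) fused audio-visual features, one per time step (assumed).
    poses:    (T, 2) agent (x, y) positions in metres, world frame.
    Returns a (grid_size, grid_size, D) map; revisited cells are averaged.
    """
    T, D = features.shape
    grid = np.zeros((grid_size, grid_size, D), dtype=np.float32)
    counts = np.zeros((grid_size, grid_size, 1), dtype=np.float32)
    origin = grid_size // 2   # place the start pose at the map centre
    for t in range(T):
        ix = int(round(poses[t, 0] / cell_m)) + origin
        iy = int(round(poses[t, 1] / cell_m)) + origin
        if 0 <= ix < grid_size and 0 <= iy < grid_size:
            grid[iy, ix] += features[t]
            counts[iy, ix] += 1.0
    return grid / np.maximum(counts, 1.0)


if __name__ == "__main__":
    feats = np.random.randn(50, 64).astype(np.float32)
    poses = np.cumsum(np.random.uniform(-0.5, 0.5, size=(50, 2)), axis=0)
    print(register_to_topdown(feats, poses).shape)   # (128, 128, 64)
```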

5. Inductive Biases, Representational Mechanisms, and Limitations

AVWM research leverages domain-specific inductive biases:

  • Acoustic propagation priors: Distance attenuation and head-related transfer effects are encoded implicitly via frequency-wise masks (AV-NeRF) or explicitly through binaural spatialization (Liang et al., 2023); see the sketch after this list.
  • Spatial top-down feature alignment: AV-Map aligns multimodal observations into a global metric frame using pose and positional encoding, crucial for mapping and coherent cross-modal reasoning (Purushwalkam et al., 2020).
  • Latent space modularity: Modality expert layers in diffusion Transformers enable independent and joint nonlinearities for each modality, mitigating performance imbalance and preserving unimodal quality (Wang et al., 30 Nov 2025).
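
The acoustic-propagation prior in the first item can be illustrated with a simplified spatializer that predicts pose-conditioned frequency-wise masks and applies them to a mono source spectrogram to obtain left/right channels. The two-mask (mixture plus left/right difference) parameterization and the network shape are assumptions in the spirit of, not a reproduction of, AV-NeRF's acoustic field.

```python
import torch
import torch.nn as nn


class FrequencyMaskSpatializer(nn.Module):
    """Pose-conditioned frequency-wise masks applied to a mono source
    spectrogram to produce left/right magnitude spectrograms (simplified)."""

    def __init__(self, pose_dim=6, n_freq=257, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq),   # mixture mask + left/right difference mask
        )

    def forward(self, pose, mono_mag):
        # pose: (B, pose_dim) listener position/orientation; mono_mag: (B, n_freq, T)
        masks = self.net(pose)
        mix, diff = masks.chunk(2, dim=-1)            # (B, n_freq) each
        mix = torch.sigmoid(mix).unsqueeze(-1)        # overall per-frequency attenuation
        diff = torch.tanh(diff).unsqueeze(-1)         # interaural level difference
        left = mono_mag * mix * (1.0 + diff)
        right = mono_mag * mix * (1.0 - diff)
        return left, right


if __name__ == "__main__":
    spat = FrequencyMaskSpatializer()
    l, r = spat(torch.randn(2, 6), torch.rand(2, 257, 100))
    print(l.shape, r.shape)
```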

Principal limitations include reliance on synthetic datasets, simplifications such as fixed or single sound sources, per-scene model retraining, and the need for stronger long-horizon reasoning (beyond 16-step rollouts). Coarse frame-level A/V alignment and the absence of explicit cross-modal alignment objectives remain open challenges (Wang et al., 30 Nov 2025, Liang et al., 2023, Sagare et al., 21 Jul 2024).

6. Prospects and Future Directions

Emerging research directions, following from the limitations above, include scaling to real-world and more diverse data, handling multiple or moving sound sources, avoiding per-scene retraining, extending prediction to longer horizons, and introducing explicit cross-modal alignment objectives (Wang et al., 30 Nov 2025, Liang et al., 2023, Sagare et al., 21 Jul 2024).

A plausible implication is that future AVWM architectures will be fundamental not just for embodied agent navigation or mapping, but for general-purpose artificial intelligence capable of genuinely multisensory imagination and reasoning in complex environments.
