Stereo Audio-Video Generation

Updated 17 December 2025
  • Stereo audio-video generation is the process of synthesizing spatially coherent, temporally aligned stereo audio from visual inputs, ensuring audio cues accurately reflect scene geometry.
  • The field leverages advanced methods such as diffusion transformers, cross-attention U-Nets, and causal autoregressive models for immersive VR, telepresence, and multimedia applications.
  • Key challenges include achieving semantic channel separation, balancing spatial width with signal fidelity, and developing robust evaluation metrics and datasets for spatial alignment.

Stereo audio-video generation refers to the set of computational methodologies and models that synthesize temporally synchronized, spatially coherent stereo or binaural audio tracks from visual inputs (video), often augmented by other modalities such as text or monaural audio. Unlike mono audio generation, which produces a single audio channel, stereo generation requires precise modeling of spatial audio cues (interaural level and time differences, localization, and panning) to ensure that auditory events in the generated audio are aligned with the spatial position and motion of sound-producing objects in the corresponding video frames. This enables immersive experiences in settings such as multimedia authoring, virtual reality, robotics, and telepresence systems.
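
To make these spatial cues concrete, the sketch below (an illustration, not a method from any cited paper) renders a mono source at a given azimuth into a stereo pair by applying an interaural level difference via a constant-power pan law and an approximate interaural time delay; the head-radius and speed-of-sound constants are generic assumptions.

```python
import numpy as np

def render_stereo(mono: np.ndarray, azimuth_deg: float, sr: int = 48000) -> np.ndarray:
    """Pan a mono signal to stereo using simple ILD + ITD cues.

    azimuth_deg: source angle, -90 (hard left) .. +90 (hard right).
    Returns an array of shape (2, n_samples): [left, right].
    """
    az = np.deg2rad(np.clip(azimuth_deg, -90.0, 90.0))

    # Interaural level difference via a constant-power pan law.
    pan = (az + np.pi / 2) / np.pi            # 0 = hard left, 1 = hard right
    gain_l = np.cos(pan * np.pi / 2)
    gain_r = np.sin(pan * np.pi / 2)

    # Interaural time difference (Woodworth-style approximation).
    head_radius = 0.0875                       # metres, illustrative value
    c = 343.0                                  # speed of sound, m/s
    itd = head_radius / c * (az + np.sin(az))  # seconds, positive = right ear leads
    delay = int(round(abs(itd) * sr))          # whole-sample delay for simplicity

    left, right = gain_l * mono, gain_r * mono
    if itd > 0:                                # source on the right: delay the left ear
        left = np.concatenate([np.zeros(delay), left])[: len(mono)]
    elif itd < 0:                              # source on the left: delay the right ear
        right = np.concatenate([np.zeros(delay), right])[: len(mono)]
    return np.stack([left, right])
```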

1. Problem Definition and Research Landscape

Stereo audio-video generation formalizes a suite of tasks where, given video frames (and optionally text or mono audio), the model synthesizes left-right channel waveforms whose spatial cues correspond with the visual content. The field encompasses several problem settings, including visually guided spatialization of a given mono track into stereo or binaural audio, direct video-to-stereo Foley generation, and joint audio-video generation within a single multimodal model.

A central challenge is achieving robust spatial alignment, ensuring that the auditory scene presented to listeners matches the geometric and semantic properties of the visual scene. Misalignment (for example, sound emanating from a visually mismatched direction) breaks immersion and reduces plausibility—especially critical for applications in VR, telepresence, and robotic interaction (Shimada et al., 18 Dec 2024).

The landscape has evolved from early discriminative fusion models (e.g. U-Nets with visual branching (Zhou et al., 2020)) to generative adversarial approaches (Li et al., 2023), then to large-scale latent diffusion transformers and flow-matching architectures that can scale to long sequences, diverse sound modalities, and industrial-scale datasets (Li et al., 29 Dec 2024, Wang et al., 24 Jun 2025, Karchkhadze et al., 22 Sep 2025).

2. Model Architectures and Fusion Mechanisms

The architectural paradigm is dominated by multimodal encoders, latent diffusion backbones, and fusion blocks that align spatial cues across video and audio. Notable instantiations include:

Spatial-Audio Conditioned Diffusion Models:

  • CCStereo (Chen et al., 6 Jan 2025) uses ResNet visual encoders and a U-Net audio encoder/decoder, with cross-attention-based fusion at the audio bottleneck. The Audio-Visual Adaptive De-normalization (AVAD) layer modulates the U-Net's batch-normalization statistics according to visual context, injecting semantic audio-visual alignment directly into the feature decoding path; a minimal sketch of this style of visually conditioned normalization appears after this list.
  • StereoSync leverages depth maps and object bounding-box sequences extracted from video, projecting them into the cross-attention layers of a frozen latent-diffusion audio generator (U-Net), significantly improving AV-Align metrics and spatial correspondence to moving objects (Marinoni et al., 7 Oct 2025).
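
The conditioning idea behind AVAD, re-scaling normalized audio features with statistics predicted from visual context, can be approximated with a FiLM-style layer. The PyTorch sketch below is a minimal, assumed implementation (pooled visual embedding, per-channel scale and shift), not the authors' exact layer.

```python
import torch
import torch.nn as nn

class VisuallyConditionedDenorm(nn.Module):
    """Sketch of AVAD-style conditioning: normalize audio features, then
    re-scale and re-shift them with gains/biases predicted from visual context."""

    def __init__(self, audio_channels: int, visual_dim: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(audio_channels, affine=False)  # statistics only, no learned affine
        self.to_gamma = nn.Linear(visual_dim, audio_channels)     # visual -> per-channel scale
        self.to_beta = nn.Linear(visual_dim, audio_channels)      # visual -> per-channel shift

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, C, F, T) spectrogram features; visual_feat: (B, visual_dim) pooled frame embedding
        x = self.norm(audio_feat)
        gamma = self.to_gamma(visual_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(visual_feat).unsqueeze(-1).unsqueeze(-1)
        return (1.0 + gamma) * x + beta
```

In a spatialization U-Net, such a layer would sit in the decoder in place of a vanilla normalization, so that left/right feature statistics track the visual scene.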

Unified Multimodal Transformers:

  • UniForm (Zhao et al., 6 Feb 2025) and Kling-Foley (Wang et al., 24 Jun 2025) employ single diffusion transformer backbones that process audio and video tokens in a joint space, with specialized conditioning tokens and noise schedules for each sub-task. Extension to stereo is achieved by parallel or concatenated latent representations for left/right channels, with cross-modal attention learning dependencies between all modalities.
  • Tri-Ergon (Li et al., 29 Dec 2024) introduces fine-grained per-channel loudness control (LUFS embedding), supporting not only spatialization but also precise temporal control over loudness trajectories in stereo outputs; a toy illustration of such a loudness-trajectory conditioning signal follows this list.
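
As a toy illustration of the kind of per-channel loudness trajectory such conditioning uses, the sketch below computes a crude RMS-based loudness curve for each channel. True LUFS measurement requires K-weighting and gating per ITU-R BS.1770, so this is only a shape-level approximation.

```python
import numpy as np

def loudness_trajectory(stereo: np.ndarray, sr: int, win_s: float = 0.4, hop_s: float = 0.1) -> np.ndarray:
    """Crude per-channel short-term loudness curves in dBFS.

    stereo: array of shape (2, n_samples).
    Returns (2, n_frames). Real LUFS needs K-weighting and gating (ITU-R BS.1770);
    this RMS proxy only illustrates the shape of the conditioning signal.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = []
    for start in range(0, stereo.shape[1] - win + 1, hop):
        chunk = stereo[:, start:start + win]
        rms = np.sqrt(np.mean(chunk ** 2, axis=1) + 1e-12)
        frames.append(20.0 * np.log10(rms + 1e-12))
    return np.stack(frames, axis=1)   # (2, n_frames), one loudness curve per channel
```

A model with per-channel LUFS conditioning would embed such curves alongside text/video tokens, giving explicit control over how loud each channel is at each moment.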

Causal and Real-Time Systems:

  • SoundReactor (Saito et al., 2 Oct 2025) implements fully causal, autoregressive, video-conditioned stereo generation, suitable for live or online settings. Stereo-VAE-encoded audio latents are predicted frame-wise by a Transformer with strict causality and low-latency guarantees (<33 ms per frame at 30 FPS); a schematic of such a per-frame loop is sketched below.
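
A causal, frame-synchronous generation loop of this kind can be outlined as follows; encode_frame, predict_next_latent, and decode_audio_latent are hypothetical placeholders for a per-frame vision encoder, a causal transformer step, and a stereo VAE decoder, and the 33 ms budget corresponds to 30 FPS operation.

```python
import time

FRAME_BUDGET_S = 1.0 / 30.0   # ~33 ms per video frame at 30 FPS

def generate_stream(video_frames, encode_frame, predict_next_latent, decode_audio_latent):
    """Causal per-frame stereo generation loop (schematic).

    The three callables are placeholders for a frame-level vision encoder,
    a causal transformer step, and a stereo VAE decoder. Only past frames
    and latents are visible at each step (strict causality).
    """
    history = []                                        # causal context of past audio latents
    for frame in video_frames:
        t0 = time.perf_counter()
        vis = encode_frame(frame)                       # per-frame visual tokens
        latent = predict_next_latent(history, vis)      # depends only on the past
        history.append(latent)
        left, right = decode_audio_latent(latent)       # one stereo audio chunk per video frame
        yield left, right
        elapsed = time.perf_counter() - t0
        if elapsed > FRAME_BUDGET_S:                    # falling behind real time
            print(f"frame over budget: {elapsed * 1e3:.1f} ms")
```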

Object-Aware Diffusion:

  • StereoFoley (Karchkhadze et al., 22 Sep 2025) incorporates object tracking and segmentation into a synthetic data pipeline, allowing the model to associate moving objects' tracks with corresponding panned and attenuated audio, resulting in true object-aware stereo generation and measurable bin-alignment scores.
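
The synthetic-data idea can be illustrated by converting a tracked object's normalized horizontal position into time-varying pan gains; the constant-power pan law and linear interpolation below are generic choices, not StereoFoley's actual pipeline.

```python
import numpy as np

def pan_from_track(mono: np.ndarray, track_x: np.ndarray, sr: int, fps: float = 30.0) -> np.ndarray:
    """Render mono audio to stereo following an object's horizontal track.

    track_x: per-video-frame horizontal position in [0, 1] (0 = left edge, 1 = right edge),
             e.g. the bounding-box centre from a tracker.
    Returns (2, n_samples) stereo audio with constant-power panning.
    """
    n = len(mono)
    # Upsample the per-frame track to per-sample pan positions.
    frame_times = np.arange(len(track_x)) / fps
    sample_times = np.arange(n) / sr
    pan = np.interp(sample_times, frame_times, np.clip(track_x, 0.0, 1.0))

    gain_l = np.cos(pan * np.pi / 2)   # constant-power pan law
    gain_r = np.sin(pan * np.pi / 2)
    return np.stack([gain_l * mono, gain_r * mono])
```

Pairing such renders with the source video yields (video, stereo audio) training pairs whose left/right imaging is guaranteed to follow the tracked object.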

A comparative summary of principal architectural features:

Model        | Core Backbone               | Stereo Mechanism           | Key Fusion/Conditioning
CCStereo     | U-Net with AVAD layers      | STFT difference mask       | Cross-modal cross-attention, visual de-normalization (AVAD)
StereoSync   | Latent diffusion (U-Net)    | VAE, direct stereo latents | Depth and bounding-box cross-attention
Tri-Ergon    | DiT (diffusion transformer) | 44.1 kHz stereo VAE        | LUFS, multi-modal cross-attention
Kling-Foley  | MM-DiT + FLUX               | Universal audio codec      | Joint RoPE cross-attention, mono-to-stereo rendering
SoundReactor | Causal AR + diffusion       | Stereo VAE (latents)       | DINOv2 frame features, AR fusion
StereoFoley  | Diffusion transformer       | Learned stereo codec       | Object-centric panning in synthetic data

3. Datasets and Benchmarks for Spatial AV Generation

Research progress is closely linked to the availability and structure of datasets and standardized evaluation protocols. Benchmarks referenced across this literature include FAIR-Play and MUSIC for visually guided stereo/binaural spatialization, VGGSound and object-centric derivatives for Foley-style generation, SVGSA24, SAVGBench, and MM-V2A.

The recent VABench framework (Hua et al., 10 Dec 2025) adds 15 evaluation dimensions, including a dedicated stereo track with nine reference-free spatial-imaging and signal-fidelity metrics, as well as QA pairs designed for semantic placement in stereo.
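
Reference-free stereo descriptors of the kind such a track targets can be approximated with standard mid/side and inter-channel statistics; the sketch below is illustrative and does not reproduce VABench's actual metric definitions.

```python
import numpy as np

def stereo_descriptors(stereo: np.ndarray) -> dict:
    """Simple reference-free stereo-imaging statistics for a (2, n_samples) signal."""
    left, right = stereo
    mid, side = 0.5 * (left + right), 0.5 * (left - right)

    width = np.sum(side ** 2) / (np.sum(mid ** 2) + 1e-12)   # 0 = mono, larger = wider image
    corr = np.corrcoef(left, right)[0, 1]                    # ~1 coherent, ~0 decorrelated, <0 phase problems
    mono_energy = np.sum((left + right) ** 2)
    mono_compat = mono_energy / (np.sum(left ** 2) + np.sum(right ** 2) + 1e-12)  # drops when channels cancel on downmix

    return {"stereo_width": float(width),
            "interchannel_correlation": float(corr),
            "mono_compatibility": float(mono_compat)}
```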

4. Evaluation Metrics for Spatial and Temporal Alignment

Robust assessment requires metrics that quantify not only audio and video fidelity but also spatial consistency. Key metrics reported across the cited works include:

  • Distribution-level audio quality: Fréchet audio distance (FAD/FD) and embedding-based variants such as FD_openl3.
  • Audio-visual correspondence and synchrony: AV-Align, IB, and DeSync.
  • Reference-based spatialization error: STFT distance, envelope (ENV) distance, SNR, and E-L1 on benchmarks such as FAIR-Play and MUSIC.
  • Stereo-specific measures: the bin-alignment score (BAS), stereo width, inter-channel phase coherence, and mono compatibility, as covered by VABench's reference-free stereo track.

Human listening studies complement these quantitative metrics, especially for perceptual realism, semantic source separation, and stereo plausibility (Li et al., 2023, Karchkhadze et al., 22 Sep 2025, Saito et al., 2 Oct 2025, Hua et al., 10 Dec 2025).
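
For reference-based spatialization benchmarks such as FAIR-Play, the commonly reported STFT and envelope distances can be sketched as below (applied per channel or to the left-right difference signal); window sizes and normalization conventions vary across papers, so the exact values here are illustrative.

```python
import numpy as np
from scipy.signal import stft, hilbert

def stft_distance(pred: np.ndarray, ref: np.ndarray, sr: int, nperseg: int = 512) -> float:
    """L2 distance between complex STFTs (window size and hop are illustrative choices)."""
    _, _, P = stft(pred, fs=sr, nperseg=nperseg)
    _, _, R = stft(ref, fs=sr, nperseg=nperseg)
    return float(np.sqrt(np.sum(np.abs(P - R) ** 2)))

def envelope_distance(pred: np.ndarray, ref: np.ndarray) -> float:
    """L2 distance between Hilbert envelopes, emphasizing temporal energy patterns."""
    env_p = np.abs(hilbert(pred))
    env_r = np.abs(hilbert(ref))
    return float(np.sqrt(np.sum((env_p - env_r) ** 2)))
```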

5. Key Advances, Limitations, and Lessons from Recent Models

Models such as CCStereo (Chen et al., 6 Jan 2025), SAGM (Li et al., 2023), and Sep-Stereo (Zhou et al., 2020) establish the necessity of fusing spatio-temporal video features at multiple scales, with conditional normalization and/or adversarial objectives to enforce spatial realism. Unified diffusion transformers (e.g. Kling-Foley (Wang et al., 24 Jun 2025), Tri-Ergon (Li et al., 29 Dec 2024)) extend to large-scale, multi-task, and multi-modal scenarios, leveraging universal stereo audio codecs, adaptive layer normalization, and explicit stereo rendering heads.

Evaluation on FAIR-Play, MUSIC, and SVGSA24 benchmarks demonstrates:

  • Object-aware stereo imaging is only reliably achieved with models that either (a) inject explicit object tracking and spatialization at training (e.g. synthetic data in StereoFoley (Karchkhadze et al., 22 Sep 2025)) or (b) employ fine-grained visual-convolutional conditioning (AVAD, cross-attention with localization features) (Chen et al., 6 Jan 2025, Marinoni et al., 7 Oct 2025).
  • General-purpose models (Veo3, Sora2, Wan2.5, Kling-Foley) can match and sometimes exceed human-level technical metrics (phase coherence, mono compatibility), but semantic left/right separation remains inconsistent in practice (Hua et al., 10 Dec 2025).
  • Stereo audio-video generation is more sensitive to spatial failings—such as mislocalization, excessive panning symmetry, or collapsed stereo width—than mono audio-video tasks, as evidenced by VABench analyses (Hua et al., 10 Dec 2025).

Current systems generally outperform mono or non-spatial baselines in quantitative alignment and perceptual MOS, but explicit spatial object-awareness is best achieved by task-specific data augmentation or synthetic generation pipelines (Karchkhadze et al., 22 Sep 2025). A plausible implication is that future breakthroughs will require richer annotated datasets, improved spatial supervision (e.g., HRTF/binaural rendering), and task-driven architectural innovations.

6. Open Challenges and Future Directions

VABench (Hua et al., 10 Dec 2025), SAVGBench (Shimada et al., 18 Dec 2024), and recent reviews articulate several outstanding technical gaps and research priorities:

  • Semantic Channel Separation: No current end-to-end system reliably generates distinct audio sources assigned to specific left/right visual positions in response to explicit conditioning (e.g., "sound A on left, sound B on right").
  • Spatial Width vs. Signal Fidelity Trade-off: Models with the highest stereo width often sacrifice signal integrity and vice versa; balancing these remains an open objective (Hua et al., 10 Dec 2025).
  • Explicit Geometric and Physical Modeling: Most architectures lack explicit spatial audio representations (e.g., angle-aware, ambisonic latents) or geometric constraints tying audio localization to detected object positions (Shimada et al., 18 Dec 2024).
  • Human Evaluation and Perceptual Testing: Incorporation of realistic binaural rendering, HRTFs, and personalized perceptual assessment (with headphones or VR) is needed to close the evaluation loop (Hua et al., 10 Dec 2025).
  • Benchmarking and QA: Dynamic QA-style evaluation—verifying that sound trajectories match video events, directions, and spatio-temporal cues—will enable more granular progress (Hua et al., 10 Dec 2025).
  • Live and Causal Generation: Real-time applications require causal, efficient architectures (e.g., SoundReactor (Saito et al., 2 Oct 2025)) without degrading spatial and semantic quality.
  • Long-Range Multi-Event Scenarios: Most state-of-the-art models operate on short clips (≤10 s). Scaling to long-duration, multi-object, and real-world scenes is necessary for deployment.

These directions are being actively pursued via synthetic data generation, improved synchronization modules (e.g., SynchFormer in Kling-Foley (Wang et al., 24 Jun 2025)), stereo-aware loss functions, and cross-modal geometry-aware scheduling (Marinoni et al., 7 Oct 2025, Karchkhadze et al., 22 Sep 2025).
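
As an example of what a stereo-aware loss can look like, the sketch below combines a per-channel reconstruction term with mid/side terms that up-weight errors in the side (left minus right) signal; the weighting is an illustrative assumption rather than a loss taken from any cited paper.

```python
import torch
import torch.nn.functional as F

def stereo_aware_loss(pred: torch.Tensor, target: torch.Tensor, side_weight: float = 2.0) -> torch.Tensor:
    """Per-channel + mid/side reconstruction loss for (B, 2, T) waveforms.

    The side (L - R) term is up-weighted so that spatial-imaging errors are
    penalized more than errors shared by both channels; side_weight is arbitrary.
    """
    per_channel = F.l1_loss(pred, target)
    mid_p, side_p = 0.5 * (pred[:, 0] + pred[:, 1]), 0.5 * (pred[:, 0] - pred[:, 1])
    mid_t, side_t = 0.5 * (target[:, 0] + target[:, 1]), 0.5 * (target[:, 0] - target[:, 1])
    return per_channel + F.l1_loss(mid_p, mid_t) + side_weight * F.l1_loss(side_p, side_t)
```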

7. Summary Table: Major Models and Their Benchmarking

Model                                         | Key Stereo Mechanism                     | Notable Benchmark Results                         | AV-Align / Spatial Alignment
CCStereo (Chen et al., 6 Jan 2025)            | AVAD, U-Net, cross-attention             | STFT 0.823, SNR 7.144 (FAIR-Play, 10-split)       | SOTA SNR and error on FAIR-Play, MUSIC
SAGM (Li et al., 2023)                        | Visually guided GAN                      | STFT 0.851, SNR 7.044 (FAIR-Play)                 | SPL-Distance correlates with MOS
Sep-Stereo (Zhou et al., 2020)                | APNet, source-separation fusion          | STFT 0.879, ENV 0.135 (FAIR-Play)                 | Unified multi-task with separation
StereoSync (Marinoni et al., 7 Oct 2025)      | Depth/box cross-attention, LDM backbone  | AV-Align 0.78, FAD 0.230, E-L1 0.047              | Robust spatial tracking with motion
Kling-Foley (Wang et al., 24 Jun 2025)        | MM-DiT + FLUX, universal stereo codec    | FD 7.60, IB 30.75, DeSync 0.43 (VGGSound)         | SOTA semantic alignment
Tri-Ergon (Li et al., 29 Dec 2024)            | DiT, per-channel LUFS control            | FD_openl3 113.21, AV-Align 0.231 (MM-V2A)         | Fine-grained loudness, 44.1 kHz stereo
StereoFoley (Karchkhadze et al., 22 Sep 2025) | Object-aware generation, synthetic data  | BAS 0.33, MOS 3.46 (VGG-obj)                      | SOTA object-stereo correspondence
SoundReactor (Saito et al., 2 Oct 2025)       | Causal AR + diffusion, real-time         | Competitive stereo FAD/MMD/FSAD, 26.3 ms latency  | Online/causal, meets real-time budget

This table encapsulates the core structural and performance dimensions of current research in stereo audio-video generation, providing direct links to both architectural innovation and quantitative benchmarks.
