Stereo Audio-Video Generation
- Stereo audio-video generation is the process of synthesizing spatially coherent, temporally aligned stereo audio from visual inputs, ensuring audio cues accurately reflect scene geometry.
- The field leverages advanced methods such as diffusion transformers, cross-attention U-Nets, and causal autoregressive models for immersive VR, telepresence, and multimedia applications.
- Key challenges include achieving semantic channel separation, balancing spatial width with signal fidelity, and developing robust evaluation metrics and datasets for spatial alignment.
Stereo audio-video generation refers to the set of computational methodologies and models that synthesize temporally synchronized, spatially coherent stereo or binaural audio tracks from visual inputs (video), often augmented by other modalities such as text or monaural audio. Unlike mono audio generation, which produces a single audio channel, stereo generation requires precise modeling of spatial audio cues (interaural level and time differences, localization, and panning) to ensure that auditory events in the generated audio are aligned with the spatial position and motion of sound-producing objects in the corresponding video frames. This enables immersive experiences in settings such as multimedia authoring, virtual reality, robotics, and telepresence systems.
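The interaural cues mentioned above can be made concrete with a short, self-contained sketch. The Python snippet below is an illustrative computation (not taken from any of the cited systems; all names are placeholders) of frame-wise interaural level difference and a crude cross-correlation-based time-difference estimate from a stereo waveform.

```python
import numpy as np

def interaural_cues(left, right, sr, frame_len=2048, hop=1024, eps=1e-8):
    """Frame-wise ILD (dB) and cross-correlation lag (s) for a stereo pair.

    Illustrative only: practical binaural analysis usually works per frequency
    band and with HRTF-aware models rather than broadband frames.
    """
    ilds, lags = [], []
    for start in range(0, len(left) - frame_len, hop):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        # Interaural level difference: channel energy ratio in dB (positive = louder left).
        ilds.append(10.0 * np.log10((np.sum(l ** 2) + eps) / (np.sum(r ** 2) + eps)))
        # Crude time-difference estimate: lag of the peak of the cross-correlation.
        lag_samples = np.argmax(np.correlate(l, r, mode="full")) - (frame_len - 1)
        lags.append(lag_samples / sr)
    return np.array(ilds), np.array(lags)

# Toy usage: a source panned hard left should yield a clearly positive mean ILD.
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
source = np.sin(2 * np.pi * 440.0 * t)
ild, itd = interaural_cues(0.9 * source, 0.2 * source, sr)
print(round(float(ild.mean()), 2), "dB,", float(itd.mean()), "s")
```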
1. Problem Definition and Research Landscape
Stereo audio-video generation formalizes a suite of tasks where, given video frames (and optionally text or mono audio), the model synthesizes left-right channel waveforms whose spatial cues correspond with the visual content. The field encompasses several problem settings:
- Monaural-to-stereo generation: e.g., "binaural audio generation" from mono audio guided by video (Chen et al., 6 Jan 2025, Li et al., 2023, Zhou et al., 2020).
- Direct video-to-stereo-audio (V2A): e.g., produce stereo sound from silent video (Li et al., 29 Dec 2024, Karchkhadze et al., 22 Sep 2025, Wang et al., 24 Jun 2025).
- Joint audio-video generation: e.g., generate both video and stereo audio from text or weak conditioning (Shimada et al., 18 Dec 2024, Zhao et al., 6 Feb 2025, Hua et al., 10 Dec 2025).
A central challenge is achieving robust spatial alignment, ensuring that the auditory scene presented to listeners matches the geometric and semantic properties of the visual scene. Misalignment (for example, sound emanating from a visually mismatched direction) breaks immersion and reduces plausibility—especially critical for applications in VR, telepresence, and robotic interaction (Shimada et al., 18 Dec 2024).
The landscape has evolved from early discriminative fusion models (e.g. U-Nets with visual branching (Zhou et al., 2020)) to generative adversarial approaches (Li et al., 2023), then to large-scale latent diffusion transformers and flow-matching architectures that can scale to long sequences, diverse sound modalities, and industrial-scale datasets (Li et al., 29 Dec 2024, Wang et al., 24 Jun 2025, Karchkhadze et al., 22 Sep 2025).
2. Model Architectures and Fusion Mechanisms
The architectural paradigm is dominated by multimodal encoders, latent diffusion backbones, and fusion blocks that align spatial cues across video and audio. Notable instantiations include:
Spatial-Audio Conditioned Diffusion Models:
- CCStereo (Chen et al., 6 Jan 2025) uses ResNet visual encoders, a U-Net audio encoder/decoder, and cross-attention-based fusion at the audio bottleneck. Its Audio-Visual Adaptive De-normalization (AVAD) layer modulates the U-Net's batch-normalization statistics according to visual context, injecting semantic audio-visual alignment directly into the feature decoding path (a simplified sketch of this conditional de-normalization idea follows this group).
- StereoSync leverages depth maps and object bounding-box sequences extracted from video, projecting them into the U-Net cross-attention layers of a frozen latent diffusion audio generator, significantly improving AV-Align metrics and spatial correspondence to moving objects (Marinoni et al., 7 Oct 2025).
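As a rough illustration of the conditional de-normalization idea referenced above, the PyTorch sketch below predicts per-channel scale and shift from a pooled visual embedding and applies them after normalizing the audio feature maps. It is a schematic reconstruction under simplifying assumptions, not the CCStereo implementation; the module and parameter names are invented for the example.

```python
import torch
import torch.nn as nn

class VisualAdaptiveDenorm(nn.Module):
    """Schematic visually conditioned de-normalization block (heavily simplified)."""

    def __init__(self, audio_channels: int, visual_dim: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(audio_channels, affine=False)
        self.to_gamma = nn.Linear(visual_dim, audio_channels)
        self.to_beta = nn.Linear(visual_dim, audio_channels)

    def forward(self, audio_feat: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, C, F, T) spectrogram-domain features
        # visual_emb: (B, D) pooled per-clip (or per-frame) visual embedding
        normalized = self.norm(audio_feat)
        gamma = self.to_gamma(visual_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(visual_emb).unsqueeze(-1).unsqueeze(-1)
        return normalized * (1.0 + gamma) + beta

# Toy usage with random tensors.
block = VisualAdaptiveDenorm(audio_channels=64, visual_dim=512)
out = block(torch.randn(2, 64, 32, 128), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 32, 128])
```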
Unified Multimodal Transformers:
- UniForm (Zhao et al., 6 Feb 2025) and Kling-Foley (Wang et al., 24 Jun 2025) employ single diffusion transformer backbones that process audio and video tokens in a joint space, with specialized conditioning tokens and noise schedules for each sub-task. Extension to stereo is achieved by parallel or concatenated latent representations for left/right channels, with cross-modal attention learning dependencies between all modalities.
- Tri-Ergon (Li et al., 29 Dec 2024) introduces fine-grained per-channel loudness control (LUFS embedding), supporting not only spatialization but precise temporal control over loudness trajectories in stereo outputs.
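To make per-channel loudness conditioning concrete, the sketch below measures integrated loudness separately for the left and right channels with the pyloudnorm library. This is only an illustration of how LUFS targets might be extracted from training audio, not the Tri-Ergon pipeline; the helper name is hypothetical.

```python
import numpy as np
import pyloudnorm as pyln  # pip install pyloudnorm

def per_channel_lufs(stereo: np.ndarray, sr: int):
    """Integrated loudness (LUFS) of each channel of a (samples, 2) stereo array.

    Hypothetical helper: a loudness-conditioned generator could take such values
    (or frame-wise loudness curves) as per-channel conditioning targets.
    """
    meter = pyln.Meter(sr)  # ITU-R BS.1770 loudness meter
    return (meter.integrated_loudness(stereo[:, 0]),
            meter.integrated_loudness(stereo[:, 1]))

# Toy usage: halving the right channel's amplitude should read roughly 6 LU quieter.
sr = 48000
t = np.linspace(0.0, 3.0, 3 * sr, endpoint=False)
tone = 0.1 * np.sin(2 * np.pi * 1000.0 * t)
left_lufs, right_lufs = per_channel_lufs(np.stack([tone, 0.5 * tone], axis=1), sr)
print(round(left_lufs, 2), round(right_lufs, 2))
```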
Causal and Real-Time Systems:
- SoundReactor (Saito et al., 2 Oct 2025) implements fully causal, autoregressive, video-conditioned stereo generation, suitable for live or online settings. Here, stereo-VAE-encoded audio latents are predicted frame-wise by a Transformer under strict causality, with low-latency guarantees (<33 ms per frame at 30 FPS), as sketched below.
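The real-time constraint can be illustrated with a minimal frame-synchronous loop: each incoming video frame is encoded, one causal decoding step emits a chunk of stereo latents, and the wall-clock time is checked against the ~33 ms per-frame budget implied by 30 FPS. The stub encoder and decoder below are placeholders, not SoundReactor components.

```python
import time
import numpy as np

FRAME_RATE = 30                    # video frames per second
FRAME_BUDGET_S = 1.0 / FRAME_RATE  # ~33.3 ms of compute available per frame

def stub_frame_encoder(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a frozen per-frame visual encoder (e.g. DINOv2-style features)."""
    return frame.mean(axis=(0, 1))  # (H, W, 3) -> (3,)

def stub_causal_audio_step(history: list, frame_feat: np.ndarray) -> np.ndarray:
    """Placeholder for one causal decoding step emitting a chunk of stereo latents."""
    return np.tanh(frame_feat[:2])[None, :].repeat(16, axis=0)  # (16 latent frames, 2 channels)

video = np.random.rand(90, 224, 224, 3).astype(np.float32)  # 3 s of dummy 30 FPS video
history: list = []
for i, frame in enumerate(video):
    t0 = time.perf_counter()
    latent_chunk = stub_causal_audio_step(history, stub_frame_encoder(frame))
    history.append(latent_chunk)  # a real system would also decode latents to waveform here
    elapsed = time.perf_counter() - t0
    assert elapsed < FRAME_BUDGET_S, f"frame {i} missed the {FRAME_BUDGET_S * 1e3:.1f} ms budget"
print("generated", len(history) * 16, "stereo latent frames causally")
```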
Object-Aware Diffusion:
- StereoFoley (Karchkhadze et al., 22 Sep 2025) incorporates object tracking and segmentation into a synthetic data pipeline, allowing the model to associate moving objects' tracks with corresponding panned and attenuated audio, resulting in true object-aware stereo generation and measurable bin-alignment scores.
A comparative summary of principal architectural features:
| Model | Core Backbone | Stereo Mechanism | Key Fusion/Conditioning |
|---|---|---|---|
| CCStereo | U-Net, AVAD Layer | STFT-Diff Mask, visual norm | Cross-modal CA, visual-normalization |
| StereoSync | Latent Diffusion (U-Net) | VAE, direct stereo latents | Depth & bbox CA |
| Tri-Ergon | DiT (diffusion transformer) | 44.1 kHz stereo VAE | LUFS, multi-modal CA |
| Kling-Foley | MM-DiT + FLUX | Universal audio codec | Joint RoPE CA, mono-to-stereo rendering |
| SoundReactor | Causal AR + diffusion | Stereo VAE (latent) | DINOv2 frame features, AR fusion |
| StereoFoley | Diffusion Transformer | Learned stereo codec | Object-centric, panning in synth data |
3. Datasets and Benchmarks for Spatial AV Generation
Research progress is closely linked to the availability and structure of datasets and standardized evaluation protocols.
- FAIR-Play: Professional binaural music recordings, comprising 1,871 ten-second clips with accompanying video. Widely used for stereo/binaural evaluation (Chen et al., 6 Jan 2025, Zhou et al., 2020, Li et al., 2023).
- MUSIC-Stereo, YT-Music: YouTube/ambisonic collections, often resynthesized to stereo via HRTF (Chen et al., 6 Jan 2025, Li et al., 2023).
- SVGSA24: Introduced in SAVGBench (Shimada et al., 18 Dec 2024), derived from STARSS23, explicitly aligns stereo audio with annotated sound event locations in video for true spatial AV ground-truth.
- VGGSound, MM-V2A, Kling-Audio-Eval: Large-scale, diverse cross-modal datasets supporting evaluation across event classes, content domains, variable duration, and stereo structure (Li et al., 29 Dec 2024, Wang et al., 24 Jun 2025).
- Walking The Maps: Game engine based, tracking movement and environmental sounds with clean visual trajectories for evaluating explicit spatial alignment (Marinoni et al., 7 Oct 2025).
The recent VABench framework (Hua et al., 10 Dec 2025) adds 15 evaluation dimensions, including a dedicated stereo track with nine reference-free spatial-imaging and signal-fidelity metrics, as well as QA pairs designed for semantic placement in stereo.
4. Evaluation Metrics for Spatial and Temporal Alignment
Robust assessment requires metrics that quantify not only audio and video fidelity, but spatial consistency. Key metrics include:
- STFT L₂ Distance, Envelope Distance, Magnitude/Phase Losses: Standard measures of waveform similarity between generated and ground-truth stereo channels (Chen et al., 6 Jan 2025, Li et al., 2023, Zhou et al., 2020, Karchkhadze et al., 22 Sep 2025).
- Signal-to-Noise Ratio (SNR): Typical in audio evaluation, with stereo/mono comparisons (Chen et al., 6 Jan 2025).
- Spatial AV-Align: Proposed by SAVGBench (Shimada et al., 18 Dec 2024), this metric quantifies the percentage of frames where detected sound event azimuth coincides with visually detected object locations.
- SPL-Perception/SPL-Distance: Measures the interaural level difference (ILD) over time, thereby capturing the perceptual realism of spatial separation (Li et al., 2023).
- Bin-Alignment Score (BAS): Used in StereoFoley, compares the binned horizontal position of object tracks in the video to the localization of the audio energy centroid; validated by strong correlation with human judgments (Karchkhadze et al., 22 Sep 2025). A simplified sketch of this energy-centroid idea appears after this list.
- VABench Stereophonic Metrics: Covers stereo width, imaging stability (ITD/ILD variance), envelope correlation, phase coherence (multi-band), mono compatibility, level stability, transient sync, and directional consistency (Hua et al., 10 Dec 2025).
- Distributional Audio-Visual and Semantic Metrics: Fréchet Audio Distance (FAD), Fréchet Video Distance (FVD), Fréchet AV Distance (FAVD), IB-score, KL divergence, and Inception Score monitor the alignment of synthesized audio/video distributions to real data (Li et al., 29 Dec 2024, Wang et al., 24 Jun 2025, Zhao et al., 6 Feb 2025, Marinoni et al., 7 Oct 2025).
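As an example of how such spatial-alignment scores can be operationalized, the sketch below computes a frame-wise stereo panning centroid and a binned agreement rate against a tracked object's horizontal position. This is a simplified stand-in for the published Spatial AV-Align and BAS definitions, which differ in detail; all function names and the three-bin layout are assumptions.

```python
import numpy as np

def panning_centroid(left: np.ndarray, right: np.ndarray,
                     frame_len: int = 2048, hop: int = 1024, eps: float = 1e-8) -> np.ndarray:
    """Frame-wise stereo panning centroid in [-1 (hard left), +1 (hard right)]."""
    centroids = []
    for start in range(0, len(left) - frame_len + 1, hop):
        el = np.sum(left[start:start + frame_len] ** 2)
        er = np.sum(right[start:start + frame_len] ** 2)
        centroids.append((er - el) / (er + el + eps))
    return np.array(centroids)

def binned_alignment(audio_pan: np.ndarray, object_x: np.ndarray, n_bins: int = 3) -> float:
    """Fraction of frames whose audio pan bin matches the object's horizontal bin.

    `object_x` holds normalized horizontal object positions in [-1, 1], resampled
    to the audio frame rate (e.g. from an off-the-shelf tracker).
    """
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    audio_bins = np.clip(np.digitize(audio_pan, edges) - 1, 0, n_bins - 1)
    object_bins = np.clip(np.digitize(object_x, edges) - 1, 0, n_bins - 1)
    n = min(len(audio_bins), len(object_bins))
    return float(np.mean(audio_bins[:n] == object_bins[:n]))

# Toy usage: a source panned right while the object stays in the right third of the frame.
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
source = np.sin(2 * np.pi * 440.0 * t)
pan = panning_centroid(0.2 * source, 0.9 * source)
print(binned_alignment(pan, np.full_like(pan, 0.7)))  # expected: 1.0
```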
Human listening studies complement these quantitative metrics, especially for perceptual realism, semantic source separation, and stereo plausibility (Li et al., 2023, Karchkhadze et al., 22 Sep 2025, Saito et al., 2 Oct 2025, Hua et al., 10 Dec 2025).
5. Key Advances, Limitations, and Lessons from Recent Models
Models such as CCStereo (Chen et al., 6 Jan 2025), SAGM (Li et al., 2023), and Sep-Stereo (Zhou et al., 2020) establish the necessity of fusing spatio-temporal video features at multiple scales, with conditional normalization and/or adversarial objectives to enforce spatial realism. Unified diffusion transformers (e.g. Kling-Foley (Wang et al., 24 Jun 2025), Tri-Ergon (Li et al., 29 Dec 2024)) extend to large-scale, multi-task, and multi-modal scenarios, leveraging universal stereo audio codecs, adaptive layer normalization, and explicit stereo rendering heads.
Evaluation on FAIR-Play, MUSIC, and SVGSA24 benchmarks demonstrates:
- Object-aware stereo imaging is only reliably achieved by models that either (a) inject explicit object tracking and spatialization during training (e.g. synthetic data in StereoFoley (Karchkhadze et al., 22 Sep 2025)) or (b) employ fine-grained visual conditioning of the audio decoder (AVAD, cross-attention with localization features) (Chen et al., 6 Jan 2025, Marinoni et al., 7 Oct 2025).
- General-purpose models (Veo3, Sora2, Wan2.5, Kling-Foley) can match and sometimes exceed human-level technical metrics (phase coherence, mono compatibility), but semantic left/right separation remains inconsistent in practice (Hua et al., 10 Dec 2025).
- Stereo audio-video generation is more sensitive to spatial failings—such as mislocalization, excessive panning symmetry, or collapsed stereo width—than mono audio-video tasks, as evidenced by VABench analyses (Hua et al., 10 Dec 2025).
Current systems generally outperform mono or non-spatial baselines in quantitative alignment and perceptual MOS, but explicit spatial object-awareness is best achieved by task-specific data augmentation or synthetic generation pipelines (Karchkhadze et al., 22 Sep 2025). A plausible implication is that future breakthroughs will require richer annotated datasets, improved spatial supervision (e.g., HRTF/binaural rendering), and task-driven architectural innovations.
6. Open Challenges and Future Directions
VABench (Hua et al., 10 Dec 2025), SAVGBench (Shimada et al., 18 Dec 2024), and recent reviews articulate several outstanding technical gaps and research priorities:
- Semantic Channel Separation: No current end-to-end system reliably generates distinct audio sources assigned to specific left/right visual positions in response to explicit conditioning (e.g., "sound A on left, sound B on right").
- Spatial Width vs. Signal Fidelity Trade-off: Models with the highest stereo width often sacrifice signal integrity, and vice versa; balancing the two remains an open objective (Hua et al., 10 Dec 2025). A mid/side sketch of this trade-off appears at the end of this section.
- Explicit Geometric and Physical Modeling: Most architectures lack explicit spatial audio representations (e.g., angle-aware, ambisonic latents) or geometric constraints tying audio localization to detected object positions (Shimada et al., 18 Dec 2024).
- Human Evaluation and Perceptual Testing: Incorporation of realistic binaural rendering, HRTFs, and personalized perceptual assessment (with headphones or VR) is needed to close the evaluation loop (Hua et al., 10 Dec 2025).
- Benchmarking and QA: Dynamic QA-style evaluation—verifying that sound trajectories match video events, directions, and spatio-temporal cues—will enable more granular progress (Hua et al., 10 Dec 2025).
- Live and Causal Generation: Real-time applications require causal, efficient architectures (e.g., SoundReactor (Saito et al., 2 Oct 2025)) without degrading spatial and semantic quality.
- Long-Range Multi-Event Scenarios: Most state-of-the-art models operate on short clips (≤10 s). Scaling to long-duration, multi-object, and real-world scenes is necessary for deployment.
These directions are being actively pursued via synthetic data generation, improved synchronization modules (e.g., SynchFormer in Kling-Foley (Wang et al., 24 Jun 2025)), stereo-aware loss functions, and cross-modal geometry-aware scheduling (Marinoni et al., 7 Oct 2025, Karchkhadze et al., 22 Sep 2025).
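One way to quantify the width side of the trade-off noted above is a mid/side decomposition: width as the side-to-mid energy ratio, and fidelity risk as the energy lost under a mono fold-down. The sketch below is an illustrative measure only, not VABench's metric definitions; the function names are placeholders.

```python
import numpy as np

def mid_side_width(left: np.ndarray, right: np.ndarray, eps: float = 1e-8) -> float:
    """Stereo width as the side-to-mid energy ratio (0 = mono, larger = wider)."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return float(np.sum(side ** 2) / (np.sum(mid ** 2) + eps))

def mono_fold_down_loss_db(left: np.ndarray, right: np.ndarray, eps: float = 1e-8) -> float:
    """Energy lost (dB) when folding down to mono; large losses signal phase cancellation."""
    stereo_energy = 0.5 * (np.sum(left ** 2) + np.sum(right ** 2))
    mono_energy = np.sum((0.5 * (left + right)) ** 2)
    return float(10.0 * np.log10((stereo_energy + eps) / (mono_energy + eps)))

# Toy usage: decorrelating the right channel widens the image but hurts mono fold-down.
rng = np.random.default_rng(0)
base = rng.standard_normal(48000)
narrow = (base, 0.95 * base + 0.05 * rng.standard_normal(48000))
wide = (base, -0.5 * base + 0.8 * rng.standard_normal(48000))
for name, (l, r) in (("narrow", narrow), ("wide", wide)):
    print(name, "width:", round(mid_side_width(l, r), 3),
          "mono loss:", round(mono_fold_down_loss_db(l, r), 2), "dB")
```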
7. Summary Table: Major Models and Their Benchmarking
| Model | Key Stereo Mechanism | Notable Benchmark Results | AV-Align/Spatial Alignment |
|---|---|---|---|
| CCStereo (Chen et al., 6 Jan 2025) | AVAD, U-Net, Cross-Attn | STFT 0.823, SNR 7.144 (FAIR-Play, 10-split) | SOTA SNR and error on FAIR-Play, MUSIC |
| SAGM (Li et al., 2023) | Visually guided GAN | STFT 0.851, SNR 7.044 (FAIR-Play) | SPL-Distance correlates with MOS |
| Sep-Stereo (Zhou et al., 2020) | APNet, source-separation fusion | STFT 0.879 / ENV 0.135 (FAIR-Play) | Unified multi-task with separation |
| StereoSync (Marinoni et al., 7 Oct 2025) | Depth/Box CA, LDM backbone | AV-Align 0.78, FAD 0.230, E-L1 0.047 | Robust spatial tracking w/ motion |
| Kling-Foley (Wang et al., 24 Jun 2025) | MM-DiT + FLUX, universal stereo codec | FD 7.60, IB 30.75, DeSync 0.43 (VGGSound) | SOTA semantic alignment |
| Tri-Ergon (Li et al., 29 Dec 2024) | DiT, LUFS per-channel control | FD_openl3 113.21, AV-Align 0.231 (MM-V2A) | Fine-grained loudness, 44.1 kHz stereo |
| StereoFoley (Karchkhadze et al., 22 Sep 2025) | Object-aware gen. + synthetic data | BAS 0.33, MOS 3.46 (VGG-obj) | SOTA object-stereo correspondence |
| SoundReactor (Saito et al., 2 Oct 2025) | Causal AR + diffusion, real-time | Stereo FAD/MMD/FSAD competitive, 26.3 ms latency | Online/causal, meets real-time budget |
This table encapsulates the core structural and performance dimensions of current research in stereo audio-video generation, providing direct links to both architectural innovation and quantitative benchmarks.