
Frame-Level Online V2A Generation

Updated 4 October 2025
  • The paper introduces SoundReactor, the first autoregressive model enforcing full causality for real-time frame-level video-to-audio synthesis.
  • It employs a causal decoder-only transformer with efficient visual and audio tokenization, achieving low per-frame latency (~26–31ms) and high-fidelity audio output.
  • Evaluation shows that the method matches offline sequence-based approaches on metrics like FAD and MMD while enabling interactive applications such as live content creation and generative world modeling.

Frame-level online video-to-audio (V2A) generation is a paradigm focused on synchronously synthesizing audio from video, where audio is generated autoregressively without access to future visual frames. This formulation enforces strict causality and enables applications requiring low-latency real-time content generation, such as live video post-production, interactive gaming, and generative world modeling. The defining characteristics of this field are causal modeling, efficient visual-to-audio tokenization, and architectural choices optimized for synchronization, semantic fidelity, and computational efficiency.

1. Defining Frame-Level Online V2A Generation

Frame-level online V2A generation refers to the causal generation of audio aligned to video such that each audio segment is produced with only the current and previous video frames available. This contrasts with traditional offline chunked or sequence-based V2A approaches, which assume future frames or entire video sequences as input. The main requirements include the following (a minimal generation-loop sketch appears after the list):

  • End-to-end causality: Future frames must not be used in generating audio for the current time step.
  • Low per-frame latency: Generation time per frame must be suitable for real-time or near-real-time applications (e.g., tens of milliseconds per frame).
  • Audio-visual synchronization: Temporal alignment between generated audio segments and corresponding video frames must be preserved.
  • Semantic-consistent stereo audio: Full-band stereo production is often required, maintaining high-fidelity semantics and spatialization.
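
To make these constraints concrete, the minimal sketch below shows the shape of such a frame-level loop: audio for each frame is emitted from the current frame plus a cache of past context only. The class and method names (OnlineV2AGenerator, step), the encoder stub, and the audio output are hypothetical placeholders, not SoundReactor's actual interface.

```python
# Minimal sketch of a frame-level online V2A loop (hypothetical interface,
# not SoundReactor's API): audio for frame i is produced from the current
# frame and cached past context only -- no future frames are touched.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class OnlineV2AGenerator:
    """Placeholder generator that keeps a strictly causal cache of context."""
    audio_samples_per_frame: int = 1600      # e.g. 48 kHz audio at 30 fps
    _past_tokens: list = field(default_factory=list)

    def step(self, frame: np.ndarray) -> np.ndarray:
        # Encode only the *current* frame (stand-in for a real encoder),
        # append it to the causal cache, and emit one frame-aligned chunk of
        # stereo audio conditioned on that cache alone.
        self._past_tokens.append(float(frame.mean()))
        rng = np.random.default_rng(len(self._past_tokens))   # deterministic stub
        return rng.standard_normal((2, self.audio_samples_per_frame)) * 0.01


gen = OnlineV2AGenerator()
for i in range(3):                                     # simulate a 3-frame stream
    frame = np.zeros((480, 854, 3), dtype=np.float32)  # one 480p RGB frame
    audio_chunk = gen.step(frame)                      # (2, 1600) stereo chunk
    print(i, audio_chunk.shape)
```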

The SoundReactor framework formalizes this task and introduces the first autoregressive model designed explicitly for causal, frame-level online V2A (Saito et al., 2 Oct 2025).

2. Model Architecture and Tokenization Strategies

SoundReactor employs a causal decoder-only transformer backbone that operates over continuous audio latents and vision tokens. The architecture comprises three major modules:

  • Vision Token Modeling: Each video frame is processed through a lightweight DINOv2 vision encoder to extract grid (patch) features. Temporal difference features (frame-wise grid subtraction) are concatenated with the current frame’s grid, and a shallow transformer aggregator reduces the spatial grid to a single token per frame. This ensures each vision token summarizes all patch-level semantics and short-term dynamics, maintaining causality and efficient computation.
  • Audio Token Modeling: Full-band stereo waveforms are encoded via a custom VAE into continuous latent representations (one token per frame), an approach the authors find superior to discrete RVQ codes for this task owing to better reconstruction quality and more streamlined autoregressive prediction.
  • Multimodal Autoregressive Transformer with Diffusion Head: The transformer takes interleaved sequences of previous audio latents and current vision tokens, predicting the next audio latent in a strictly left-to-right manner. For each frame $i$, the transformer's output $z_i$ forms a conditioning vector for a diffusion head, which iteratively denoises the audio latent via a deterministic reverse process (a step-level sketch follows the factorization below):

$$p(x_{1:n} \mid v_{1:n}) = \prod_{i=1}^{n} p(x_i \mid x_{<i},\, v_{\leq i})$$

enforcing that each audio latent depends only on past audio latents and on current and past vision tokens.
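
The sketch below illustrates one such autoregressive step under this factorization: a causally masked decoder reads interleaved vision tokens and past audio latents, and its last hidden state $z_i$ conditions a small denoising head. Widths, layer counts, the interleaving layout, and the denoising loop are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of one autoregressive step under the factorization above:
# a causally masked decoder reads interleaved [vision token, audio latent]
# pairs and its last hidden state z_i conditions a small denoising head.
# Widths, layer counts, interleaving layout, and the denoising loop are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

D = 256                                       # shared model width (assumed)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
diffusion_head = nn.Sequential(               # stand-in for the denoising MLP
    nn.Linear(D + D + 1, 512), nn.SiLU(), nn.Linear(512, D),
)

def ar_step(past_audio, vision_tokens, n_denoise_steps=1):
    """past_audio: (B, i-1, D) latents; vision_tokens: (B, i, D) -> (B, D)."""
    B = vision_tokens.shape[0]
    # Interleave as [v_1, x_1, ..., v_{i-1}, x_{i-1}, v_i]; the audio slot of
    # the newest frame is the one being predicted.
    pairs = torch.stack([vision_tokens[:, :-1], past_audio], dim=2)  # (B, i-1, 2, D)
    seq = torch.cat([pairs.reshape(B, -1, D), vision_tokens[:, -1:]], dim=1)
    causal = nn.Transformer.generate_square_subsequent_mask(seq.shape[1])
    z_i = backbone(seq, mask=causal)[:, -1]   # conditioning vector for frame i
    x = torch.randn(B, D)                     # start the latent from noise
    for step in range(n_denoise_steps):       # a few deterministic refinements
        t = torch.full((B, 1), 1.0 - step / n_denoise_steps)
        x = diffusion_head(torch.cat([x, z_i, t], dim=-1))
    return x

next_latent = ar_step(torch.randn(1, 3, D), torch.randn(1, 4, D))
print(next_latent.shape)                      # torch.Size([1, 256])
```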

3. Vision Conditioning and Causal Design Principles

Vision conditioning in SoundReactor diverges from conventional T2A or V2A paradigms relying on global scene summaries. Instead, grid (patch) features from DINOv2 (21M parameters) are processed to retain localized spatial information, while concatenated temporal differences inject explicit motion cues vital for aligning audio events (e.g., footsteps, impacts).

A transformer aggregator compresses these features into a causal vision token, minimizing latency. The conditioning preserves causality: only current and past frame tokens are available for prediction, and aggregation is limited to shallow layers to prevent future-frame leakage. This design is crucial for meeting online requirements, as non-causal attention or conditioning would introduce unacceptable latency and semantic drift.
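
A minimal sketch of this vision-token path follows, assuming a stand-in convolutional patch encoder in place of DINOv2 and mean pooling as the aggregation step; both are placeholders chosen so the snippet runs offline, not the paper's exact components.

```python
# Hedged sketch of the vision-token path: per-frame grid (patch) features, a
# temporal-difference channel for motion, and a shallow aggregator that pools
# the grid into one causal token per frame. A small convolution stands in for
# the frozen DINOv2 patch encoder; all sizes are illustrative.
import torch
import torch.nn as nn

D = 256

class FrameTokenizer(nn.Module):
    def __init__(self, d_model=D):
        super().__init__()
        self.encoder = nn.Conv2d(3, d_model, kernel_size=14, stride=14)  # DINOv2 stand-in
        self.proj = nn.Linear(2 * d_model, d_model)     # fuse [feat ; feat - prev]
        self.aggregator = nn.TransformerEncoder(        # shallow: a single layer
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.prev_grid = None                           # causal state: last frame only

    @torch.no_grad()
    def forward(self, frame):                           # frame: (1, 3, H, W)
        grid = self.encoder(frame).flatten(2).transpose(1, 2)          # (1, P, D)
        prev = self.prev_grid if self.prev_grid is not None else torch.zeros_like(grid)
        feats = self.proj(torch.cat([grid, grid - prev], dim=-1))      # motion cue
        self.prev_grid = grid                           # cache for the next frame
        return self.aggregator(feats).mean(dim=1)       # one token per frame: (1, D)

tokenizer = FrameTokenizer()
for _ in range(2):                                      # two consecutive frames
    print(tokenizer(torch.rand(1, 3, 224, 224)).shape)  # torch.Size([1, 256])
```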

4. Training and Optimization Strategies

The training regime consists of two stages:

  • Diffusion Pretraining: The model is first trained using denoising score matching (DSM) loss in an autoregressive framework. Each step involves predicting clean audio latents from their noisy counterparts conditioned on previous audio and corresponding vision tokens. The objective is:

$$L = \mathbb{E}_{x^0, t, \epsilon} \left[ \lambda(t)\, \exp\!\big(-u_\theta(t)\big) \sum_i \big\| x_i^0 - D_\theta(x_i^t, t, z_i) \big\|^2 + u_\theta(t) \right]$$

where $D_\theta$ is the diffusion head, $z_i$ is the transformer's output, $x_i^t$ is the noisy latent at timestep $t$, and $u_\theta(t)$ is a learned per-timestep uncertainty weight.

  • Consistency Fine-Tuning (Easy Consistency Tuning, ECT): The diffusion head is further fine-tuned as a consistency model. ECT trains the head to map noisy latents at different timesteps to the clean latent efficiently—achieving single-step or few-step denoising. The annealing scheduling method ensures that models converge toward fast inference without statistical collapse.

These stages allow SoundReactor to maintain high audio quality with extremely low per-frame computational latency (approximately 26–31 ms per frame on a single H100 GPU with NFE = 1–4); a toy version of the stage-one objective is sketched below.
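
The toy sketch below shows how the stage-one objective combines the exp(−u_θ(t))-weighted squared error with the +u_θ(t) term; the linear noise schedule, constant λ(t), and tiny placeholder networks are assumptions chosen only to make the structure runnable.

```python
# Toy sketch of the stage-one denoising objective above: squared error on the
# clean latent, weighted by exp(-u_theta(t)), plus the +u_theta(t) term. The
# linear noise schedule, constant lambda(t), and placeholder networks are
# assumptions for illustration only.
import torch
import torch.nn as nn

D = 256
diffusion_head = nn.Sequential(nn.Linear(D + D + 1, 512), nn.SiLU(), nn.Linear(512, D))
u_theta = nn.Sequential(nn.Linear(1, 64), nn.SiLU(), nn.Linear(64, 1))  # uncertainty net

def dsm_loss(clean_latents, z, lam=lambda t: torch.ones_like(t)):
    """clean_latents, z: (B, N, D) audio latents and transformer outputs z_i."""
    t = torch.rand(clean_latents.shape[0], clean_latents.shape[1], 1)   # timesteps
    noise = torch.randn_like(clean_latents)
    noisy = (1 - t) * clean_latents + t * noise           # toy linear schedule
    pred = diffusion_head(torch.cat([noisy, z, t], dim=-1))
    sq_err = ((clean_latents - pred) ** 2).sum(dim=-1, keepdim=True)
    u = u_theta(t)                                        # learned u_theta(t)
    return (lam(t) * torch.exp(-u) * sq_err + u).mean()

loss = dsm_loss(torch.randn(2, 8, D), torch.randn(2, 8, D))
loss.backward()                                           # gradients for both nets
print(float(loss))
```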

5. Evaluation Metrics and Benchmarking

Performance is measured using both objective and subjective metrics:

  • Objective Metrics: Fréchet Audio Distance (FAD), Maximum Mean Discrepancy (MMD) using OpenL3 and LAION-CLAP, FSAD for stereo fidelity, KL-divergence (PaSST-based), alignment metrics (ImageBind, DeSync), and panning accuracy (a minimal FAD computation is sketched after this list).
  • Subjective Metrics: Human listening tests (MUSHRA-style) for audio quality, spatial accuracy, and synchronization.
  • Latency Benchmarks: End-to-end (head-to-tail) per-frame latency at 30 FPS on 480p videos, with practical benchmarks (e.g., the AAA gameplay video dataset OGameData250K and VGGSound).
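
As a concrete reference for the first of these metrics, the sketch below computes the Fréchet distance between Gaussians fitted to reference and generated embedding sets, which is the core of FAD. It is a generic formulation assuming precomputed embeddings (e.g., from OpenL3 or CLAP), not the paper's evaluation pipeline.

```python
# Generic Fréchet distance between Gaussians fitted to reference and generated
# audio embeddings -- the core of FAD. Random arrays stand in for embeddings
# from a real model such as OpenL3 or CLAP; this is not the paper's code.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """ref_emb, gen_emb: (num_clips, embedding_dim) embedding matrices."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):                 # drop tiny imaginary residue
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
ref = rng.standard_normal((500, 128))            # placeholder reference embeddings
gen = rng.standard_normal((500, 128)) + 0.1      # placeholder generated embeddings
print(round(frechet_distance(ref, gen), 3))
```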

SoundReactor empirically demonstrates low FAD, MMD, and precise alignment scores across diverse gaming and general V2A benchmarks, matching or exceeding offline sequence-based methods while retaining real-time causal operation.

6. Applications and Implications

Frame-level online V2A generation enables interactive multimedia workflows, such as:

  • Live content creation: Audio synthesized in lockstep with video for streaming, live performance, and virtual staging.
  • Generative world modeling: Autonomous generative agents that produce synchronized audiovisual simulations, critical for robotics, game AI, and immersive environments.
  • Interactive gaming and VR: Synchronous sound effects responding to video without pre-buffered audio, enhancing realism and presence.

SoundReactor’s design directly supports these use cases by prioritizing low-latency, high-fidelity, causally aligned audio generation (Saito et al., 2 Oct 2025).

7. Future Research Directions

As the paradigm shifts toward fully online, interactive multimodal generation, future work may address:

  • Memory and compute optimizations: Strategies for further reducing inference latency (e.g., quantization, pruning).
  • Cross-modal semantic emphasis: Enhanced conditioning from multimodal signals (text, gestures) to control timbral or spatial cues.
  • Generalization to other modalities: Extension of frame-level online causality to text, motion, and sensor fusion domains, leveraging autoregressive transformers and efficient latent diffusion heads.

Continued benchmarking and the development of causal multimodal datasets (including synchronized full-band stereo test suites) will further refine evaluation standards. As interactive applications proliferate, frameworks conforming to strict causality and low latency similar to SoundReactor are likely to become widespread.


Frame-level online V2A generation, exemplified by causal architectures such as SoundReactor, defines a new standard for synchronous, efficient, and semantically aligned audio synthesis from video, supporting the demands of real-time, interactive, and generative content creation.
