- The paper introduces an autoregressive, diffusion-based framework that generates high-quality, full-band stereo audio from video frames with low per-frame latency in an online setting.
- It employs a causal vision encoder with temporal differencing and a continuous audio VAE to ensure precise audio-visual synchronization and efficient token modeling.
- Experimental results show superior performance over offline baselines on AAA gameplay data, achieving real-time operation with high semantic and temporal alignment.
SoundReactor: Frame-level Online Video-to-Audio Generation
Introduction and Motivation
The paper introduces the frame-level online video-to-audio (V2A) generation task, a setting where audio must be generated autoregressively from video frames as they arrive, without access to future frames. This is in contrast to the conventional offline V2A paradigm, which assumes the entire video sequence or large chunks are available in advance. The online constraint is critical for interactive applications such as live content creation, real-time generative world models, and robotics, where low-latency, causal, and temporally aligned audio generation is required.
Figure 1: The frame-level online V2A task restricts the model to only past and current frames, unlike the offline setting where the entire video is available.
The authors propose SoundReactor, a framework explicitly designed for this online V2A setting. The design enforces end-to-end causality, targets low per-frame latency, and aims for high-quality, semantically and temporally aligned full-band stereo audio generation.
Model Architecture
SoundReactor consists of three main components: (a) video token modeling, (b) audio token modeling, and (c) a multimodal autoregressive (AR) transformer with a diffusion head.
Figure 2: Overview of SoundReactor, showing the video token modeling, audio token modeling, and the multimodal AR transformer with diffusion head.
Video Token Modeling
- Utilizes a lightweight DINOv2 vision encoder to extract grid (patch) features from each frame.
- Temporal cues are injected by concatenating each frame's grid features with their difference from the previous frame's features.
- Features are projected, flattened, and aggregated via a shallow transformer to yield a single token per frame.
- This approach is fully causal and efficient, as it does not require future frames or non-causal attention (a minimal sketch follows this list).
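The following is a minimal sketch of such a causal video tokenizer, assuming a PyTorch setting. The class name, feature dimensions, the ViT-S/14 DINOv2 variant loaded via `torch.hub`, and the learned summary token are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CausalVideoTokenizer(nn.Module):
    """Turn each incoming frame into a single token using only past/current frames.

    Sketch only: a frozen DINOv2 backbone extracts grid (patch) features, temporal
    cues come from the difference to the previous frame's features, and a shallow
    transformer pools the patch grid into one token per frame.
    """

    def __init__(self, feat_dim=384, model_dim=512, n_layers=2, n_heads=8):
        super().__init__()
        # Lightweight DINOv2 backbone (ViT-S/14 assumed here), kept frozen.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        self.backbone.requires_grad_(False)
        # Project [features ; temporal difference] into the model dimension.
        self.proj = nn.Linear(2 * feat_dim, model_dim)
        # Shallow transformer that aggregates the patch grid into a single token.
        layer = nn.TransformerEncoderLayer(model_dim, n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.summary = nn.Parameter(torch.zeros(1, 1, model_dim))
        self.prev_feats = None  # streaming state: features of the previous frame

    @torch.no_grad()
    def extract_grid(self, frame):
        # frame: (B, 3, H, W) -> patch features (B, N_patches, feat_dim)
        return self.backbone.forward_features(frame)["x_norm_patchtokens"]

    def forward(self, frame):
        feats = self.extract_grid(frame)
        # Temporal differencing; the very first frame gets a zero difference.
        prev = self.prev_feats if self.prev_feats is not None else feats
        diff = feats - prev
        self.prev_feats = feats.detach()
        x = self.proj(torch.cat([feats, diff], dim=-1))
        # Prepend a learned summary token and pool the grid. Attention is only
        # within the current frame, so causality across frames holds by design.
        x = torch.cat([self.summary.expand(x.size(0), -1, -1), x], dim=1)
        return self.aggregator(x)[:, 0]  # (B, model_dim): one token per frame
```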
Audio Token Modeling
- Employs a VAE trained from scratch to encode 48kHz stereo waveforms into continuous-valued audio latents at 30Hz.
- Continuous latents are preferred over discrete tokenization (e.g., RVQ) for higher reconstruction quality and simplified AR modeling, as only one latent per frame is predicted.
- A decoder-only, LLaMA-style transformer receives interleaved, frame-aligned audio and video tokens.
- The diffusion head, following the MAR paradigm, models the conditional distribution of the next audio latent given past audio and current/past video tokens (see the sketch after this list).
- The transformer backbone uses RMSNorm, SwiGLU, and RoPE for positional encoding.
- The diffusion head is accelerated via Easy Consistency Tuning (ECT), enabling few-step or even single-step sampling at inference.
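A hedged sketch of this backbone-plus-head arrangement: a decoder-only transformer runs over interleaved, frame-aligned video and audio tokens, and a small MLP diffusion head denoises the next audio latent conditioned on the backbone's hidden state. The plain `nn.TransformerEncoder` below stands in for the LLaMA-style stack (RMSNorm, SwiGLU, RoPE), and all dimensions and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class ARBackboneWithDiffusionHead(nn.Module):
    """Decoder-only backbone over interleaved video/audio tokens plus a small
    MLP diffusion head that denoises the next audio latent (MAR-style sketch).
    """

    def __init__(self, model_dim=768, latent_dim=64, n_layers=12, n_heads=12):
        super().__init__()
        self.audio_in = nn.Linear(latent_dim, model_dim)
        # Stand-in for the LLaMA-style stack (RMSNorm, SwiGLU, RoPE).
        layer = nn.TransformerEncoderLayer(model_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Diffusion head: denoises a noisy audio latent conditioned on the
        # backbone's hidden state and a (scalar) noise-level input.
        self.head = nn.Sequential(
            nn.Linear(latent_dim + model_dim + 1, 1024), nn.SiLU(),
            nn.Linear(1024, 1024), nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def condition(self, video_tokens, audio_latents):
        # video_tokens: (B, T, model_dim), audio_latents: (B, T, latent_dim).
        # Interleave frame-aligned tokens as [v_1, a_1, v_2, a_2, ...].
        a = self.audio_in(audio_latents)
        x = torch.stack([video_tokens, a], dim=2).flatten(1, 2)  # (B, 2T, D)
        L = x.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        h = self.backbone(x, mask=causal)
        # The hidden state at each video position conditions the *next* audio
        # latent (it sees v_1..v_t and a_1..a_{t-1} only).
        return h[:, 0::2]  # (B, T, model_dim)

    def denoise(self, noisy_latent, sigma, cond):
        # One denoising call of the head; sampling iterates this a few times
        # (or once, after ECT fine-tuning).
        sigma = sigma.expand(noisy_latent.size(0), 1)
        return self.head(torch.cat([noisy_latent, cond, sigma], dim=-1))
```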
Training and Inference
Training proceeds in two stages:
- Diffusion Pretraining: The model is trained with a next-token prediction objective using a denoising score matching (DSM) loss under the EDM2 framework; a hedged form of this loss is given after this list. The AR transformer and diffusion head are jointly optimized.
- Consistency Fine-tuning (ECT): The model is further fine-tuned to accelerate the diffusion head, reducing the number of function evaluations (NFEs) required for sampling while maintaining sample quality.
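For reference, the EDM-style denoising objective used in this kind of pretraining takes roughly the following form (the exact weighting and preconditioning in the paper may differ):

$$
\mathcal{L}(\theta)=\mathbb{E}_{t,\;\sigma,\;\mathbf{n}\sim\mathcal{N}(\mathbf{0},\,\sigma^{2}\mathbf{I})}\Big[\lambda(\sigma)\,\big\|D_{\theta}\big(\mathbf{a}_{t}+\mathbf{n};\,\sigma,\,\mathbf{h}_{t}\big)-\mathbf{a}_{t}\big\|_{2}^{2}\Big],
$$

where $\mathbf{a}_{t}$ is the clean audio latent at frame $t$, $\mathbf{h}_{t}$ is the AR transformer's hidden state summarizing past audio and current/past video tokens, $D_{\theta}$ is the diffusion head's denoiser, and $\lambda(\sigma)$ is the noise-level weighting.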
At inference, the model autoregressively generates audio latents one frame at a time, conditioned only on past and current video frames. Classifier-Free Guidance (CFG) is applied at the transformer level for controllable conditioning strength.
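Putting the pieces together, a hedged sketch of the per-frame inference loop might look as follows. CFG is applied at the transformer level by running a conditional pass and an unconditional pass (video tokens replaced by a learned null token, an assumption here) and blending the head's outputs with a guidance scale `w`; `tokenizer` and `model` refer to the illustrative modules sketched earlier, and `decoder` is a hypothetical causal VAE decoder.

```python
import torch

@torch.no_grad()
def generate_frame(tokenizer, model, decoder, frame, past_video, past_audio,
                   null_video, w=2.5, n_steps=2, sigma_max=80.0):
    """Generate the audio latent (and waveform chunk) for one incoming frame.

    Hypothetical helpers: `tokenizer` and `model` are the sketches above,
    `decoder` a causal VAE decoder, `null_video` a learned null token for the
    unconditional CFG branch. For brevity the whole prefix is re-encoded each
    frame; a real streaming implementation would reuse a KV-cache instead.
    """
    v = tokenizer(frame).unsqueeze(1)                       # (B, 1, D)
    video = torch.cat([past_video, v], dim=1)               # append current frame
    # Pad the audio side with a zero placeholder for the yet-unknown latent;
    # the causal mask keeps the current video position from attending to it.
    pad = torch.zeros(video.size(0), 1, past_audio.size(-1), device=video.device)
    audio = torch.cat([past_audio, pad], dim=1)

    h_cond = model.condition(video, audio)[:, -1]
    h_uncond = model.condition(null_video.expand_as(video), audio)[:, -1]

    # Few-step Euler-style sampling of the next audio latent.
    latent = sigma_max * torch.randn(video.size(0), past_audio.size(-1),
                                     device=video.device)
    sigmas = torch.linspace(sigma_max, 0.0, n_steps + 1, device=video.device)
    for i in range(n_steps):
        s = sigmas[i].reshape(1)
        d_c = model.denoise(latent, s, h_cond)
        d_u = model.denoise(latent, s, h_uncond)
        denoised = d_u + w * (d_c - d_u)                    # CFG blend
        latent = denoised + (sigmas[i + 1] / sigmas[i]) * (latent - denoised)

    audio_chunk = decoder(latent.unsqueeze(1))              # waveform for this frame
    return audio_chunk, video, torch.cat([past_audio, latent.unsqueeze(1)], dim=1)
```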
Experimental Results
Dataset and Evaluation
- Experiments are conducted on OGameData250K, a large-scale dataset of AAA gameplay videos with 48kHz stereo audio.
- The primary evaluation is on 8-second clips, with additional tests on 16-second sequences for context window extension.
- Metrics include FAD, MMD, KL divergence (PaSST), FSAD, IB-Score, DeSync, and subjective listening tests.
SoundReactor outperforms the offline AR baseline V-AURA on all objective metrics except DeSync, and achieves strong subjective ratings in human evaluations for audio quality, semantic alignment, temporal alignment, and stereo panning.
Ablation Studies
- Diffusion Head Size: Larger diffusion heads are necessary for high-dimensional audio latents; small heads fail to produce valid audio.
- ECT Fine-tuning: Fine-tuning the entire network during ECT yields the best results, but fine-tuning only the diffusion head is still effective.
- CFG Scale: Optimal performance is achieved with CFG scales between 2.0 and 3.0.
- Vision Conditioning: Temporal differencing of grid features is critical for audio-visual synchronization; PCA compression of grid features can be used without loss of performance (see the sketch below).
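As an illustration of the kind of PCA compression mentioned in the last ablation, the sketch below fits a low-rank basis on a sample of grid features with `torch.pca_lowrank` and projects each frame's grid onto it before temporal differencing; the dimensions and the fit-on-a-sample procedure are assumptions, not the paper's recipe.

```python
import torch

def fit_pca(grid_feats, k=64):
    """Fit a low-rank basis on pooled grid features.

    grid_feats: (N, feat_dim) patch features sampled from many frames.
    Returns the feature mean and the top-k principal directions (feat_dim, k).
    """
    mean = grid_feats.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(grid_feats - mean, q=k)
    return mean, V

def compress(grid_feats, mean, V):
    """Project per-frame grid features (B, N_patches, feat_dim) down to k dims."""
    return (grid_feats - mean) @ V

# Hypothetical usage: fit on a sample of DINOv2 patch features, then compress
# each frame's grid before temporal differencing and projection.
sample = torch.randn(10_000, 384)        # stand-in for pooled DINOv2 features
mean, V = fit_pca(sample, k=64)
frame_grid = torch.randn(2, 256, 384)    # (B, N_patches, feat_dim)
reduced = compress(frame_grid, mean, V)  # -> (2, 256, 64)
```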
Implementation Considerations
- Computational Requirements: Training requires 8×H100 GPUs, with ~36 hours for diffusion pretraining and ~20 hours for ECT fine-tuning.
- Model Size: The full model is 320M parameters (250M transformer, 70M diffusion head), with a 157M parameter VAE.
- Inference: Efficient due to lightweight vision encoder, AR transformer with KV-cache, and ECT-accelerated diffusion head.
- Deployment: Real-time operation is feasible on a single H100 GPU; causal VAE decoding is required for streaming applications (a rough latency-budget check is sketched below).
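As a rough sanity check on the real-time requirement, with one audio latent per frame at 30 Hz (as stated above) the model has a budget of about 33 ms per frame; the snippet below just spells out that arithmetic, with a hypothetical measured latency.

```python
# Per-frame latency budget when emitting one audio latent per frame at 30 Hz.
FRAME_RATE_HZ = 30
budget_ms = 1000.0 / FRAME_RATE_HZ  # ~33.3 ms available per frame

def realtime_headroom(per_frame_latency_ms: float) -> float:
    """Fraction of the per-frame budget consumed; values below 1.0 are real-time."""
    return per_frame_latency_ms / budget_ms

# Hypothetical measurement: 20 ms/frame would consume ~60% of the budget.
print(f"budget {budget_ms:.1f} ms, usage {realtime_headroom(20.0):.2f}")
```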
Implications and Future Directions
SoundReactor establishes a new paradigm for online, frame-level V2A generation, enabling interactive and real-time multimodal applications. The framework demonstrates that high-quality, temporally aligned, full-band stereo audio can be generated causally with low latency, making it suitable for live content creation, generative world models, and robotics.
The use of continuous audio latents and causal vision conditioning represents a significant shift from prior offline, chunk-based, or non-causal approaches. The successful application of ECT for diffusion head acceleration in the AR setting is notable, as it enables practical deployment without sacrificing quality.
Future work should address:
- Scaling to longer context windows and minute- to hour-scale generation.
- Incorporating larger or more semantically rich vision encoders while maintaining causality and efficiency.
- Improving the fidelity of causal VAE decoders for streaming.
- Extending to more diverse real-world datasets and tasks beyond gaming.
Conclusion
SoundReactor provides a principled and effective solution to the frame-level online V2A generation problem, achieving high-quality, low-latency, and causally aligned audio generation. The framework's architectural choices, training strategies, and empirical results set a new standard for interactive multimodal generative models and open new avenues for research in real-time audio-visual synthesis.