- The paper introduces AAPT, which converts latent video diffusion models into one-step, real-time autoregressive generators reaching 24 fps at 736×416 on a single H100 GPU.
- It employs block causal attention with a KV cache to cut per-frame computation, and student-forcing adversarial training to control error accumulation over long sequences.
- Experimental results on the VBench-I2V benchmark show competitive quality at low latency, making the model well suited to interactive applications.
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
The paper "Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation" addresses the challenge of making video generation models efficient and responsive enough for real-time interactive use. Rather than running a computationally intensive multi-step diffusion process at inference time, the proposed method applies autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into an efficient one-step video generator.
The significance of this work lies in its ability to generate video frames in real time, achieving a throughput of 24 frames per second (fps) at a resolution of 736×416 on a single H100 GPU. This is accomplished through an autoregressive design in which a single neural function evaluation (1 NFE) produces each latent frame. Such efficiency is critical for applications like interactive game engines and world simulators, where low latency and synchrony with user inputs are paramount.
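To make the one-NFE-per-frame loop concrete, here is a minimal PyTorch sketch. The `OneStepGenerator` class, its `(noise, context)` interface, and the single tensor standing in for the KV cache are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of 1-NFE-per-frame autoregressive generation (hypothetical
# interfaces; the paper's model, conditioning, and cache layout differ).
import torch

class OneStepGenerator(torch.nn.Module):
    """Stand-in for the distilled one-step generator: one forward pass
    (1 NFE) maps noise plus cached context to the next latent frame."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.proj = torch.nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, noise, context):
        # Real model: a transformer with block causal attention over the
        # KV cache of all previously generated latent frames.
        return self.proj(torch.cat([noise, context], dim=-1))

@torch.no_grad()
def generate(model, first_frame_latent, num_frames: int):
    frames = [first_frame_latent]
    context = first_frame_latent           # stands in for the KV cache
    for _ in range(num_frames - 1):
        noise = torch.randn_like(first_frame_latent)
        nxt = model(noise, context)        # exactly one NFE per frame
        frames.append(nxt)
        context = nxt                      # the real cache keeps growing
    return torch.stack(frames, dim=1)      # (batch, time, latent_dim)

gen = OneStepGenerator()
video = generate(gen, torch.randn(1, 16), num_frames=8)
print(video.shape)  # torch.Size([1, 8, 16])
```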
Core Methodology
The primary contribution of this paper is AAPT, which restructures the generation process around autoregressive and adversarial techniques. Unlike conventional video diffusion models, which denoise an entire clip jointly using full bidirectional attention, this model generates latent frames one at a time using block causal attention: tokens attend bidirectionally within their own frame but only causally to earlier frames. A key-value (KV) cache stores the attention states of previously generated frames, so each new frame requires only incremental computation rather than reprocessing the whole sequence. A sketch of this attention pattern follows.
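Here is a small, self-contained sketch of such a mask, assuming the token sequence is laid out as consecutive per-frame blocks of equal size; the paper's exact block structure may differ.

```python
# A minimal sketch of a block causal attention mask: bidirectional within
# a frame, causal across frames (assumed per-frame block layout).
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """True = attention allowed. Query token i may attend key token j
    iff frame(j) <= frame(i)."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_idx.unsqueeze(1) >= frame_idx.unsqueeze(0)

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```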
The paper also addresses error accumulation over extended sequences, a common failure mode of autoregressive models. The authors adopt a student-forcing paradigm within the adversarial framework: during training, the generator conditions on its own previously generated frames rather than on ground-truth frames, as teacher forcing would. This removes the discrepancy between training and inference, exposing accumulated errors to the adversarial loss so the model learns to suppress them over long video sequences.
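The sketch below shows the general shape of one such training step, reusing the hypothetical generator interface from the earlier sketch; the discriminator interface and the plain GAN loss are simplified stand-ins for the paper's actual adversarial objective.

```python
# A minimal sketch of one student-forcing adversarial update (hypothetical
# discriminator interface; the paper's objective is more elaborate).
import torch
import torch.nn.functional as F

def student_forcing_step(gen, disc, real_video, opt_g, opt_d):
    b, t, d = real_video.shape
    # Student forcing: roll out by conditioning on the model's OWN outputs,
    # matching inference-time behavior and exposing error accumulation.
    frames, context = [real_video[:, 0]], real_video[:, 0]
    for _ in range(t - 1):
        context = gen(torch.randn(b, d), context)
        frames.append(context)
    fake_video = torch.stack(frames, dim=1)  # (b, t, d)

    # Discriminator update: real sequences -> 1, generated sequences -> 0.
    d_loss = (
        F.binary_cross_entropy_with_logits(disc(real_video), torch.ones(b, 1))
        + F.binary_cross_entropy_with_logits(disc(fake_video.detach()),
                                             torch.zeros(b, 1))
    )
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: the whole self-conditioned rollout must fool the
    # discriminator, so gradients penalize drift that builds up over time.
    g_loss = F.binary_cross_entropy_with_logits(disc(fake_video),
                                                torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```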
Experimental Evaluation
The proposed model is evaluated on the VBench-I2V benchmark in both short and extended generation settings. On the standard 120-frame task and an extended 1440-frame task (one minute at 24 fps), it matches or exceeds the quality of state-of-the-art methods such as CausVid, Wan2.1, and MAGI-1, while requiring significantly less computation and lower latency, underscoring its suitability for real-time applications.
The authors further showcase their model's capabilities in interactive applications such as pose-conditioned human video generation and camera-controlled world exploration. The model's adaptability to different interactive controls affirms its versatility and robustness in dynamic environments.
Implications and Future Directions
This work has significant implications for video generation, particularly for applications requiring real-time interaction such as immersive gaming, live streaming, and interactive simulation. Reducing computational cost without compromising video quality is a crucial advance, potentially enabling adoption on consumer-level hardware.
Future research could explore the integration of this autoregressive architecture with other modalities for multimodal content generation, as well as enhancements in model robustness against a wider range of interactive inputs. Moreover, while the paper focuses on video generation efficiency, further exploration into preserving high fidelity in generated content over even longer durations could be beneficial.
In conclusion, AAPT offers a substantial step forward in video generation technology, laying a strong foundation for innovations in real-time, video-based applications. The model's balance of performance, quality, and efficiency makes it a valuable tool for the expanding field of interactive digital media.