- The paper presents NOVA, an autoregressive model that bypasses vector quantization to enhance video generation fidelity and efficiency.
- It reformulates video creation as dual temporal frame-by-frame and spatial set-by-set predictions, leveraging bidirectional transformer insights.
- NOVA achieves strong performance with a 0.6B-parameter model, scoring 80.1 on VBench and generating at 2.75 fps on an NVIDIA A100 GPU.
Autoregressive Video Generation without Vector Quantization: An Examination of NOVA
The paper under review introduces NOVA, a novel autoregressive model for video generation that stands apart from prior work by eschewing vector quantization. Instead, NOVA performs non-quantized temporal and spatial autoregressive modeling. The approach is notable for extending autoregressive LLM principles to the visual domain without relying on a discrete token space, thereby sidestepping an inherent limitation of vector-quantized tokenizers: the difficult trade-off between reconstruction fidelity and compression efficiency.
Methodological Insights
The authors propose reformulating video generation as a dual problem of frame prediction and spatial prediction, split into:
- Temporal Frame-by-Frame Prediction: Frames are modeled causally, each conditioned on its predecessors, much as an LLM predicts the next word, which preserves the in-context abilities characteristic of such models. NOVA processes videos frame by frame, integrating frames over time while maintaining autoregressive dependencies between them.
- Spatial Set-by-Set Prediction: Within each frame, token sets are predicted in random order, leveraging bidirectional transformer context for efficiency without sacrificing generation fidelity.
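The interplay of the two schemes can be illustrated as a generation schedule. The sketch below is purely illustrative (the function name and set-partitioning details are assumptions, not the paper's implementation): frames are visited causally, while token positions inside each frame are shuffled and revealed set by set, each set decodable with bidirectional context from the sets already predicted.

```python
import random

def generation_schedule(num_frames, tokens_per_frame, sets_per_frame, seed=0):
    """Illustrative ordering only. Temporal prediction is causal
    (frame t conditions on frames 0..t-1); spatial prediction within a
    frame proceeds over randomly ordered token sets, each decoded with
    bidirectional context from sets revealed earlier in that frame."""
    rng = random.Random(seed)
    schedule = []
    for t in range(num_frames):  # causal over time, like next-word prediction
        positions = list(range(tokens_per_frame))
        rng.shuffle(positions)  # random spatial order within the frame
        size = -(-tokens_per_frame // sets_per_frame)  # ceiling division
        sets = [positions[i:i + size]
                for i in range(0, tokens_per_frame, size)]
        for s, token_set in enumerate(sets):
            schedule.append({"frame": t, "step": s,
                             "tokens": sorted(token_set)})
    return schedule

# Two frames of 8 tokens each, revealed in 4 sets per frame:
plan = generation_schedule(num_frames=2, tokens_per_frame=8, sets_per_frame=4)
```

Each schedule entry records which frame, which spatial step, and which token positions are predicted, making the causal-over-time / random-within-frame structure explicit.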
Noteworthy is the design inspired by MAR for image generation, which NOVA extends to video without transitioning to discrete token representations. NOVA thus operates on continuous inputs, as image diffusion models do, while retaining autoregressive context handling.
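To make the continuous-token idea concrete, here is a minimal sketch of a MAR-style training objective, under stated assumptions: the `denoiser` callable, the cosine noise schedule, and all names are hypothetical illustrations of the general technique (a per-token denoising loss replacing codebook cross-entropy), not NOVA's actual implementation.

```python
import math
import random

def per_token_diffusion_loss(z, cond, denoiser, num_steps=1000, seed=0):
    """Sketch of a MAR-style objective on continuous tokens.
    Each token vector z (a list of floats) is supervised by a denoising
    loss rather than cross-entropy over a discrete codebook: sample a
    timestep, corrupt z with Gaussian noise, and ask a small network
    (here the hypothetical `denoiser`) to predict that noise,
    conditioned on the autoregressive context vector `cond`."""
    rng = random.Random(seed)
    t = rng.randint(1, num_steps)                     # random timestep
    a = math.cos(0.5 * math.pi * t / num_steps) ** 2  # cosine schedule
    eps = [rng.gauss(0.0, 1.0) for _ in z]            # target noise
    z_noisy = [math.sqrt(a) * zi + math.sqrt(1 - a) * ei
               for zi, ei in zip(z, eps)]
    eps_pred = denoiser(z_noisy, cond, t)             # predict the noise
    # Mean-squared error between predicted and true noise:
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(z)

# Toy denoiser that always predicts zero noise, just to exercise the loss:
loss = per_token_diffusion_loss([0.0] * 16, None,
                                lambda zn, c, t: [0.0] * 16)
```

Because the supervision signal stays in continuous space, no codebook is learned and no quantization error is introduced, which is the fidelity-versus-compression tension the paper targets.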
NOVA is reported to surpass existing autoregressive video models in data efficiency, inference speed, visual quality, and video fluency, despite its relatively small size of 0.6 billion parameters. Numerically, the model achieves a VBench score of 80.1 and a generation speed of 2.75 frames per second on an NVIDIA A100 GPU. This demonstrates its practical viability, and for text-to-image tasks it reportedly outperforms even diffusion-based contemporaries on the GenEval benchmark.
Implications and Future Directions
The implications of this research are significant at the intersection of computer vision and generative AI. An autoregressive sequence construction that eschews vector quantization signals a potential shift toward models offering both high fidelity and computational efficiency, without the trade-off quantization typically imposes. This approach charts a path forward for generative models across diverse media types.
Practically, the model promises video generation that scales not only in quality but in adaptability, supporting tasks from real-time video synthesis to extended-duration outputs, with consequences for film, virtual reality, and real-time graphics in interactive applications.
Theoretically, NOVA contributes to discussions about the foundational architectures underpinning generative AI, suggesting that advances in the autoregressive paradigm may continue to offer meaningful improvements over diffusion models, especially when broadened from text to incorporate rich, spatially and temporally dynamic datasets. Future developments will likely focus on further model optimizations and enhancements based on these new autoregressive paradigms, potentially exploring larger datasets and further scaling model size to fully harness the in-context abilities demonstrated by NOVA.
In conclusion, NOVA marks an important step toward more efficient and effective autoregressive video generation, potentially setting a benchmark for future research that marries high-fidelity output with computational efficiency.