
Autoregressive Video Generation without Vector Quantization (2412.14169v2)

Published 18 Dec 2024 in cs.CV

Abstract: This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost. Additionally, NOVA generalizes well across extended video durations and enables diverse zero-shot applications in one unified model. Code and models are publicly available at https://github.com/baaivision/NOVA.

Summary

  • The paper presents NOVA, an autoregressive model that bypasses vector quantization to enhance video generation fidelity and efficiency.
  • It reformulates video creation as dual temporal frame-by-frame and spatial set-by-set predictions, leveraging bidirectional transformer insights.
  • NOVA achieves impressive performance with a 0.6B parameter model, scoring 80.1 on VBench and processing 2.75 fps on high-end GPUs.

Autoregressive Video Generation without Vector Quantization: An Examination of NOVA

The paper under review introduces NOVA, an autoregressive video generation model that stands apart from prior work by eschewing vector quantization. Instead, NOVA performs non-quantized temporal and spatial autoregressive modeling. The approach is notable for extending the autoregressive paradigm of LLMs to the visual domain without relying on a discrete token space, thereby addressing an inherent limitation of vector-quantized tokenizers: the difficult trade-off between reconstruction fidelity and compression efficiency.

Methodological Insights

The authors propose reformulating video generation as a dual problem of frame prediction and spatial prediction, split into:

  1. Temporal Frame-by-Frame Prediction: Frames are generated causally, akin to an LLM predicting the next word, which preserves the in-context abilities characteristic of such models. NOVA processes videos frame by frame, integrating frames over time while maintaining the autoregressive dependency between them.
  2. Spatial Set-by-Set Prediction: Within each frame, token sets are predicted in a random order, leveraging the efficiency of bidirectional transformers without sacrificing high-fidelity generation.
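The two-level factorization above can be sketched as a nested generation loop. The sketch below is purely illustrative: the function and class names are hypothetical stand-ins (not the authors' API), and a trivial averaging function replaces the actual transformer backbone, so only the control flow — causal over frames, random-order sets within a frame — reflects the paper's scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_video(num_frames, sets_per_frame, set_dim=4):
    """Toy sketch of NOVA's two-level autoregressive loop: frames are
    produced causally in time, while each frame's token sets are filled
    in a random order, conditioning bidirectionally on the sets already
    predicted. All computations are placeholders for the real model."""
    frames = []
    for t in range(num_frames):
        # Temporal step: a causal context summarizing all previous frames
        # (stand-in for text + past-frame conditioning).
        context = np.mean(frames, axis=(0, 1)) if frames else np.zeros(set_dim)
        frame = np.zeros((sets_per_frame, set_dim))
        filled = np.zeros(sets_per_frame, dtype=bool)
        # Spatial step: predict token sets in a random order.
        for s in rng.permutation(sets_per_frame):
            # Stand-in for the bidirectional transformer: condition on the
            # temporal context and every set predicted so far in this frame.
            frame[s] = context + filled.sum() + rng.standard_normal(set_dim)
            filled[s] = True
        frames.append(frame)
    return np.stack(frames)  # shape: (num_frames, sets_per_frame, set_dim)

video = generate_video(num_frames=3, sets_per_frame=4)
```

The key structural point is that inter-frame dependencies remain strictly causal (enabling in-context extension to longer videos), while intra-frame prediction is order-agnostic and parallelizable across the remaining sets.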

Noteworthy is the use of a design inspired by MAR for image generation, which NOVA extends into the video space without transitioning to discrete token representations. This enables NOVA to maintain a continuous input representation akin to image diffusion models but with the added benefit of autoregressive context handling.
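In MAR-style continuous-token modeling, the discrete softmax over a codebook is replaced by a per-token denoising objective. The sketch below, a simplification under assumed conventions (cosine noise schedule, a placeholder `denoiser` callable rather than the paper's actual diffusion head), shows the general shape of such a loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(z, cond, denoiser, num_steps=1000):
    """Simplified MAR-style objective on continuous tokens: rather than a
    cross-entropy over a discrete codebook, each continuous token z is
    supervised by a noise-prediction MSE, with the autoregressive
    backbone's output `cond` conditioning a small denoising head.
    `denoiser` is an illustrative stand-in, not the paper's exact module."""
    t = int(rng.integers(1, num_steps))            # random noise level
    alpha_bar = np.cos(0.5 * np.pi * t / num_steps) ** 2  # cosine schedule
    eps = rng.standard_normal(z.shape)             # noise to be predicted
    z_t = np.sqrt(alpha_bar) * z + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = denoiser(z_t, t, cond)              # predict the added noise
    return float(np.mean((eps_pred - eps) ** 2))   # per-token MSE

# Toy usage with a trivial "denoiser" that always predicts zero noise:
z = rng.standard_normal((16, 8))                   # 16 continuous tokens
cond = rng.standard_normal((16, 8))                # backbone conditioning
loss = diffusion_loss(z, cond, lambda z_t, t, c: np.zeros_like(z_t))
```

Because the tokens stay continuous end to end, the tokenizer never has to discretize its latents, which is the mechanism by which NOVA sidesteps the fidelity/compression trade-off of vector quantization.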

Performance and Results

NOVA is reported to surpass existing autoregressive video models in data efficiency, inference speed, visual quality, and video fluency, despite its relatively small size of 0.6 billion parameters. Numerically, the model achieves a VBench score of 80.1 and a generation speed of 2.75 frames per second on an NVIDIA A100 GPU. This demonstrates its practical viability in high-performance settings, and in text-to-image generation it reportedly outperforms even diffusion-based contemporaries on the GenEval benchmark.

Implications and Future Directions

The implications of this research are vast for the field of computer vision's intersection with AI. The novel autoregressive sequence construction eschewing vector quantization signifies a potential shift towards models that can offer both high fidelity and computational efficiency without the trade-off typically involved in quantization. This approach reflects a path forward for AI models dealing with generative tasks across diverse media types.

Practically, the model promises efficient video generation that scales in both quality and adaptability, supporting tasks from real-time video synthesis to extended-duration outputs, with implications for film, virtual reality, and interactive real-time graphics.

Theoretically, NOVA contributes to discussions about the foundational architectures underpinning generative AI, suggesting that advances in the autoregressive paradigm may continue to offer meaningful improvements over diffusion models, especially when broadened from text to incorporate rich, spatially and temporally dynamic datasets. Future developments will likely focus on further model optimizations and enhancements based on these new autoregressive paradigms, potentially exploring larger datasets and further scaling model size to fully harness the in-context abilities demonstrated by NOVA.

In conclusion, NOVA marks an important step toward more efficient and effective autoregressive video generation, potentially setting a benchmark for future research that marries high-fidelity output with computational efficiency.
