- The paper introduces a semi-autoregressive model that predicts video blocks concurrently, significantly improving generation speed and computational efficiency.
- It leverages bidirectional attention within blocks to capture complex spatial and temporal dependencies, achieving lower FVD scores than comparable NTP baselines on UCF-101 and K600.
- Scaling model parameters further enhances generation quality, demonstrating potential for real-time video processing and broader multimedia applications.
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
This paper introduces a novel semi-autoregressive (semi-AR) framework called Next Block Prediction (NBP) for video generation. Unlike traditional autoregressive (AR) models, which rely on Next-Token Prediction (NTP) and decode one token at a time, this method predicts all tokens of the next block concurrently, addressing significant inefficiencies of token-by-token generation.
Research Context and Motivation
The field of video generation has lagged behind image generation because of the complexity of temporal dependencies. Traditional AR approaches based on NTP suffer from computational inefficiency, stemming from strictly unidirectional token dependencies and an excessive number of generation steps at inference time. This research addresses these limitations by shifting from token-level to block-level prediction, which permits bidirectional attention for within-block computation.
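To make the shift concrete, the two objectives can be written as factorizations over tokens versus blocks. This is a minimal formulation in our own notation (the block size K is an illustrative symbol, not taken verbatim from the paper):

```latex
% Next-Token Prediction: T sequential steps for T tokens
p(x) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right)

% Next-Block Prediction: T/K sequential steps for blocks of K tokens;
% tokens inside a block are predicted in parallel with bidirectional
% attention, while attention across blocks remains causal
p(x) = \prod_{k=1}^{T/K} p\left(b_k \mid b_{<k}\right),
\qquad b_k = \left(x_{(k-1)K+1}, \ldots, x_{kK}\right)
```

Cutting the number of sequential steps from T to T/K is what drives the inference speedup reported below.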
Methodology
The proposed NBP framework divides video content into uniform blocks (e.g., rows of tokens or entire frames). Moving from tokens to blocks changes how dependencies are handled: all tokens within a block are predicted concurrently, and the framework employs bidirectional attention within each block while remaining causal across blocks. This captures complex spatial dependencies more robustly and reduces the number of required generation steps, accelerating inference.
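As an illustration of this within-block bidirectional, across-block causal pattern, here is a minimal PyTorch sketch of such an attention mask. This is our own construction rather than the authors' code, and `num_blocks` and `block_size` are illustrative parameters:

```python
import torch

def block_causal_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    Tokens attend bidirectionally to every token in their own block
    and causally to all tokens in earlier blocks.
    """
    n = num_blocks * block_size
    block_id = torch.arange(n) // block_size           # block index of each token
    # query i may attend to key j iff j's block does not come after i's block
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

# Example: 3 blocks of 4 tokens -> a 12x12 mask whose 4x4 diagonal
# tiles are fully True (bidirectional within a block) and whose
# upper-triangular tiles are False (no attention to future blocks).
mask = block_causal_mask(num_blocks=3, block_size=4)
print(mask.int())
```

With a standard causal mask, only the strictly lower triangle of each diagonal tile would be allowed; relaxing it to the full tile is what lets a whole block be predicted in one step.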
The NBP framework achieves FVD (Fréchet Video Distance; lower is better) scores of 103.3 on UCF-101 and 25.5 on K600, outperforming comparable NTP baselines. Notably, NBP generates 8.89 frames per second at 128×128 resolution, an 11× speedup over token-by-token decoding.
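For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos (typically I3D embeddings). Below is a minimal sketch of the metric itself, assuming pre-extracted feature arrays; this is not the paper's evaluation code:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_*: arrays of shape (num_videos, feature_dim), e.g. I3D embeddings.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r cov_g)^{1/2})
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```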
Experimental Framework
The empirical analysis uses the UCF-101 and Kinetics-600 (K600) datasets, with model sizes ranging from 700M to 3B parameters. The experiments show that larger models yield better generation quality, confirming favorable scaling behavior: FVD improves from 103.3 to 55.3 on UCF-101 and from 25.5 to 19.5 on K600 as parameter count increases.
Comparison with Existing Approaches
This work differentiates itself by bringing the efficiency of semi-autoregressive decoding to video generation, in contrast with frameworks such as GANs and diffusion models, which carry their own computational and training burdens. While GANs and diffusion models excel in specific contexts, NBP's design, tailored to the temporal structure of video data, sets it apart in scenarios that demand fast inference.
Implications and Future Work
From a practical standpoint, the NBP framework carries significant implications for multimedia processing by enabling faster, closer-to-real-time video generation. Theoretically, it opens avenues for applying semi-autoregressive modeling in other domains where strict token-by-token dependencies limit efficiency.
Future research could explore integrating pre-trained large language models (LLMs) into the NBP framework, potentially enhancing cross-modal generation tasks. Furthermore, refining the block-attention mechanism or exploring adaptive block-sizing strategies might yield further gains in generation fidelity and efficiency.
In conclusion, this paper positions the NBP framework as a compelling alternative for video generation, balancing efficiency with quality. As multi-modal AI continues to expand, the principles demonstrated here may well guide future developments across multimedia applications.