Overview of "Loong: Generating Minute-level Long Videos with Autoregressive LLMs"
The paper "Loong: Generating Minute-level Long Videos with Autoregressive LLMs" introduces an advanced framework for producing extended duration videos, utilizing autoregressive LLMs. This research progresses beyond the typical focus on generating short video clips, offering insights into overcoming challenges associated with crafting minute-level videos.
Key Contributions
The authors propose a novel video generation model, Loong, which leverages autoregressive LLMs to synthesize videos from text descriptions. A critical innovation is the unified modeling of text and video tokens, allowing for autoregressive next-token prediction using decoder-only transformers. The framework integrates a causal 3D CNN-based video tokenizer, effectively compressing video frames into discrete tokens.
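To illustrate the unified token modeling, here is a minimal PyTorch sketch assuming the text prompt and the tokenized video are concatenated into one sequence and trained with next-token prediction; `toy_lm`, the loss mask that supervises only video tokens, and all shapes are hypothetical stand-ins rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def training_step(lm, text_tokens, video_tokens):
    """lm maps token ids (B, T) to next-token logits (B, T, vocab_size).

    text_tokens:  (B, T_text)  ids from a text tokenizer
    video_tokens: (B, T_video) ids from a causal 3D CNN video tokenizer
    """
    # One flat sequence: the text prompt conditions the video tokens.
    tokens = torch.cat([text_tokens, video_tokens], dim=1)
    logits = lm(tokens[:, :-1])          # predict token t from tokens < t
    targets = tokens[:, 1:]

    # Supervise only the video part; the prompt acts as conditioning.
    mask = torch.zeros_like(targets, dtype=torch.bool)
    mask[:, text_tokens.shape[1] - 1:] = True
    return F.cross_entropy(logits[mask], targets[mask])

if __name__ == "__main__":
    vocab, dim = 1000, 64
    embed, head = torch.nn.Embedding(vocab, dim), torch.nn.Linear(dim, vocab)
    toy_lm = lambda ids: head(embed(ids))      # stand-in for the transformer
    text = torch.randint(0, vocab, (2, 16))
    video = torch.randint(0, vocab, (2, 128))
    print(training_step(toy_lm, text, video).item())
```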
Challenges Addressed
Two main issues are identified as impediments to minute-level video generation: imbalanced training loss and error accumulation during inference. To address these:
- Imbalanced Loss: Early-frame tokens incur a much higher training loss than later ones, because later frames become progressively easier to predict from the accumulated context. The proposed remedy is a progressive short-to-long training methodology paired with a loss re-weighting scheme that keeps the large early-frame loss from dominating training, so the dynamics of later frames are still learned (a sketch of one possible weighting follows this list).
- Error Accumulation: Because each frame is conditioned on previously generated tokens, small prediction errors compound over long rollouts. To mitigate this, the authors introduce video token re-encoding and a refined sampling strategy, reducing the distribution shift between training and inference and limiting error propagation.
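To make the loss re-weighting concrete, the following PyTorch sketch down-weights first-frame tokens so their large loss does not dominate the gradient. The weighting function, the `early_weight` value, and the token-per-frame layout are illustrative assumptions rather than the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def reweighted_video_loss(logits, targets, tokens_per_frame, early_weight=0.1):
    """logits: (B, T, V) video-token logits; targets: (B, T) video-token ids.

    First-frame tokens get a smaller weight so that their naturally large
    loss does not dominate training, leaving gradient signal for the
    inter-frame dynamics carried by later frames.
    """
    B, T, V = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, V), targets.reshape(-1), reduction="none"
    ).reshape(B, T)

    frame_idx = torch.arange(T, device=logits.device) // tokens_per_frame
    weights = torch.where(
        frame_idx == 0,
        torch.full((T,), early_weight, device=logits.device),
        torch.ones(T, device=logits.device),
    )
    return (per_token * weights).sum() / (weights.sum() * B)
```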
Methodology
Progressive Training Strategy
A staged training regimen is employed: the model is first pretrained on images, then trained on short video clips, and finally on clips of roughly 10 seconds. This gradual curriculum lets the model adapt to longer-range dependencies and increasingly complex video dynamics.
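The curriculum can be pictured as a simple stage schedule; the clip lengths and step counts below are placeholders rather than the paper's actual configuration, and `train_one_step` / `sample_batch` are hypothetical hooks supplied by the surrounding training loop.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    num_frames: int   # 1 frame == image pretraining
    steps: int

# Placeholder schedule: images -> short clips -> ~10-second clips.
CURRICULUM = [
    Stage("image_pretraining", num_frames=1,  steps=100_000),
    Stage("short_clips",       num_frames=17, steps=80_000),
    Stage("long_clips_10s",    num_frames=65, steps=60_000),
]

def run_curriculum(train_one_step, sample_batch):
    """Both callables are hypothetical hooks from the surrounding code."""
    for stage in CURRICULUM:
        for _ in range(stage.steps):
            batch = sample_batch(stage.num_frames)  # clips of this stage's length
            train_one_step(batch)
```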
Token Re-encoding and Sampling
During inference, the tokens of the most recently generated frames are decoded and re-encoded by the video tokenizer before serving as conditioning for the next segment, which aligns their distribution with that of tokens from real frames and improves coherence across segments. A top-k sampling strategy balances diversity and quality, curbing error accumulation without sacrificing motion variability.
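A rough sketch of how one generation window might be extended with token re-encoding and top-k sampling is shown below; `lm_logits`, `encode`, and `decode` are hypothetical hooks standing in for the transformer and the video tokenizer, and the window size and `k` are illustrative choices.

```python
import torch

def top_k_sample(logits, k=100):
    """Sample only from the k most likely tokens, trading a little diversity
    for a lower chance of picking low-probability, error-prone tokens."""
    top_vals, top_idx = torch.topk(logits, k, dim=-1)
    choice = torch.multinomial(torch.softmax(top_vals, dim=-1), 1)
    return top_idx.gather(-1, choice).squeeze(-1)

def extend_video(lm_logits, encode, decode, tokens, window, new_tokens, k=100):
    """tokens: (B, T) tokens generated so far; returns the extended sequence."""
    # Re-encode the overlap: decode the last frames to pixels and tokenize
    # them again so the conditioning matches the distribution the tokenizer
    # produces for real frames, reducing train/inference drift.
    context = encode(decode(tokens[:, -window:]))
    for _ in range(new_tokens):
        next_logits = lm_logits(context)[:, -1]          # (B, vocab)
        nxt = top_k_sample(next_logits, k).unsqueeze(1)  # (B, 1)
        context = torch.cat([context, nxt], dim=1)
    return torch.cat([tokens, context[:, window:]], dim=1)
```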
Super-Resolution Post-Processing
The framework adds a super-resolution and refinement module as a post-processing step, upscaling the low-resolution outputs of the autoregressive model into higher-fidelity frames.
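As a post-processing illustration, the sketch below upscales decoded frames one by one; `F.interpolate` is only a stand-in for the dedicated super-resolution and refinement model, which would be passed in through the optional `refine` hook.

```python
import torch
import torch.nn.functional as F

def upscale_video(frames, scale=4, refine=None):
    """frames: (T, C, H, W) float tensor of decoded low-resolution frames.
    refine: optional callable wrapping the actual SR/refinement model."""
    out = []
    for frame in frames:
        up = F.interpolate(frame.unsqueeze(0), scale_factor=scale,
                           mode="bicubic", align_corners=False)
        out.append(refine(up) if refine is not None else up)
    return torch.cat(out, dim=0)   # (T, C, H*scale, W*scale)
```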
Results and Impact
Experimental results show that Loong generates minute-long videos with consistent visual fidelity and dynamic content that align closely with the text prompts. In user studies on long-form video generation, the model outperforms existing approaches, including the diffusion-based StreamingT2V.
Scalability and Versatility
Scaling experiments with models of up to 7B parameters confirm that larger architectures handle the generative task more capably. The Loong framework also proves versatile across generation scenarios, including zero-shot adaptation to dense text prompts.
Implications and Future Directions
The work carries broader implications for both practical applications and the theoretical understanding of multimodal LLMs. Practically, it offers improved tools for content creation in media and entertainment. Theoretically, it provides insights into extending LLM capabilities to more complex, multi-modal tasks. Future research could build on Loong by further exploring tokenization techniques and cross-modal integration, expanding the reach of autoregressive models into more diverse domains involving extensive temporal dependencies.
In conclusion, Loong represents a significant step forward in long video generation, displaying potential for both applied and foundational advancements in AI-mediated video synthesis. The techniques introduced could inspire further exploration into the scalable application of LLMs across different modalities and temporal scopes.