- The paper introduces LARP, a novel video tokenization method using learned queries and an autoregressive generative prior to enhance video generation for AR models.
- LARP achieves state-of-the-art video generation performance, reaching an FVD score of 57 on the UCF101 benchmark and surpassing previous methods.
- The method uses learned queries for flexible token counts and shows that optimizing the discrete latent space is crucial for AR generative quality, independent of reconstruction fidelity.
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
The paper introduces LARP, a novel approach to video tokenization designed specifically for autoregressive (AR) generative models, with the goal of improving the quality of generated video. Unlike traditional patchwise tokenizers, which directly encode local visual patches into discrete tokens, LARP adopts a holistic tokenization strategy: a set of learned queries captures global, semantic representations of the video content. By integrating a lightweight AR transformer as a generative prior during training, LARP aligns its latent space with the downstream AR generation task, addressing a key mismatch in existing video tokenization methods.
Central to LARP's methodology is the decoupling of discrete tokens from fixed input patches. Because the token count is determined by the number of learned queries rather than by the patch grid, it can be chosen flexibly, trading representation compactness and efficiency against fidelity. The learned queries allow LARP to produce high-fidelity, semantic video tokens at varying sequence lengths without inheriting the fixed token ordering imposed by traditional patch-to-token models; a minimal sketch of this learned-query design follows below.
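The sketch below is a hypothetical illustration (module and parameter names such as `HolisticTokenizer` and `num_queries` are assumptions, not the paper's implementation) of how learned queries decouple the token count from the patch grid: a fixed set of learned query embeddings cross-attends to the video's patch embeddings, so the number of output tokens is set by the query count.

```python
# Hypothetical sketch of learned-query tokenization (not the authors' code):
# learned query embeddings cross-attend to video patch embeddings, so the
# token count is set by num_queries, not by the size of the patch grid.
import torch
import torch.nn as nn

class HolisticTokenizer(nn.Module):
    def __init__(self, num_queries=256, dim=512, num_layers=4, num_heads=8):
        super().__init__()
        # Learned queries: their count fixes the token budget.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, dim) from a video patch embedder.
        b = patch_embeddings.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries (tgt) attend over patches (memory); output is
        # (batch, num_queries, dim), later discretized by a vector quantizer.
        return self.blocks(q, patch_embeddings)

# Example: 2048 flattened spatiotemporal patches compressed into 256 holistic tokens.
patches = torch.randn(2, 2048, 512)
tokens = HolisticTokenizer()(patches)
print(tokens.shape)  # torch.Size([2, 256, 512])
```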
The paper highlights the integration of a lightweight AR prior model during training, which pushes the latent space toward an AR-friendly configuration. This prior is trained jointly with the primary components of LARP but is excluded at inference, so it adds negligible computational cost after training. Because the prior learns to model the tokens autoregressively, it also determines a token ordering suited to AR generation, removing the need for the manually specified orderings required in classical approaches. A sketch of the joint training objective follows below.
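The following is a minimal sketch of such a joint objective, under assumed component interfaces and a hypothetical loss weight `w_prior` (the paper's exact loss formulation and weighting are not reproduced here): the AR prior's next-token loss is backpropagated into the tokenizer so that the discrete latent space becomes easier to model autoregressively, while the prior itself is dropped at inference.

```python
import torch.nn.functional as F

def training_step(encoder, quantizer, decoder, ar_prior, video, w_prior=0.1):
    """One joint update: tokenizer reconstruction + lightweight AR prior loss.
    Component interfaces and the weighting are assumptions for illustration."""
    latents = encoder(video)                      # (B, N, D) continuous latents
    codes, indices, vq_loss = quantizer(latents)  # quantized latents, code indices
    recon_loss = F.mse_loss(decoder(codes), video)

    # Lightweight AR prior: next-token prediction over the discrete indices.
    logits = ar_prior(indices[:, :-1])            # (B, N-1, codebook_size)
    prior_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), indices[:, 1:].reshape(-1)
    )

    # The prior loss shapes the latent space during training; the prior itself
    # is discarded at inference, so it adds no cost to the deployed tokenizer.
    return recon_loss + vq_loss + w_prior * prior_loss
```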
LARP demonstrates state-of-the-art performance on the UCF101 and Kinetics-600 benchmarks. It achieves a Fréchet Video Distance (FVD) of 57 on UCF101 class-conditional video generation, surpassing previously established methods, including closed-source ones. The results underscore the efficacy of combining holistic tokenization with the AR prior, yielding superior video generation quality across varied token sequence lengths.
One of the significant empirical findings is the observed independence between reconstruction FVD and generation FVD when scaling the LARP tokenizer. As the tokenizer grows, reconstruction performance improves consistently, but generation quality does not always follow the same trend, suggesting that the structure of the discrete latent space, rather than reconstruction fidelity alone, is what matters for AR generation. Furthermore, LARP supports an arbitrary number of latent tokens, allowing a balance between AR generative performance and efficiency.
Ablation studies confirm the AR prior model's pivotal role in enhancing the generative potential of the discrete latent space. Scheduled sampling and stochastic vector quantization (SVQ) provide further gains, indicating that both components are integral to optimizing the tokens for AR generation; an SVQ sketch follows below.
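For illustration, here is a minimal sketch of stochastic vector quantization, assuming a flat codebook and an illustrative temperature parameter (the paper's exact SVQ formulation may differ): the code index is sampled from a softmax over negative distances instead of taken as the hard nearest neighbor, with a straight-through estimator passing gradients back to the encoder.

```python
import torch
import torch.nn.functional as F

def stochastic_quantize(latents, codebook, temperature=1.0):
    """SVQ sketch: sample code indices from a softmax over negative distances.
    latents: (B, N, D); codebook: (K, D). The temperature is an assumed knob."""
    B, N, _ = latents.shape
    dists = torch.cdist(latents, codebook.unsqueeze(0).expand(B, -1, -1))  # (B, N, K)
    probs = F.softmax(-dists / temperature, dim=-1)
    indices = torch.multinomial(probs.flatten(0, 1), 1).view(B, N)
    quantized = codebook[indices]                                          # (B, N, D)
    # Straight-through estimator: forward pass uses the sampled code,
    # gradients flow to the continuous latents.
    return latents + (quantized - latents).detach(), indices
```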
Beyond video generation, LARP's innovation has broader implications, particularly for unified high-fidelity multimodal LLMs (MLLMs) that span multimodal comprehension and synthesis. Future research may explore integrating this tokenization approach into such systems, leveraging its ability to encode and generate multimodal data.
Overall, LARP presents a compelling advancement in the field of video tokenization for autoregressive models, significantly improving upon traditional methodologies and establishing a robust framework for future research and application in video content generation and beyond.