- The paper introduces LARP, a novel video tokenization method using learned queries and an autoregressive generative prior to enhance video generation for AR models.
- LARP achieves state-of-the-art video generation performance, reaching an FVD score of 57 on the UCF101 benchmark and surpassing previous methods.
- The method uses learned queries for flexible token counts and shows that optimizing the discrete latent space is crucial for AR generative quality, independent of reconstruction fidelity.
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
The paper introduces LARP, a novel approach to video tokenization designed specifically for autoregressive (AR) generative models, with the goal of improving the quality of generated video. Unlike traditional patchwise tokenizers, which directly encode local visual patches into discrete tokens, LARP adopts a holistic tokenization strategy: a set of learned queries captures global, semantic representations of the video content. By integrating a lightweight AR transformer as a generative prior during training, LARP aligns its latent space with the downstream AR generation task, addressing a key mismatch in existing video tokenization methods.
Central to LARP's methodology is the decoupling of discrete tokens from fixed input patches. Because the token count is determined by the number of learned queries rather than by the patch grid, it can be chosen flexibly, trading representation compactness and efficiency against fidelity. The learned queries allow LARP to produce high-fidelity, semantic video tokens at varying sequence lengths without inheriting the fixed token ordering imposed by traditional patch-to-token models; a minimal sketch of this learned-query design follows below.
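The sketch below is a hypothetical illustration (module and parameter names such as `HolisticTokenizer` and `num_queries` are assumptions, not the paper's implementation) of how learned queries decouple the token count from the patch grid: a fixed set of learned query embeddings cross-attends to the video's patch embeddings, so the number of output tokens is set by the query count.

```python
# Hypothetical sketch of learned-query tokenization (not the authors' code):
# learned query embeddings cross-attend to video patch embeddings, so the
# token count is set by num_queries, not by the size of the patch grid.
import torch
import torch.nn as nn

class HolisticTokenizer(nn.Module):
    def __init__(self, num_queries=256, dim=512, num_layers=4, num_heads=8):
        super().__init__()
        # Learned queries: their count fixes the token budget.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, dim) from a video patch embedder.
        b = patch_embeddings.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries (tgt) attend over patches (memory); output is
        # (batch, num_queries, dim), later discretized by a vector quantizer.
        return self.blocks(q, patch_embeddings)

# Example: 2048 flattened spatiotemporal patches compressed into 256 holistic tokens.
patches = torch.randn(2, 2048, 512)
tokens = HolisticTokenizer()(patches)
print(tokens.shape)  # torch.Size([2, 256, 512])
```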
The paper highlights the integration of a lightweight AR prior model during training, which pushes the latent space toward an AR-friendly configuration. This prior is trained jointly with the primary components of LARP but is excluded at inference, so it adds negligible computational cost after training. Because the prior learns to model the tokens autoregressively, it also determines a token ordering suited to AR generation, removing the need for the manually specified orderings required in classical approaches. A sketch of the joint training objective follows below.
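The following is a minimal sketch of such a joint objective, under assumed component interfaces and a hypothetical loss weight `w_prior` (the paper's exact loss formulation and weighting are not reproduced here): the AR prior's next-token loss is backpropagated into the tokenizer so that the discrete latent space becomes easier to model autoregressively, while the prior itself is dropped at inference.

```python
import torch.nn.functional as F

def training_step(encoder, quantizer, decoder, ar_prior, video, w_prior=0.1):
    """One joint update: tokenizer reconstruction + lightweight AR prior loss.
    Component interfaces and the weighting are assumptions for illustration."""
    latents = encoder(video)                      # (B, N, D) continuous latents
    codes, indices, vq_loss = quantizer(latents)  # quantized latents, code indices
    recon_loss = F.mse_loss(decoder(codes), video)

    # Lightweight AR prior: next-token prediction over the discrete indices.
    logits = ar_prior(indices[:, :-1])            # (B, N-1, codebook_size)
    prior_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), indices[:, 1:].reshape(-1)
    )

    # The prior loss shapes the latent space during training; the prior itself
    # is discarded at inference, so it adds no cost to the deployed tokenizer.
    return recon_loss + vq_loss + w_prior * prior_loss
```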
LARP demonstrates state-of-the-art performance on the UCF101 and Kinetics-600 benchmarks. It achieves a Fréchet Video Distance (FVD) of 57 on UCF101 class-conditional video generation, surpassing previously established methods, including closed-source ones. The results underscore the efficacy of combining holistic tokenization with the AR prior, yielding superior video generation quality across varied token sequence lengths.
One of the significant empirical findings is the observed independence between reconstruction FVD and generation FVD when scaling the LARP tokenizer. As the tokenizer grows, reconstruction performance improves consistently, but generation quality does not always follow the same trend, suggesting that the structure of the discrete latent space, rather than reconstruction fidelity alone, is what matters for AR generation. Furthermore, LARP supports an arbitrary number of latent tokens, allowing a balance between AR generative performance and efficiency.
Ablation studies confirm the AR prior model's pivotal role in enhancing the generative potential of the discrete latent space. Scheduled sampling and stochastic vector quantization (SVQ) provide further gains, indicating that both components are integral to optimizing the tokens for AR generation; an SVQ sketch follows below.
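For illustration, here is a minimal sketch of stochastic vector quantization, assuming a flat codebook and an illustrative temperature parameter (the paper's exact SVQ formulation may differ): the code index is sampled from a softmax over negative distances instead of taken as the hard nearest neighbor, with a straight-through estimator passing gradients back to the encoder.

```python
import torch
import torch.nn.functional as F

def stochastic_quantize(latents, codebook, temperature=1.0):
    """SVQ sketch: sample code indices from a softmax over negative distances.
    latents: (B, N, D); codebook: (K, D). The temperature is an assumed knob."""
    B, N, _ = latents.shape
    dists = torch.cdist(latents, codebook.unsqueeze(0).expand(B, -1, -1))  # (B, N, K)
    probs = F.softmax(-dists / temperature, dim=-1)
    indices = torch.multinomial(probs.flatten(0, 1), 1).view(B, N)
    quantized = codebook[indices]                                          # (B, N, D)
    # Straight-through estimator: forward pass uses the sampled code,
    # gradients flow to the continuous latents.
    return latents + (quantized - latents).detach(), indices
```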
Beyond video generation, LARP's innovation has broader implications, particularly for unified high-fidelity multimodal LLMs (MLLMs) that span multimodal comprehension and synthesis. Future research may explore integrating this tokenization approach into such systems, leveraging its ability to encode and generate multimodal data.
Overall, LARP presents a compelling advancement in the field of video tokenization for autoregressive models, significantly improving upon traditional methodologies and establishing a robust framework for future research and application in video content generation and beyond.