An Empirical Study of Autoregressive Pre-training from Videos
This paper presents a detailed empirical study of autoregressive pre-training from video data, carried out through a family of causal transformer models called Toto. The core idea is to treat videos as sequences of visual tokens and train transformers to predict future tokens autoregressively, adapting the next-token-prediction paradigm of LLMs to the visual domain.
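As a rough illustration of this setup (a minimal sketch under assumed names and shapes, not the paper's code), the snippet below trains a small decoder-only transformer to predict the next visual token of a clip that has already been mapped to discrete token ids by some tokenizer, e.g. a VQGAN-style codebook:

```python
# Minimal sketch of next-visual-token prediction; model sizes, vocabulary,
# and shapes are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class TinyVideoAR(nn.Module):
    def __init__(self, vocab_size=8192, dim=256, depth=4, heads=4, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)   # visual token embeddings
        self.pos_emb = nn.Embedding(max_len, dim)      # learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)         # logits over the codebook

    def forward(self, ids):
        T = ids.size(1)
        pos = torch.arange(T, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        x = self.blocks(x, mask=causal_mask)           # causal self-attention only
        return self.head(x)

# One training step: predict token t+1 from tokens <= t with cross-entropy.
model = TinyVideoAR()
ids = torch.randint(0, 8192, (2, 257))                 # [batch, tokens] of a tokenized clip
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1)
)
loss.backward()
```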
Pre-training is conducted on a large-scale dataset of over one trillion visual tokens drawn from both videos and images. This scale supports training high-capacity models, which are then evaluated across a range of downstream tasks such as image recognition, video classification, object tracking, and robotics. A notable aspect of the paper is its systematic study of design choices in architecture, training strategy, and inference method to improve model performance.
Empirically, the study finds that autoregressive video models achieve competitive performance on all evaluated downstream benchmarks despite using minimal inductive biases. When examining scaling behavior, the paper observes scaling curves similar to those of LLMs, though at a different rate, pointing to compute and data dynamics specific to visual tokens.
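To make "similar curve at a different rate" concrete, scaling-law analyses typically fit a power law of validation loss against training compute; the form below is the standard one from the LLM literature, with the constants left symbolic rather than taken from the paper:

```latex
% Standard scaling-law form; a, b, and L_inf are fitted constants, and none
% of the paper's fitted values are reproduced here.
L(C) = a \, C^{-b} + L_{\infty}
```

Under this reading, the video models share the functional form with LLMs but are fitted with a different exponent b.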
The paper also examines specific design choices: different tokenizers (dVAE and VQGAN), different architectures (LLaMA-style and GPT-2-style), and compute-efficient pre-training strategies such as progressive resolution scaling. This exploration shows that models trained on discrete tokens and on continuous patch-normalized tokens perform similarly on image classification, and it identifies progressive resolution scaling as an effective way to improve training efficiency (see the sketch below).
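The following sketch illustrates the general idea behind progressive resolution scaling (the phase splits, resolutions, and patch size are assumptions for illustration, not the paper's settings): most steps are spent on low-resolution frames, which yield fewer visual tokens and therefore cheaper sequences, with a final phase at the target resolution.

```python
# Illustrative schedule for progressive resolution scaling; all numbers are
# assumptions chosen only to show how token counts change with resolution.
resolution_schedule = [
    # (fraction of training steps, frame resolution in pixels)
    (0.75, 128),   # cheap phase: fewer tokens per frame
    (0.25, 256),   # final phase: full target resolution
]

def tokens_per_frame(resolution, patch_size=16):
    """Visual tokens per frame for a patch-based (or grid codebook) tokenizer."""
    return (resolution // patch_size) ** 2

total_steps = 100_000
for fraction, res in resolution_schedule:
    phase_steps = int(fraction * total_steps)
    print(f"{phase_steps} steps at {res}px -> {tokens_per_frame(res)} tokens/frame")
```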
The paper also addresses probing strategies for evaluating representation quality, showing that attention pooling outperforms average pooling because it can adapt to differences in receptive field across models and tasks. Notably, for decoder-only autoregressive models the best features come from intermediate layers, in contrast to encoder-based models, which typically rely on final-layer activations.
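A minimal sketch of such an attention-pooling probe is shown below (module names and shapes are assumptions): a single learned query attends over the frozen token activations of an intermediate layer, and only the probe is trained; average pooling would simply replace the attention step with feats.mean(dim=1).

```python
# Attention-pooling probe over frozen features; dimensions and class count
# are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

class AttentionPoolProbe(nn.Module):
    def __init__(self, dim, num_classes, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learned query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats):                 # feats: [batch, tokens, dim], frozen
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)  # query attends over all tokens
        return self.classifier(pooled.squeeze(1))

# Probe hypothetical activations taken from a middle block of the frozen
# decoder rather than from its last layer.
feats = torch.randn(4, 196, 768)
probe = AttentionPoolProbe(dim=768, num_classes=1000)
logits = probe(feats)                          # [4, 1000]
```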
From a practical standpoint, the research has implications for improving visual recognition and for extending video-pretrained models to areas such as robotics and video forecasting. On the theoretical side, it deepens the understanding of how tokenization and model design affect visual pre-training. Looking ahead, these insights could guide the development of more robust, scalable models that better exploit the vast amounts of unlabeled video available, potentially paralleling the transformative impact of autoregressive pre-training in NLP.
Overall, the paper enriches the discourse on cross-modal learning, demonstrating the adaptability of autoregressive approaches from text to complex visual contexts, and providing a concrete foundation for subsequent innovations in the field of visual representation learning.