Introduction
The emergence of multimodal LLMs has accelerated progress toward AI systems that understand and generate content across text and visual domains. Existing multimodal LLMs handle image inputs effectively; however, adapting them to video, a dynamic medium with a strong temporal dimension, poses additional challenges. We propose Video-LaVIT (Language-VIsion Transformer), which advances video-language pre-training by efficiently integrating spatiotemporal dynamics into LLMs.
Architecture
Video-LaVIT introduces an efficient video representation that disentangles each video into keyframes and temporal motions, each tokenized separately so that far fewer tokens are produced while the essential visual and motion content is preserved. This design exploits the inherent redundancy in video data: keyframes carry the main visual semantics, and the evolution of temporal motion is captured in a compact form. Specifically, keyframes are processed by a visual tokenizer that reuses knowledge from a pre-trained image LLM, while temporal motions are handled by a novel spatiotemporal encoder that quantizes motion vectors, yielding substantial token savings (more than 90% of tokens saved for a 2.2 s clip). Video-LaVIT comprises two main components: a tokenizer that maps the video modality into discrete tokens and a detokenizer that efficiently reconstructs the original video pixels.
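To make the decoupled tokenization concrete, the following is a minimal sketch rather than the authors' implementation: it reduces a clip to one keyframe plus a coarse grid of per-patch motion summaries and quantizes each stream against its own codebook. The learned encoders, codebook sizes, and motion-vector extraction from the compressed stream are all simplified away, and every name below is a hypothetical illustration.

```python
# Hedged sketch of keyframe/motion tokenization (not the released code).
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # features: (N, D), codebook: (K, D) -> token ids of shape (N,)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def tokenize_clip(frames, keyframe_codebook, motion_codebook, patch=16):
    """frames: (T, H, W, 3) uint8 clip; returns (keyframe_tokens, motion_tokens)."""
    T, H, W, _ = frames.shape
    key = frames[0].astype(np.float32) / 255.0            # keyframe carries appearance
    # Crude stand-in for a learned visual encoder: per-patch mean color features.
    kp = key.reshape(H // patch, patch, W // patch, patch, 3).mean((1, 3)).reshape(-1, 3)
    key_tokens = quantize(kp, keyframe_codebook)

    # Crude stand-in for motion vectors: per-patch means of frame differences.
    diffs = np.diff(frames.astype(np.float32) / 255.0, axis=0)   # (T-1, H, W, 3)
    mp = diffs.reshape(T - 1, H // patch, patch, W // patch, patch, 3)
    mp = mp.mean((2, 4)).reshape(-1, 3)                   # one summary per patch per step
    motion_tokens = quantize(mp, motion_codebook)
    return key_tokens, motion_tokens

# Toy usage: a 9-frame 64x64 clip with 256-entry codebooks.
rng = np.random.default_rng(0)
clip = rng.integers(0, 255, size=(9, 64, 64, 3), dtype=np.uint8)
key_cb, mot_cb = rng.normal(size=(256, 3)), rng.normal(size=(256, 3))
k_tok, m_tok = tokenize_clip(clip, key_cb, mot_cb)
print(len(k_tok), len(m_tok))   # token counts for the keyframe and motion streams
```

The point of the sketch is the structure, not the numbers: only one frame per clip receives full visual tokenization, while the remaining frames are summarized by a compact quantized motion stream.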
Unified Generative Modeling
The framework performs unified generative pre-training with an autoregressive model that extends beyond images to video. It ingests interleaved sequences of keyframe and motion tokens together with image and text tokens, and it is optimized under a single next-token prediction objective. The result is a system that internalizes the sequential relationships between video clips, enhancing its capacity to understand and generate long multimodal sequences.
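The sketch below, again a simplified assumption rather than the released code, shows why a single next-token objective suffices: if text, keyframe, and motion tokens share one vocabulary (for illustration, with disjoint id ranges), one shifted cross-entropy loss covers every modality in the interleaved sequence. The tiny model here is only a stand-in to make the loss computation concrete.

```python
# Hedged sketch of the unified next-token prediction objective.
import torch
import torch.nn as nn

VOCAB = 1000   # toy shared vocabulary (text + keyframe + motion ids)
D = 64

class TinyARModel(nn.Module):
    """Stand-in decoder (no real attention) just to produce per-position logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, ids):                   # ids: (B, L)
        return self.head(self.embed(ids))     # logits: (B, L, VOCAB)

def next_token_loss(model, ids):
    """Shift-by-one cross-entropy, identical for every modality's tokens."""
    logits = model(ids[:, :-1])
    targets = ids[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1))

# Toy interleaved sequence: [text ids][keyframe ids][motion ids][text ids] ...
ids = torch.randint(0, VOCAB, (2, 48))
model = TinyARModel()
loss = next_token_loss(model, ids)
loss.backward()   # one objective jointly trains understanding and generation
print(float(loss))
```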
Experimental Validation
Video-LaVIT demonstrates competitive and, in many cases, state-of-the-art performance across 13 multimodal benchmarks covering image and video understanding and generation. On zero-shot video question answering, it shows clear numerical gains; for instance, it achieves 73.5% accuracy on MSVD-QA, outperforming previous methods. On the demanding task of zero-shot text-to-video generation, it surpasses established baselines on the Fréchet Video Distance (169.51 on MSR-VTT, lower is better) and on other metrics of video quality.
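For context, FVD measures how far the feature distribution of generated videos lies from that of real videos. Assuming Gaussian fits of features extracted by a video network (commonly an I3D model), it reduces to the Fréchet distance

$$\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the real and generated video features; lower values indicate generated videos whose feature statistics more closely match real ones.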
Qualitatively, comparisons of generated content with contemporaneous models reveal Video-LaVIT's strengths in producing cohesive, contextually appropriate images and videos. In both text-to-image and image-to-video generation, the model shows a strong ability to grasp and visually depict intricate textual descriptions.
Conclusion
In conclusion, Video-LaVIT marks a significant advance in video-language pre-training. By efficiently tokenizing video into keyframes and temporal motions, it enables LLMs to extend effectively to the video domain. The combination of quantitative results, qualitative demonstrations, and innovations in model design establishes Video-LaVIT as a meaningful step towards holistic multimodal AI that can seamlessly traverse text, images, and videos. Further details on the architecture, along with ablation studies and limitations, can be found in the supplementary materials accompanying the paper.