A Joint Image-Video Tokenizer for Visual Generation
This paper introduces \system, a transformer-based tokenizer for joint image and video tokenization, addressing a key limitation of prior work: maintaining separate tokenizers for each visual modality. The research enhances visual generation models by integrating spatial and temporal learning within a unified architecture.
Methodology and Design
At the core of \system is a spatial-temporal decoupled architecture, which employs window attention over the spatial dimensions for efficient local aggregation and causal attention along the temporal dimension for coherent motion modeling. Unlike conventional methods that rely on separate frameworks for images and videos, \system is built on a shared architecture and uses a progressive training paradigm to exploit the complementary nature of the two data types.
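To make the decoupled design concrete, the following is a minimal PyTorch sketch of one such block, not the authors' implementation: window attention restricted to local spatial patches within each frame, followed by causal attention across frames at each spatial location. The class name, layer choices, window size, and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DecoupledSTBlock(nn.Module):
    """Sketch of a spatial-temporal decoupled block: window attention over
    space, causal attention over time."""

    def __init__(self, dim: int, heads: int = 4, window: int = 4):
        super().__init__()
        self.window = window
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, height, width, dim); H and W assumed divisible by `window`.
        B, T, H, W, D = x.shape
        w = self.window

        # Spatial window attention: each token attends only within its local w x w window.
        s = x.reshape(B, T, H // w, w, W // w, w, D)
        s = s.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, w * w, D)
        n = self.norm_s(s)
        s = s + self.spatial_attn(n, n, n)[0]
        s = s.reshape(B, T, H // w, W // w, w, w, D).permute(0, 1, 2, 4, 3, 5, 6)
        x = s.reshape(B, T, H, W, D)

        # Temporal causal attention: each frame attends only to itself and earlier frames.
        t = x.permute(0, 2, 3, 1, 4).reshape(-1, T, D)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        n = self.norm_t(t)
        t = t + self.temporal_attn(n, n, n, attn_mask=mask)[0]
        return t.reshape(B, H, W, T, D).permute(0, 3, 1, 2, 4)


block = DecoupledSTBlock(dim=64)
video = torch.randn(2, 5, 16, 16, 64)   # (batch, frames, H, W, channels)
assert block(video).shape == video.shape
```

Because attention over space is confined to local windows and attention over time is causal, a single image (a one-frame clip) and a long video pass through exactly the same layers, which is what allows the shared architecture described next.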
The proposed training strategy involves two stages: an initial image-only phase that builds spatial encoding capability, followed by a multi-resolution joint phase on both image and video data that develops temporal modeling. Through this progressive learning approach, \system unifies the tokenization process across visual modalities, enabling more versatile and scalable generative models.
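The sketch below illustrates one way such a two-stage schedule could be wired up; it is not the authors' training code. The plain MSE loss stands in for the actual reconstruction objective, the data loaders are assumed to yield batched tensors, the modality alternation in stage two is one possible choice, and multi-resolution sampling is omitted.

```python
import torch
import torch.nn.functional as F


def train_progressively(tokenizer, optimizer, image_loader, video_loader,
                        steps_per_stage=10_000):
    # Stage 1: image-only training to establish spatial encoding.
    for _, images in zip(range(steps_per_stage), image_loader):
        clips = images.unsqueeze(1)          # treat an image as a 1-frame clip: (B, 1, C, H, W)
        loss = F.mse_loss(tokenizer(clips), clips)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: joint image-video training; alternating modalities keeps the shared
    # weights' spatial ability while they learn temporal dynamics.
    for step, (images, videos) in zip(range(steps_per_stage),
                                      zip(image_loader, video_loader)):
        clips = images.unsqueeze(1) if step % 2 == 0 else videos
        loss = F.mse_loss(tokenizer(clips), clips)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```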
Empirical Validation
\system's efficacy is demonstrated through extensive experiments on ImageNet, CelebA-HQ, FFHQ, UCF-101, Kinetics, and other datasets, where it achieves superior reconstruction performance. Notably, the research reports a 1.11 reconstruction FID on ImageNet and a 42 reconstruction FVD on UCF-101, outperforming existing methods by margins of 13% and 26%, respectively. These results indicate that spatial and temporal dynamics are effectively integrated within a single model.
Applicability to Generative Models
When incorporated into generative frameworks, \system yields marked improvements in visual synthesis. LLMs and diffusion models built on \system perform strongly on generation tasks such as class-conditional and unconditional generation, as well as frame prediction. The ability to decode both static images and dynamic video sequences with a single set of parameters underscores the potential of the shared framework for high-quality generative outputs.
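As a rough illustration of how a shared tokenizer could serve an autoregressive generator for frame prediction, here is a hedged sketch; the encode/decode method names, the language-model interface, greedy decoding, and all tensor shapes are assumptions rather than the paper's API.

```python
import torch


@torch.no_grad()
def predict_future_frames(tokenizer, language_model, context_frames, num_new_tokens):
    # Encode the observed frames into a flat sequence of discrete token ids.
    token_ids = tokenizer.encode(context_frames)    # assumed shape: (B, N) integer codes
    # Autoregressively extend the sequence one token at a time (greedy decoding).
    for _ in range(num_new_tokens):
        logits = language_model(token_ids)          # assumed shape: (B, N, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
    # The same tokenizer weights decode single images and multi-frame clips alike.
    return tokenizer.decode(token_ids)
```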
Implications and Future Directions
The implications of this research are manifold. Practically, \system paves the way for more flexible and efficient deployments of generative models across varied visual media without necessitating distinct models for each modality. Theoretically, it expands our understanding of multi-modal learning and sets the stage for further exploration into scalable and unified architectures for diverse AI applications.
Looking forward, the scalability of \system suggests potential for advancements in efficiency and effectiveness as model and dataset sizes continue to grow. Additionally, future research could explore the extension of this framework to other modalities and tasks, broadening its application scope within artificial intelligence.
In conclusion, \system represents a significant stride toward integrating visual modalities through its architectural design and training strategy, offering a robust tool for the future development of generative models.