Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction (2411.14762v2)

Published 22 Nov 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128$\times$128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

The paper presents an innovative approach to video tokenization with CoordTok, a scalable video tokenizer that encodes long video sequences efficiently by exploiting their temporal coherence through coordinate-based patch reconstruction. The method significantly reduces the computational demands typically associated with tokenizing long videos, enabling more memory-efficient processing and downstream tasks such as video generation with transformer models.

CoordTok distinguishes itself from conventional video tokenization methods by incorporating techniques inspired by recent advances in 3D generative models. It encodes a video into factorized triplane representations and reconstructs the video patches that correspond to randomly sampled $(x, y, t)$ coordinates. This coordinate-based approach contrasts with existing methods, which are trained to reconstruct all frames at once and therefore demand far more computational resources on long videos.
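To make this concrete, below is a minimal PyTorch sketch of the coordinate-based decoding step: the three learned feature planes are bilinearly sampled at each queried $(x, y, t)$ location, and a small MLP maps the gathered features to a pixel patch. All module names, dimensions, and the use of `grid_sample` here are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of coordinate-based patch reconstruction from
# factorized triplanes (illustrative assumptions throughout; this is
# not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDecoder(nn.Module):
    """Decodes the pixel patch at a queried (x, y, t) coordinate."""

    def __init__(self, feat_dim: int = 256, patch_dim: int = 3 * 8 * 8):
        super().__init__()
        # Features sampled from the (x,y), (x,t), and (y,t) planes are
        # concatenated and mapped to one flattened RGB patch.
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, 512),
            nn.GELU(),
            nn.Linear(512, patch_dim),
        )

    @staticmethod
    def sample_plane(plane, u, v):
        # plane: (B, C, H, W); u, v: (B, N) in [-1, 1] -> (B, N, C)
        grid = torch.stack([u, v], dim=-1).unsqueeze(1)        # (B, 1, N, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)  # (B, C, 1, N)
        return feat.squeeze(2).transpose(1, 2)

    def forward(self, planes, coords):
        # planes: dict of triplane feature maps produced by the encoder
        # coords: (B, N, 3) normalized (x, y, t) patch coordinates
        x, y, t = coords.unbind(dim=-1)
        feats = torch.cat([
            self.sample_plane(planes["xy"], x, y),
            self.sample_plane(planes["xt"], x, t),
            self.sample_plane(planes["yt"], y, t),
        ], dim=-1)
        return self.mlp(feats)  # (B, N, patch_dim) reconstructed patches

# During training, the reconstruction loss is computed only on a small
# set of randomly sampled coordinates, so the per-step cost does not
# grow with the number of frames in the clip.
B, N = 4, 1024
planes = {k: torch.randn(B, 256, 32, 32) for k in ("xy", "xt", "yt")}
coords = torch.rand(B, N, 3) * 2 - 1   # random (x, y, t) in [-1, 1]
patches = TriplaneDecoder()(planes, coords)  # (4, 1024, 192)
```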

Key Contributions and Results

  1. Encoding Efficiency: CoordTok demonstrates notable efficiency in tokenizing long video sequences. For example, it can encode a 128-frame video at 128$\times$128 resolution into just 1280 tokens, matching the reconstruction quality of baselines that require 6144 or 8192 tokens.
  2. Reduced Computational Requirements: The approach enables training on hardware with constrained resources, such as a single NVIDIA 4090 GPU, supporting large batch sizes and extended video sequences without the memory-overflow issues that affect many existing tokenizers.
  3. Integration with Diffusion Transformers: CoordTok's efficient tokenization also makes it compatible with diffusion transformers, enabling memory-efficient training of transformers that generate long video sequences in a single pass, which is not feasible with other tokenizers due to resource constraints (a rough cost comparison follows this list).
  4. Robust Quantitative Outcomes: Experimental results validate the efficacy of CoordTok, with metrics such as PSNR and LPIPS showing significant improvements over established baselines. Its faithful preservation of temporal coherence in reconstructed videos further confirms its suitability for long-video tokenization.
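As a rough illustration of point 3, the snippet below compares the per-layer self-attention cost implied by the reported token counts under the standard quadratic-attention assumption. The token counts come from the paper; the cost model itself is a simplification for intuition, not a measured benchmark.

```python
# Back-of-envelope attention-cost comparison for a 128-frame clip.
# Token counts are the paper's reported numbers; the quadratic cost
# model (pairwise token interactions per layer) is an assumption.
def attention_cost(num_tokens: int) -> int:
    return num_tokens ** 2

base = attention_cost(1280)
for name, tokens in [("CoordTok", 1280),
                     ("baseline (6144 tokens)", 6144),
                     ("baseline (8192 tokens)", 8192)]:
    ratio = attention_cost(tokens) / base
    print(f"{name:24s} -> {ratio:5.1f}x relative attention cost")
# CoordTok -> 1.0x; 6144 tokens -> 23.0x; 8192 tokens -> 41.0x
```

This quadratic gap is why a 5-6x reduction in token count translates into a much larger reduction in attention compute and activation memory when training a diffusion transformer on full 128-frame clips.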

Implications and Future Directions

The implications of CoordTok span several areas of AI that require efficient video processing and analysis. Its efficiency gains for long-video generation could substantially benefit industries that rely on video content, from entertainment and media to automated surveillance and remote education.

Moving forward, improving CoordTok's handling of highly dynamic sequences could further extend its applicability. One proposed direction is to introduce multiple content representations determined dynamically by video complexity, much like the adaptive techniques used in video codecs. Additionally, scaling the method to longer sequences and more varied datasets would be a critical step toward establishing CoordTok as a versatile tool for video analytics.

In conclusion, CoordTok provides a robust, scalable framework that addresses the primary challenges of video tokenization, offering both theoretical advances and practical benefits for future video processing and AI research.

Authors (5)
  1. Huiwon Jang (8 papers)
  2. Sihyun Yu (16 papers)
  3. Jinwoo Shin (196 papers)
  4. Pieter Abbeel (372 papers)
  5. Younggyo Seo (25 papers)