Overview of VFRTok: Variable Frame Rates Video Tokenizer
The paper "VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption" presents a novel approach to video tokenization that challenges prevailing assumptions in the field of video generation. The authors propose the Duration-Proportional Information Assumption, arguing that the upper bound on the information capacity of a video is proportional to its duration rather than the number of frames. This insight forms the basis for VFRTok, a Transformer-based video tokenizer designed to encode and decode videos at variable frame rates, using asymmetric frame rate training between encoder and decoder.
Key Contributions
- Duration-Proportional Information Assumption: The paper challenges the Frame-Proportional Information Assumption that underlies existing video tokenizers, which tie token count to the number of frames. Treating information capacity as proportional to duration instead allows a fixed token budget per unit of time, enabling more efficient compression and representation.
- VFRTok Architecture: VFRTok employs a query-based Transformer design that lets the encoder and decoder operate at different frame rates. Because the latent size is tied to duration rather than frame count, computational cost does not grow linearly with the frame rate.
- Partial Rotary Position Embeddings (RoPE): Fully position-aware modeling entangles positional and content information. The paper introduces Partial RoPE, which applies rotary embeddings to only part of the representation, decoupling position encoding from content modeling to improve content-awareness and video generation quality; a minimal sketch of the idea follows this list.
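Below is a minimal, runnable sketch of the Partial RoPE idea: rotary embeddings are applied to only a fraction of the query/key channels, leaving the remaining channels position-agnostic so they can attend on content alone. The channel-wise split and the `rotary_frac` parameter are assumptions for illustration; the paper's exact partitioning scheme may differ.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary position embedding over the last dimension (must be even)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def partial_rope(q: torch.Tensor, k: torch.Tensor, rotary_frac: float = 0.5):
    """Rotate only the first `rotary_frac` of channels; keep the rest content-only.

    The rotated slice carries position; the untouched slice is free to attend
    purely on content, which is the decoupling Partial RoPE aims for.
    """
    d = q.shape[-1]
    r = int(d * rotary_frac) // 2 * 2   # rotated slice must have even width
    q = torch.cat([rope(q[..., :r]), q[..., r:]], dim=-1)
    k = torch.cat([rope(k[..., :r]), k[..., r:]], dim=-1)
    return q, k

q, k = torch.randn(2, 16, 64), torch.randn(2, 16, 64)   # (batch, seq, head_dim)
q_pe, k_pe = partial_rope(q, k)   # first 32 channels positional, last 32 content-only
```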
Experimental Insights
The authors conducted extensive experiments comparing VFRTok to existing tokenizers on multiple datasets, including Kinetics-600 (K600) and UCF101. VFRTok delivers competitive reconstruction quality while achieving state-of-the-art generation fidelity, and it does so with only $1/8$ of the tokens used by existing methods, yielding a substantial reduction in computational cost. VFRTok also converges notably faster on video generation tasks, up to 21.6× faster than prior work such as OmniTokenizer.
Additionally, VFRTok supports video frame interpolation natively: a video encoded at 12 FPS can be decoded at rates up to 120 FPS. This highlights VFRTok's flexibility and its potential in video generation tasks requiring variable frame rates.
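The following toy sketch shows the mechanism that makes such interpolation natural in a query-based design: decoder queries are generated from continuous timestamps, so the same duration-proportional latent can be decoded into any number of frames. It is runnable but entirely illustrative; the module, names, and sizes are invented for this summary, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class DurationQueries(nn.Module):
    """Toy sketch: one decoder query per output frame, built from continuous
    timestamps, so a fixed latent can be decoded at any target frame rate."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, latent: torch.Tensor, duration_s: float, target_fps: int):
        n_frames = int(duration_s * target_fps)
        t = torch.linspace(0.0, duration_s, n_frames).unsqueeze(-1)        # (frames, 1)
        queries = self.time_mlp(t).unsqueeze(0).expand(latent.size(0), -1, -1)
        frames, _ = self.cross_attn(queries, latent, latent)               # one token per frame
        return frames

latent = torch.randn(1, 32, 64)   # duration-proportional latent: size fixed by duration
dec = DurationQueries()
print(dec(latent, duration_s=2.0, target_fps=12).shape)    # (1, 24, 64)
print(dec(latent, duration_s=2.0, target_fps=120).shape)   # (1, 240, 64) -- same latent
```

The key design point is that the latent never changes: only the number of timestamp-conditioned queries does, so raising the output frame rate costs extra decoding but no re-encoding.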
Theoretical and Practical Implications
The Duration-Proportional Information Assumption and VFRTok architecture introduce a new paradigm in video generation that could influence both theoretical models and practical applications. By decoupling frame rate from information capacity, VFRTok offers a pathway to more adaptive and efficient video models, potentially leading to broader applications in areas such as real-time video editing, streaming services, and virtual reality experiences.
In a theoretical context, VFRTok's variable frame rate encoding aligns with a broader shift toward continuous spatio-temporal representations in video modeling. Partial RoPE further enriches the Transformer framework, offering insight into how to balance positional information against content modeling, a recurring consideration in AI-based generation tasks.
Future Directions
The paper suggests several avenues for future research. One direction is improving VFRTok's scalability to longer video sequences through causal window attention, which would bound attention cost while preserving temporal ordering. Another is optimizing latent space capacity for tasks beyond generation, such as video classification and action recognition, which could likewise leverage the insights from VFRTok's architecture.
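To make the causal-window direction concrete, here is a minimal sketch of how such an attention mask could be built. This illustrates the general technique only, not anything specified in the paper; the `window` parameter is hypothetical.

```python
import torch

def causal_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if token i may attend to token j.

    Causal: j <= i (no attending to future frames).
    Windowed: j > i - window (attention cost stays bounded for long videos).
    """
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(causal_window_mask(6, 3).int())   # lower-triangular band of width 3
```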
Overall, VFRTok presents a compelling case for reevaluating traditional assumptions in video generation and highlights the transformative potential of its architecture within the field of AI-driven video processing.