Extending Video Masked Autoencoders to 128 Frames: Enhancements in Long Video Understanding
This paper tackles the challenge of extending the masked autoencoder (MAE) framework for self-supervised video representation learning to much longer sequences, up to 128 frames. The central obstacle is that dense self-attention over long videos runs into hardware memory and compute limits. The authors propose a novel adaptive decoder masking strategy that prioritizes the most important tokens for reconstruction during decoding. A MAGVIT-based tokenizer is used to jointly learn the tokens and their importance scores, allowing the model to remain efficient while improving performance.
Key Contributions and Approach
- Adaptive Decoder Masking Strategy: The core innovation is an adaptive scheme that scores token importance and reconstructs only the highest-ranked tokens in the decoder, departing from the random, uniform masking of prior MAE variants. Restricting reconstruction to the most important tokens keeps decoder computation within budget (a minimal sketch follows this list).
- Scalability to Long Videos: With adaptive decoder masking, the authors scale the MAE framework to 128-frame inputs. This is challenging because self-attention cost grows quadratically with the number of tokens, which in turn grows linearly with video length (see the token-count example after this list). Selecting tokens adaptively yields significant memory savings and makes long-video encoding feasible.
- Enhanced Performance: Empirically, the long-video MAE outperforms its short-video counterpart. The paper reports notable improvements on datasets such as Diving48 and EPIC-Kitchens-100 (verb classification), without language supervision or extensive labeled data during pre-training.
- Quantized Token Reconstruction: The reconstruction targets are quantized tokens produced by the MAGVIT-based tokenizer rather than raw pixels. Together with adaptive masking, this objective improves the learned video representations (a sketch of the masked token loss appears after this list).
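To make the masking strategy concrete, here is a minimal sketch in PyTorch of importance-based decoder masking. The function name, the keep_ratio value, and the assumption that per-token importance scores are already available (for example, produced alongside the tokenizer) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def adaptive_decoder_mask(importance: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of tokens per sample as reconstruction
    targets, ranked by importance score.

    importance: (batch, num_tokens), higher score = more important.
    Returns a boolean mask of the same shape; True = reconstruct this token.
    """
    batch, num_tokens = importance.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    keep_idx = importance.topk(num_keep, dim=1).indices        # (batch, num_keep)
    mask = torch.zeros(batch, num_tokens, dtype=torch.bool, device=importance.device)
    mask.scatter_(1, keep_idx, True)
    return mask

# Hypothetical usage: 128 frames tokenized into 64 temporal x 14 x 14 spatial tokens.
scores = torch.rand(2, 64 * 14 * 14)
decode_mask = adaptive_decoder_mask(scores, keep_ratio=0.25)
```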
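The quadratic cost mentioned above can be illustrated with rough, back-of-the-envelope token counts; the 224x224 resolution and 2x16x16 tubelet size below are typical choices for video transformers and are assumed here for illustration only.

```python
# Illustrative token counts for 224x224 frames with 2x16x16 tubelet patching
# (assumed values, not taken from the paper).
tokens_16f = (16 // 2) * (224 // 16) * (224 // 16)    # 8 * 14 * 14 = 1,568 tokens
tokens_128f = (128 // 2) * (224 // 16) * (224 // 16)  # 64 * 14 * 14 = 12,544 tokens

# Dense self-attention cost scales with the square of the token count,
# so going from 16 to 128 frames costs roughly 64x more attention pairs.
print((tokens_128f / tokens_16f) ** 2)  # 64.0
```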
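The quantized-token objective can similarly be sketched as a masked cross-entropy over codebook indices. The helper below is a hedged illustration; the actual decoder head, vocabulary size, and loss weighting in the paper may differ.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(logits: torch.Tensor,
                      target_ids: torch.Tensor,
                      decode_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between decoder predictions and quantized token ids,
    averaged only over the tokens selected for reconstruction.

    logits:      (batch, num_tokens, vocab_size) decoder outputs.
    target_ids:  (batch, num_tokens) codebook indices from the video tokenizer.
    decode_mask: (batch, num_tokens) bool; True = include this token in the loss.
    """
    per_token = F.cross_entropy(
        logits.flatten(0, 1),       # (batch * num_tokens, vocab_size)
        target_ids.flatten(),       # (batch * num_tokens,)
        reduction="none",
    ).view_as(target_ids)           # back to (batch, num_tokens)
    masked = per_token * decode_mask          # zero out non-reconstructed tokens
    return masked.sum() / decode_mask.sum().clamp(min=1)
```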
Results and Implications
- Superior Performance: The proposed long-video MAE achieves state-of-the-art results, surpassing the previous best methods on Diving48 by 3.9 points and on EPIC-Kitchens-100 verb classification by 2.5 points. These gains come from a simpler architecture that does not rely on extensive labeled pre-training datasets, highlighting the efficacy of the proposed adaptive masking.
- Practical and Theoretical Implications: Practically, this work makes it feasible to process much longer video sequences within the MAE framework, which matters for domains that require understanding complex, prolonged actions, such as sports analytics or surveillance. Theoretically, it reinforces the importance of efficient token management when modeling videos with long temporal context.
- Future Developments: This paper opens avenues for further exploration into scaling video models. Future work could investigate other efficient encoding strategies, larger models, or combining long local context processing with global memory modules. There is potential to extend these ideas to multi-modal domains involving text-video interactions.
Conclusion
The work presented in this paper constitutes a significant enhancement in video understanding by scaling masked autoencoders to handle longer sequences efficiently. The introduction of adaptive decoder masking, coupled with effective token prioritization and quantization, provides a compelling approach to navigating computational constraints in video processing. As AI systems continue to progress, the ability to understand and encode longer video sequences robustly will be invaluable, and this paper charts a promising direction towards achieving that goal.