EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens (2211.10636v6)
Abstract: Masked Video Autoencoder (MVA) approaches have demonstrated their potential by significantly outperforming previous video representation learning methods. However, due to their random masking strategies, they waste an excessive amount of computation and memory predicting uninformative tokens/frames (e.g., over 16 nodes with 128 NVIDIA A100 GPUs). To resolve this issue, we exploit the unequal information density among the patches in videos and propose EVEREST, a surprisingly efficient MVA approach for video representation learning that finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning. We further present an information-intensive frame selection strategy that allows the model to focus on informative and causal frames with minimal redundancy. Our method significantly reduces the computation and memory requirements of MVA, enabling pre-training and fine-tuning on a single machine with 8 GPUs while achieving comparable performance to computation- and memory-heavy baselines on multiple benchmarks and the uncurated Ego4D dataset. We hope that our work contributes to reducing the barrier to further research on video understanding.
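To make the core idea concrete, below is a minimal sketch (not the authors' released code) of the token-selection step described in the abstract: rank spatiotemporal tokens by how much they change between temporally adjacent frames and keep only the most motion-rich ones before the encoder. The scoring function, keep ratio, and tensor shapes are illustrative assumptions.

```python
import torch

def select_motion_rich_tokens(patch_embeds: torch.Tensor, keep_ratio: float = 0.25):
    """Hypothetical helper illustrating redundancy-aware token selection.

    patch_embeds: (B, T, N, D) patch embeddings for T frames with N tokens each.
    Returns indices of the kept tokens per adjacent frame pair, shaped (B, T-1, K).
    """
    # Information-density proxy (assumed): L2 distance between corresponding
    # tokens in adjacent frames; static background yields small differences.
    diff = (patch_embeds[:, 1:] - patch_embeds[:, :-1]).norm(dim=-1)  # (B, T-1, N)
    k = max(1, int(keep_ratio * diff.shape[-1]))
    # Keep only the top-k most-changing tokens; dropping the rest before the
    # encoder is where the compute and memory savings would come from.
    return diff.topk(k, dim=-1).indices                               # (B, T-1, K)

if __name__ == "__main__":
    x = torch.randn(2, 8, 196, 768)   # toy batch: 2 clips, 8 frames, 14x14 patches, ViT-B dim
    idx = select_motion_rich_tokens(x, keep_ratio=0.25)
    print(idx.shape)                  # torch.Size([2, 7, 49])
```

A frame-level analogue of the same scoring (summing token differences per frame and sampling the highest-scoring frames) would correspond to the information-intensive frame selection mentioned above.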