- The paper introduces MaskFeat, a self-supervised pre-training method that predicts HOG features of masked input regions to improve visual understanding.
- It achieves top-1 accuracies of 86.7%, 88.3%, and 80.4% on Kinetics-400, -600, and -700, respectively, without extra model weights or supervision.
- The approach keeps learning simple, using minimal augmentation and a single network, and generalizes effectively to still-image tasks such as ImageNet classification.
Overview of Masked Feature Prediction for Self-Supervised Visual Pre-Training
In the paper "Masked Feature Prediction for Self-Supervised Visual Pre-Training," the authors introduce MaskFeat, an approach to self-supervised learning for video and image models based on masked feature prediction. Designed for Transformer-style architectures, the method randomly masks portions of an input sequence and trains the model to predict features of the masked regions, with several candidate feature types evaluated as prediction targets.
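The masking-and-regression scheme can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the patch size, mask ratio, and the zero-filled "prediction" stand in for a real Transformer with a learned [MASK] embedding, and the targets here are raw pixels rather than HOG descriptors.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch):
    """Split an HxW image into non-overlapping (patch x patch) tokens."""
    h, w = image.shape
    tokens = image.reshape(h // patch, patch, w // patch, patch)
    return tokens.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_tokens, ratio):
    """Boolean mask marking the tokens to hide (True = masked)."""
    mask = np.zeros(num_tokens, dtype=bool)
    hidden = rng.choice(num_tokens, int(num_tokens * ratio), replace=False)
    mask[hidden] = True
    return mask

image = rng.standard_normal((32, 32))
tokens = patchify(image, patch=8)       # 16 tokens of 64 values each
mask = random_mask(len(tokens), ratio=0.4)

# The model receives the sequence with masked tokens replaced (here, by
# zeros standing in for a learned [MASK] embedding) and must regress the
# feature of each masked patch; the loss is taken on masked positions only.
targets = tokens[mask]                  # stand-in for per-patch target features
predictions = np.zeros_like(targets)    # stand-in for the model's output
loss = np.mean((predictions - targets) ** 2)
```

The key design point is that only the masked positions contribute to the loss, so the model must infer their content from visible context rather than copy it.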
Key Findings
- Feature Type Evaluation: The study compared five feature types and found that Histograms of Oriented Gradients (HOG) offer the best balance of accuracy and efficiency, echoing HOG's historical success in visual recognition tasks.
- Performance Results: Without relying on additional model weights or supervision, MaskFeat pre-training delivered strong results in the video domain: top-1 accuracies of 86.7%, 88.3%, and 80.4% on Kinetics-400, Kinetics-600, and Kinetics-700, respectively.
- Generalization to Image Input: When applied to still images, treated as single-frame videos, MaskFeat also achieved competitive results on ImageNet, demonstrating its adaptability across visual data types.
- Efficiency: Unlike contrastive learning methods, which often require multiple views and heavy data augmentation, MaskFeat uses a single network, a single view, and minimal augmentation, making it a notably simple and efficient approach.
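The HOG targets singled out above can be computed cheaply from image gradients alone, which is part of their appeal: no tokenizer or auxiliary network is needed. The sketch below is a simplified toy version (per-cell histograms with local L2 normalization); the paper's actual targets follow the standard HOG formulation, and the cell size and bin count here are illustrative defaults.

```python
import numpy as np

def hog_features(image, cell=8, bins=9):
    """Toy HOG: per-cell histograms of unsigned gradient orientations,
    weighted by gradient magnitude, each histogram L2-normalized."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    h, w = image.shape
    feats = []
    for i in range(0, h, cell):
        for j in range(0, w, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            hist = hist / (np.linalg.norm(hist) + 1e-6)  # local normalization
            feats.append(hist)
    return np.asarray(feats)

# A horizontal intensity ramp has purely horizontal gradients, so all
# histogram mass should fall into the first orientation bin of every cell.
img = np.tile(np.arange(16, dtype=float), (16, 1))
feats = hog_features(img, cell=8)
```

Because the descriptor depends only on local gradient statistics, it is invariant to constant brightness shifts and, thanks to the normalization, robust to contrast changes, properties the paper credits for HOG's effectiveness as a prediction target.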
Implications and Discussions
- Model Simplicity and Efficacy: MaskFeat simplifies existing methods by removing the need for discrete token prediction and external tokenizers such as those used in BEiT, yielding both computational savings and improved results.
- The Role of HOG: By leveraging hand-crafted descriptors, specifically HOG, MaskFeat might be revisiting and revitalizing these traditional methods in the context of modern deep learning paradigms.
- Potential in Video Understanding: The successful application in video domains hints at MaskFeat's promise in improving temporal tasks that handle more complex, continuous input sequences.
Theoretical and Practical Implications
Theoretically, MaskFeat blurs the line between past and present methodologies, indicating that combining hand-crafted features with modern deep learning models can yield fruitful outcomes. Practically, its applicability is broad, ranging from analyzing unlabeled video content at scale to reducing the computational cost of pre-training.
Speculation on Future Developments
Given the broad applicability and effectiveness demonstrated, MaskFeat could inspire further exploration into mixed-method pre-training schemes across AI disciplines. Future directions might integrate more sophisticated feature descriptors or exploit MaskFeat’s framework in multi-task learning environments.
Overall, this research marks a meaningful step toward efficient and scalable self-supervised learning, emphasizing simplicity without sacrificing performance. Through its use of HOG targets, MaskFeat opens avenues for bridging traditional visual descriptors with state-of-the-art learning architectures, potentially setting a precedent for future advances in visual AI.