- The paper introduces DropMAE, a masked autoencoder with spatial-attention dropout specifically designed to improve temporal correspondence learning for video tracking and segmentation tasks, addressing limitations of standard MAE video extensions.
- DropMAE achieves significantly faster pre-training while setting new state-of-the-art results on various tracking and segmentation benchmarks, including 75.9% AO on GOT-10k and 92.1% J&F on DAVIS-16.
- The authors find that leveraging motion diversity in pre-training datasets is more crucial than scene diversity for performance improvement in video tracking, offering a key insight for future dataset design.
Analyzing DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks
This paper introduces a masked autoencoder (MAE) pre-training approach tailored specifically for video-based tasks such as visual object tracking (VOT) and video object segmentation (VOS). The authors present a variant of the traditional MAE, named DropMAE, which integrates spatial-attention dropout to enhance temporal correspondence learning across video frames. The work sheds light on the limitations of existing MAE models when applied to video and demonstrates substantial performance improvements with the proposed approach.
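For context, the pre-training objective underlying both the TwinMAE baseline discussed below and DropMAE is masked reconstruction over a pair of frames sampled from the same video. The following is a minimal sketch of how such a two-frame masked input could be built; the function name, patch size, mask ratio, and purely random masking are illustrative assumptions here, not the paper's exact recipe.

```python
import torch

def sample_masked_frame_pair(frame_pair, patch_size=16, mask_ratio=0.75):
    """Build a masked two-frame input for MAE-style video pre-training.

    frame_pair: (2, C, H, W) tensor holding two frames from the same clip.
    Returns the visible patch tokens per frame and a boolean mask
    (True = masked). Patch size and mask ratio are illustrative defaults.
    """
    two, C, H, W = frame_pair.shape
    p = patch_size

    # Patchify each frame: (2, num_patches, C * p * p).
    patches = (
        frame_pair.unfold(2, p, p).unfold(3, p, p)   # (2, C, H/p, W/p, p, p)
        .permute(0, 2, 3, 1, 4, 5)
        .reshape(two, -1, C * p * p)
    )
    num_patches = patches.shape[1]
    num_keep = int(num_patches * (1 - mask_ratio))

    # Independent random masking in each frame; only the visible patches
    # are fed to the encoder, the masked ones are reconstructed by the decoder.
    keep_idx = torch.rand(two, num_patches).argsort(dim=1)[:, :num_keep]
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
    )
    mask = torch.ones(two, num_patches, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)
    return visible, mask


# Example: two 224x224 RGB frames -> 49 visible tokens per frame at 75% masking.
frames = torch.randn(2, 3, 224, 224)
visible, mask = sample_masked_frame_pair(frames)
print(visible.shape, mask.sum(dim=1))  # torch.Size([2, 49, 768]), 147 masked per frame
```

Because the reconstruction loss can often be minimized from within-frame context alone, this setup by itself does not force the model to match tokens across frames, which is precisely the limitation the paper identifies.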
Key Contributions
- Identifying Limitations in Existing Approaches: The paper identifies a significant shortcoming of a straightforward video extension of MAE (TwinMAE). TwinMAE tends to rely primarily on spatial cues within single frames, overlooking the temporal relations between frames, which leads to sub-optimal representations for dynamic video contexts that require temporal alignment and matching.
- Spatial-Attention Dropout: The proposed remedy applies spatial-attention dropout during the frame reconstruction process to counteract this co-adaptation on within-frame cues. By suppressing part of the spatial attention, the model is pushed to seek inter-frame, temporal correspondences, enriching the temporal feature learning crucial for VOT and VOS tasks (a minimal sketch follows this list).
- Efficient Pre-training Mechanism: The authors report that DropMAE pre-trains roughly twice as fast as an ImageNet-based MAE, yet yields superior fine-tuning performance on downstream tasks. This efficiency could be pivotal for scaling pre-training to large video datasets.
- Empirical Findings: DropMAE's capacity to leverage motion diversity rather than scene diversity for pre-training sets it apart from conventional strategies. Experiments show motion diversity plays a more crucial role in improving VOT and VOS performance.
- State-of-the-Art Results: Using DropMAE pre-training, the authors significantly outperform existing models, setting new state-of-the-art results on tracking and segmentation benchmarks. Notably, DropMAE proves robust across various datasets without modifications to existing architectures.
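The spatial-attention dropout idea from the second bullet can be illustrated with a short attention sketch. Note that the paper's dropout is adaptive, driven by the attention weights themselves; the version below drops a uniformly random fraction of within-frame keys purely to show the mechanism, so the function name, tensor shapes, and `drop_ratio` value are illustrative assumptions rather than the authors' implementation.

```python
import torch

def attention_with_spatial_dropout(q, k, v, frame_ids, drop_ratio=0.2):
    """Single-head attention with dropout applied to within-frame keys.

    q, k, v:    (B, N, D) token embeddings for two concatenated frames.
    frame_ids:  (N,) integer tensor marking which frame each token comes from.
    drop_ratio: fraction of same-frame (spatial) keys masked per query.

    Masking part of the spatial attention candidates forces each query to
    borrow information from the other frame, i.e. to form temporal
    correspondences during reconstruction.
    """
    B, N, D = q.shape
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5                  # (B, N, N)

    # True where the key token lies in the same frame as the query token.
    same_frame = (frame_ids[:, None] == frame_ids[None, :]).to(q.device)

    # Randomly drop a fraction of the within-frame keys (never temporal ones).
    # The paper selects keys adaptively from the attention weights; uniform
    # random dropout is used here only to illustrate the mechanism.
    drop = (torch.rand(B, N, N, device=q.device) < drop_ratio) & same_frame

    scores = scores.masked_fill(drop, float("-inf"))
    attn = scores.softmax(dim=-1)
    return attn @ v                                                # (B, N, D)


# Example: 2 x 196 tokens from a frame pair, 256-dim embeddings.
B, N, D = 4, 392, 256
q = k = v = torch.randn(B, N, D)
frame_ids = torch.arange(N) // 196        # 0 for the first frame, 1 for the second
out = attention_with_spatial_dropout(q, k, v, frame_ids)
print(out.shape)                          # torch.Size([4, 392, 256])
```

Because only spatial keys are ever masked, every query always retains access to the other frame's tokens, which is what biases the reconstruction toward temporal matching.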
Numerical Insights and Novel Claims
Noteworthy numerical results include 75.9% AO on GOT-10k and a J&F score of 92.1% on DAVIS-16, new standards on the respective benchmarks. These results indicate that DropMAE's use of temporal cues markedly enhances the model's efficiency and effectiveness compared to models pre-trained under static, image-only frameworks.
Practical and Theoretical Implications
Practically, DropMAE offers valuable insights into more efficient training protocols for video-related tasks, fostering advancements in real-time VOT and VOS applications. Theoretically, the paper contributes to understanding the fundamental requirements for model pre-training in video contexts, highlighting the importance of cross-frame temporal learning. Moreover, the revelation that motion diversity is more critical than scene diversity is likely to influence future developments in video dataset design and utilization.
Speculating Future Developments in AI
Looking forward, refining spatial-attention dropout mechanisms could further improve video processing models by encouraging a broader scope of temporal engagement. Additionally, exploring pre-training datasets focused on diverse motion scenarios will likely become a pivotal direction in developing robust tracking systems across varying video contexts.
Overall, the paper establishes DropMAE as a significant contribution to video pre-training and fine-tuning strategies, improving performance across tracking and segmentation challenges without adding modeling complexity. It serves as a solid foundation for future exploration of more efficient video learning methodologies.