Cluster and Predict Latent Patches for Improved Masked Image Modeling (2502.08769v3)

Published 12 Feb 2025 in cs.CV and cs.AI

Abstract: Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning, however existing MIM models still lag behind the state-of-the-art. In this paper, we systematically analyze target representations, loss functions, and architectures, to introduce CAPI - a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train, and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state-of-the-art, DINOv2. We release all our code and models.

Summary

  • The paper introduces CAPI, a novel framework for Masked Image Modeling that clusters and predicts latent patches using a stable online clustering loss, diverging from traditional pixel reconstruction.
  • CAPI achieves state-of-the-art results among MIM methods, demonstrating 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K, approaching the performance of DINOv2.
  • The research suggests that latent space reconstruction via robust clustering is a powerful alternative to pixel-based methods, opening avenues for scalable and stable self-supervised vision transformers.

Cluster and Predict Latent Patches for Improved Masked Image Modeling

The paper "Cluster and Predict Latent Patches for Improved Masked Image Modeling" offers an exploration into advancing Masked Image Modeling (MIM), establishing a novel framework designated as CAPI. The paper conducts an exhaustive investigation into MIM, focusing on optimizing target representations, loss functions, and network architectures to enhance the learning capacity of self-supervised models.

The principal innovation is CAPI, an acronym for Cluster and Predict Latent Patches, which employs a clustering-based loss to enable stable and scalable training. The approach departs from traditional pixel-reconstruction objectives in favor of latent-space prediction, drawing inspiration from prior work such as DINO and iBOT. CAPI differentiates itself through an explicit clustering step, rather than the implicit clustering performed by components such as the MLP head in DINO, which yields greater transparency and stability during training.
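To make the latent-prediction objective concrete, the sketch below shows a cross-entropy loss between the student's predicted cluster distribution for a masked patch and a teacher-provided soft cluster assignment. This is a minimal, hypothetical NumPy illustration of the general idea; the function name, shapes, and details are assumptions, not the paper's implementation.

```python
import numpy as np

def latent_cluster_loss(student_logits, teacher_assignments):
    """Cross-entropy between the student's predicted cluster
    distribution for each masked patch and the teacher's soft
    cluster assignment for that patch.

    student_logits:      (n_patches, n_clusters) raw scores
    teacher_assignments: (n_patches, n_clusters) rows sum to 1

    Hypothetical sketch of a latent-prediction objective; not
    the authors' code.
    """
    # Numerically stable log-softmax over the cluster dimension.
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Average cross-entropy over the masked patches.
    return -(teacher_assignments * log_p).sum(axis=1).mean()
```

When the student is confident and agrees with a (near) one-hot teacher assignment, the loss approaches zero; for uniform predictions it is log of the number of clusters.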

Key Contributions and Methodological Insights

  1. Clustering and Prediction Framework: The proposed method uses a teacher-student framework based on self-distillation. The teacher encodes the full image to obtain patch-token representations, which are clustered online to form the training targets. The student, in contrast, receives a partially masked image and predicts the cluster assignments of the missing patches.
  2. Stable Clustering Loss: CAPI performs online latent clustering with the Sinkhorn-Knopp algorithm, which prevents empty clusters by enforcing a near-uniform distribution of tokens across clusters. The resulting loss trains stably without the delicate hyperparameter tuning often required by contrastive or auxiliary losses.
  3. Efficiency Through Cross-Attention: A significant architectural refinement is a cross-attention predictor, which improves computational efficiency by processing fewer tokens and predicts each masked patch independently, avoiding repeated forward passes.
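The Sinkhorn-Knopp normalization behind point 2 can be sketched as follows, in the spirit of SwAV-style implementations: alternately normalizing rows and columns of a token-by-cluster score matrix so that every token gets a valid assignment distribution while no cluster collapses to empty. This is an illustrative sketch under assumed shapes and default values, not the authors' code.

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Turn a (n_tokens x n_clusters) score matrix into soft
    cluster assignments: each row sums to 1 (a per-token
    distribution over clusters) and column masses stay near
    uniform, so no cluster is left empty.

    Hypothetical sketch; hyperparameters are assumptions.
    """
    # Temperature-scaled similarities (max subtracted for stability).
    Q = np.exp((scores - scores.max()) / eps)
    Q /= Q.sum()  # normalize total mass to 1
    B, K = Q.shape
    for _ in range(n_iters):
        # Normalize columns: each cluster receives mass 1/K.
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= K
        # Normalize rows: each token carries mass 1/B.
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= B
    return Q * B  # rescale so each row sums to 1
```

The column-normalization step is what enforces the near-uniform use of clusters described above; without it, a trivial solution assigning every token to one cluster would minimize the loss.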

Empirical Results and Performance Metrics

CAPI's efficacy is demonstrated through extensive experimental validation, in which it surpasses previous masked image modeling methods with 83.8% linear-probe accuracy on ImageNet and 32.1% mIoU on ADE20K. Notably, CAPI's performance closely approaches that of DINOv2, representing significant progress in the domain. Ablation studies confirm that both the predictor architecture and the loss formulation contribute to CAPI's performance.

Implications and Future Directions

This work has promising implications for the future of self-supervised learning, especially in computer vision. By reconstructing in latent space and leveraging a robust clustering mechanism, CAPI demonstrates a powerful alternative to pixel-based reconstruction in MIM. This line of inquiry could lead to a new generation of efficient, scalable vision transformers capable of robustly handling diverse datasets beyond conventional object-centric ones.

Potential future work could focus on further scaling of the CAPI model, optimizing its computational footprint, and exploring its adaptability to other vision tasks beyond segmentation and classification. Additionally, expanding the framework to handle even larger datasets without diminishing returns could solidify CAPI's place as a cornerstone technique in self-supervised representation learning.

In conclusion, the paper reshapes masked image modeling by integrating clustering into the latent representation prediction process, setting a new benchmark for the field and opening avenues for further exploration into scalable and stable self-supervised learning paradigms.