Finding effective configurations for alternative masked image modeling designs

Identify hyperparameter and architectural configurations that consistently improve downstream performance for the alternative masked image modeling approaches the authors explored, including but not limited to multi-block masking, hybrid masking ratios, hybrid masking granularity, Koleo loss applied to class tokens, decoder cross-attention, reconstruction losses on visible patches, reconstruction of only a subset of masked patches, and feeding multi-stage encoder features to the decoder.
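
As a purely illustrative aid, the sketch below enumerates a small configuration grid over the design families named above. Every key, value, and helper function here is an assumption chosen for illustration, not a setting reported in the paper.

```python
# A minimal sketch (assumed names and values, not from the paper) of a
# configuration grid one might sweep when searching for effective settings
# for the alternative masked image modeling designs listed above.
from itertools import product

search_space = {
    # Multi-block masking: number of rectangular blocks and their scale range.
    "mask_strategy": ["random_patch", "multi_block"],
    "num_blocks": [1, 4, 8],
    # Hybrid masking ratio: sample a ratio per image from an interval.
    "mask_ratio_range": [(0.6, 0.6), (0.5, 0.8)],
    # Hybrid granularity: fine patch-level vs. coarser block-level mask units.
    "mask_unit_size": [16, 32],
    # Decoder variants.
    "decoder_cross_attention": [False, True],
    "decoder_multi_stage_features": [False, True],
    # Loss variants.
    "koleo_weight_on_cls_token": [0.0, 0.1],
    "loss_weight_on_visible_patches": [0.0, 0.5],
    "masked_patch_subset_ratio": [1.0, 0.5],
}

def iter_configs(space):
    """Yield every combination in the (hypothetical) search space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

if __name__ == "__main__":
    print(sum(1 for _ in iter_configs(search_space)), "candidate configurations")
```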

Background

The authors detail multiple alternative design directions they attempted—such as multi-block masking and outpainting/inpainting styles, hybrid masking ratios and granularity, Koleo loss on class tokens, decoder cross-attention, predicting both masked and visible patches, reconstructing only a subset of masked patches, and fusing multi-stage encoder features into the decoder. Across extensive experiments, these alternatives did not yield clear or consistent improvements.
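
For readers unfamiliar with these mask designs, the following minimal sketch shows one plausible way multi-block masking with a hybrid (per-image sampled) masking ratio could be implemented over a ViT patch grid. The function name, block-size heuristics, and defaults are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: multi-block masking with a per-image masking
# ratio sampled from a range, applied to a ViT patch grid.
import random

def multi_block_mask(grid_h, grid_w, ratio_range=(0.5, 0.8), max_blocks=8):
    """Return a grid_h x grid_w boolean mask (True = masked patch)."""
    target_ratio = random.uniform(*ratio_range)        # hybrid ratio per image
    target = int(target_ratio * grid_h * grid_w)
    mask = [[False] * grid_w for _ in range(grid_h)]
    masked = 0
    for _ in range(max_blocks):
        if masked >= target:
            break
        # Sample a rectangular block spanning roughly a quarter to half
        # of each grid dimension (an assumed heuristic).
        bh = random.randint(max(1, grid_h // 4), max(1, grid_h // 2))
        bw = random.randint(max(1, grid_w // 4), max(1, grid_w // 2))
        top = random.randint(0, grid_h - bh)
        left = random.randint(0, grid_w - bw)
        for i in range(top, top + bh):
            for j in range(left, left + bw):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask

if __name__ == "__main__":
    m = multi_block_mask(14, 14)   # 14x14 grid for a 224px image with 16px patches
    print(sum(map(sum, m)), "of", 14 * 14, "patches masked")
```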

They explicitly state that, while these approaches may be viable, they were unable to identify configurations that consistently enhance performance, leaving open the task of discovering effective settings for these design families in masked image modeling.

References

While some above explored alternatives may indeed be viable, we were unable to identify optimal configurations that consistently improved performance.

In Pursuit of Pixel Supervision for Visual Pre-training (2512.15715 - Yang et al., 17 Dec 2025) in Supplementary, Section: Failure Attempts, Limitations, and Future Directions – Subsection: Failure Attempts