Decoder capacity sufficiency in masked autoencoders

Determine whether the masked autoencoder (MAE) decoder has sufficient capacity for pixel regression, assess whether insufficient decoder capacity forces the encoder's later blocks to prioritize low-level detail modeling at the expense of high-level semantic representation quality, and empirically validate whether increasing decoder depth mitigates this issue.

Background

In the MAE Redesign section, the authors observe that the best generic features in MAE do not reside in the final encoder block, suggesting potential role misallocation between encoder and decoder. Specifically, they note that with a shallow decoder, the encoder may be forced to handle low-level pixel regression, which can degrade high-level semantic representations in later encoder blocks.
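One direct way to check where the best generic features reside is to linearly probe the output of every encoder block. The sketch below is a minimal, hedged illustration of that idea, assuming a generic ViT-style encoder that exposes a `patch_embed` module and an iterable `blocks` attribute; these names and the mean-pooling choice are illustrative and not the paper's evaluation protocol.

```python
import torch

@torch.no_grad()
def per_block_features(encoder, images):
    """Collect mean-pooled token features after every encoder block.

    Assumes a ViT-style encoder with `patch_embed` and `blocks`
    attributes; both names are placeholders for illustration.
    """
    x = encoder.patch_embed(images)      # (B, N, D) patch tokens
    feats = []
    for blk in encoder.blocks:
        x = blk(x)
        feats.append(x.mean(dim=1))      # (B, D) pooled feature per block
    return feats                         # list of length = encoder depth

# Each element of `feats` can then be scored with a frozen linear probe
# (e.g. logistic regression on a labeled downstream set) to locate the
# block with the strongest semantic representation.
```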

Motivated by this observation, they hypothesize that the decoder lacks sufficient capacity for the pixel-reconstruction objective. They propose increasing decoder depth as a remedy and report empirical gains, but they explicitly frame the underlying cause as a conjecture rather than an established fact.
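As a concrete illustration of the proposed knob, the sketch below builds an MAE-style pixel-regression decoder whose depth is a constructor argument. This is a generic transformer stack, not the paper's exact block design; all dimensions, depths, and the patch size are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    """Transformer decoder mapping the full token sequence (visible +
    mask tokens, positions already added by the caller) back to pixel
    patches. `depth` is the capacity knob under study; every
    hyperparameter here is illustrative."""

    def __init__(self, embed_dim=512, depth=8, num_heads=16,
                 patch_size=16, in_chans=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        # Regress raw pixel values for each patch.
        self.pred = nn.Linear(embed_dim, patch_size ** 2 * in_chans)

    def forward(self, tokens):
        # tokens: (B, N, D) full token sequence.
        x = self.blocks(tokens)
        x = self.norm(x)
        return self.pred(x)  # (B, N, patch_size**2 * in_chans)

# Comparing, say, depth=2 vs. depth=8 with the encoder held fixed is one
# way to test whether added decoder capacity relieves the encoder's
# later blocks of low-level pixel regression.
```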

References

"We conjecture that the decoder lacks sufficient capacity for pixel regression."

In Pursuit of Pixel Supervision for Visual Pre-training (arXiv:2512.15715, Yang et al., 17 Dec 2025), Section 3.2, "MAE Redesign – Deeper decoder"