Capacity and generalization of smaller models with Scaling on Scales (S2)

Establish whether smaller vision models augmented with Scaling on Scales (S2)—a technique that runs a pre-trained and frozen backbone on multiple image scales—possess capacity at least comparable to larger models, and determine whether pre-training such smaller models with S2 enables them to achieve generalization performance comparable to or exceeding that of larger models.

Background

The paper proposes Scaling on Scales (S2), which constructs multi-scale representations by running a pre-trained and frozen vision backbone over multiple image scales and aggregating features. Across several tasks (classification, segmentation, depth estimation, MLLMs, and robotics), S2 applied to smaller backbones often matches or outperforms much larger models at similar computational budgets.
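Below is a minimal sketch of the multi-scale feature extraction that S2 describes, assuming a generic frozen `backbone` that maps a batch of images at its native resolution to a token grid of shape (B, N, C). The function name, the averaging over sub-images, and the default sizes are illustrative simplifications, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def s2_features(images, backbone, base_size=224, scales=(1, 2)):
    """Run a frozen backbone on multiple image scales and concatenate features."""
    feats = []
    for s in scales:
        # Resize the image to s times the backbone's native resolution.
        x = F.interpolate(images, size=(base_size * s, base_size * s),
                          mode="bilinear", align_corners=False)
        # Split the up-scaled image into s*s sub-images at the native size,
        # so the frozen backbone always sees its pre-training resolution.
        subs = [x[:, :, i * base_size:(i + 1) * base_size,
                      j * base_size:(j + 1) * base_size]
                for i in range(s) for j in range(s)]
        with torch.no_grad():  # the backbone stays frozen
            sub_feats = [backbone(sub) for sub in subs]  # each (B, N, C)
        # Pool the sub-image features back to the base token count; simple
        # averaging here stands in for the spatial merge-and-pool step.
        feats.append(torch.stack(sub_feats, dim=0).mean(dim=0))
    # Concatenate across scales along the channel dimension: (B, N, C * len(scales)).
    return torch.cat(feats, dim=-1)
```

The key design point is that only the input is scaled; the backbone itself is never resized or fine-tuned, so the extra representational power comes purely from processing the image at multiple resolutions.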

Motivated by empirical findings that features of larger models can be well approximated via a linear transform of multi-scale features from smaller models, the authors conjecture that smaller S2-augmented models may have similar capacity to larger models and may match or surpass their generalization if pre-trained with S2. This raises a formal question about capacity and generalization parity between S2-augmented smaller models and larger models.
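A rough sketch of the reconstruction test underlying this conjecture is shown below, assuming precomputed feature matrices: `small_ms` holding multi-scale features from the smaller S2-augmented model and `large` holding features from the larger model, both of shape (num_samples, dim). The variable names and the residual metric are illustrative choices, not taken from the paper.

```python
import numpy as np

def linear_reconstruction_error(small_ms, large):
    """Fit large-model features as a linear transform of multi-scale
    small-model features; a lower residual means a better approximation."""
    # Least-squares solve for W such that small_ms @ W ~= large.
    W, *_ = np.linalg.lstsq(small_ms, large, rcond=None)
    pred = small_ms @ W
    # Fraction of variance in the large-model features left unexplained.
    return (np.linalg.norm(large - pred) ** 2
            / np.linalg.norm(large - large.mean(axis=0)) ** 2)
```

If this residual is small, most of what the larger model represents is already recoverable from the smaller model's multi-scale features, which is the empirical basis for the capacity conjecture.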

References

Given that most of the representation larger models have learned is also learned by multi-scale smaller models, we conjecture smaller models with S$^2$ scaling have at least similar capacity as larger models. Since larger capacity allows memorizing more rare and atypical instances during pre-training when given sufficient data and thus improves generalization error, we further speculate smaller models can achieve similar or even better generalizability than larger models if pre-trained with S$^2$ scaling as well.

When Do We Not Need Larger Vision Models? (2403.13043 - Shi et al., 19 Mar 2024) in Section 3.3 (Pre-Training With S2 Makes Smaller Models Better)