Identify the key to genuine zero-shot generalization under large-scale training
Ascertain the mechanisms or training strategies that would enable multimodal models trained at large scale (e.g., CLIP and Stable Diffusion) to achieve true zero-shot generalization without requiring exponentially more pretraining examples of each concept, thereby overcoming the observed log-linear relationship between concept frequency and downstream performance and the sample inefficiency it implies.
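To make the inefficiency concrete, here is a hedged sketch: the functional form below is assumed for illustration, following the paper's reported log-linear trend rather than its exact fitted model. If zero-shot performance $P$ grows log-linearly in a concept's pretraining frequency $f$,

$$P(f) = a + b \log f,$$

then raising performance by a fixed increment $\Delta$ requires a multiplicative increase in data: solving $P(f') - P(f) = \Delta$ gives $f' = f \, e^{\Delta / b}$. Linear performance gains thus demand exponentially more pretraining examples per concept, which is precisely the sample inefficiency this open problem asks how to overcome.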
References
We conclude that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
— No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
(arXiv:2404.04125, Udandarao et al., 4 Apr 2024), Section: Conclusions and Open Problems