Assess the coverage of downstream evaluation concepts within multimodal pretraining datasets
Determine the extent to which the large-scale multimodal pretraining datasets behind models such as CLIP and Stable Diffusion already contain the concepts targeted during so-called "zero-shot" evaluation on classification, retrieval, and text-to-image generation tasks, in order to assess how meaningful reported "zero-shot" generalization claims actually are.
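For concreteness, a minimal sketch of one way to probe this coverage at the caption level: count how many pretraining captions mention each downstream class name. The function name, inputs, and exact word-boundary matching below are illustrative assumptions, not the paper's measurement pipeline; a faithful analysis would also need lemmatization, synonym handling, and checks on the image side.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

def concept_frequency(captions: Iterable[str], concepts: List[str]) -> Counter:
    """Count how many captions mention each downstream concept.

    Word-boundary matching on lowercased text; each caption contributes
    at most one count per concept. Hypothetical helper, for illustration.
    """
    patterns: Dict[str, re.Pattern] = {
        c: re.compile(r"\b" + re.escape(c.lower()) + r"\b") for c in concepts
    }
    counts: Counter = Counter()
    for caption in captions:
        text = caption.lower()
        for concept, pattern in patterns.items():
            if pattern.search(text):
                counts[concept] += 1
    return counts

# Toy example: pretraining captions vs. downstream class names.
captions = [
    "a photo of a golden retriever in the park",
    "vintage axolotl illustration",
    "a dog playing fetch",
]
print(concept_frequency(captions, ["golden retriever", "axolotl", "night snake"]))
# Counter({'golden retriever': 1, 'axolotl': 1})
```

Relating such per-concept counts to per-concept "zero-shot" accuracy is then the crux of the question: concepts with near-zero pretraining frequency are the only ones for which the evaluation is genuinely zero-shot.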
References
However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during "zero-shot" evaluation.
— No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
(Udandarao et al., arXiv:2404.04125, 4 Apr 2024), Abstract