Assess the coverage of downstream evaluation concepts within multimodal pretraining datasets
Determine the extent to which the large-scale multimodal pretraining datasets behind models such as CLIP and Stable Diffusion already contain the concepts targeted during so-called "zero-shot" evaluation on classification, retrieval, and text-to-image generation tasks, in order to assess how meaningful reported "zero-shot" generalization claims actually are.
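For concreteness, a minimal sketch of one way to probe this coverage at the caption level: count how many pretraining captions mention each downstream class name. The function name, inputs, and exact word-boundary matching below are illustrative assumptions, not the paper's measurement pipeline; a faithful analysis would also need lemmatization, synonym handling, and checks on the image side.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

def concept_frequency(captions: Iterable[str], concepts: List[str]) -> Counter:
    """Count how many captions mention each downstream concept.

    Word-boundary matching on lowercased text; each caption contributes
    at most one count per concept. Hypothetical helper, for illustration.
    """
    patterns: Dict[str, re.Pattern] = {
        c: re.compile(r"\b" + re.escape(c.lower()) + r"\b") for c in concepts
    }
    counts: Counter = Counter()
    for caption in captions:
        text = caption.lower()
        for concept, pattern in patterns.items():
            if pattern.search(text):
                counts[concept] += 1
    return counts

# Toy example: pretraining captions vs. downstream class names.
captions = [
    "a photo of a golden retriever in the park",
    "vintage axolotl illustration",
    "a dog playing fetch",
]
print(concept_frequency(captions, ["golden retriever", "axolotl", "night snake"]))
# Counter({'golden retriever': 1, 'axolotl': 1})
```

Relating such per-concept counts to per-concept "zero-shot" accuracy is then the crux of the question: concepts with near-zero pretraining frequency are the only ones for which the evaluation is genuinely zero-shot.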
References
However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during "zero-shot" evaluation.
— No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
(Udandarao et al., arXiv:2404.04125, 4 Apr 2024), Abstract