Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning (2402.02055v1)
Abstract: In recent years, data selection has emerged as a core issue for large-scale visual-LLM pretraining, especially on noisy web-curated datasets. One widely adopted strategy assigns quality scores such as CLIP similarity for each sample and retains the data pairs with the highest scores. However, these approaches are agnostic of data distribution and always fail to select the most informative samples. To solve this problem, we propose a simple yet theoretically principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i\rangle$. Here, $\Sigma_{\text{test}}$ represents the target (cross-)covariance matrix we aim to align, potentially based on prior knowledge, while $\Sigma_i$ denotes the tensor product of single or multi-modal representations for the $i$-th sample. We further design a new data selection method that maximizes the total VAS. We provide theoretical analysis in a simplified setting to demonstrate the theoretical advantage of VAS over random or other existing data selection. Experimentally, applying VAS and CLIP scores together can outperform baselines by a margin of $1.3\%$ average on 38 evaluation sets for noisy dataset DataComp and $2.5\%$ on VTAB for high-quality dataset CC12M. Additionally, our ablation study also shows visual features are better than text for calculating VAS, and the related classical experimental design methods may fail under this context.
- Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Winogavil: Gamified association benchmark to challenge vision-and-language models. Advances in Neural Information Processing Systems, 35:26549–26564, 2022.
- Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558–3568, 2021.
- Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829, 2023.
- Selection via proxy: Efficient data selection for deep learning. arXiv preprint arXiv:1906.11829, 2019.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
- Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
- Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- Matrix computations. JHU press, 2013.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349, 2021a.
- Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271, 2021b.
- The power of contrast for feature learning: A theoretical analysis. Journal of Machine Learning Research, 24(330):1–78, 2023.
- Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547, 2019.
- Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
- Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637–5664. PMLR, 2021.
- D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931, 2023.
- Sieve: Multimodal dataset pruning using image captioning models. arXiv preprint arXiv:2310.02110, 2023.
- T-mars: Improving visual representations by circumventing text feature learning. arXiv preprint arXiv:2307.03132, 2023.
- Understanding multimodal contrastive learning and incorporating unpaired data. In International Conference on Artificial Intelligence and Statistics, pp. 4348–4380. PMLR, 2023.
- Improving multimodal datasets with image captioning. arXiv preprint arXiv:2307.10350, 2023.
- Pukelsheim, F. Optimal design of experiments. SIAM, 2006.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
- A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538, 2016.
- Do ImageNet classifiers generalize to ImageNet? In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5389–5400. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/recht19a.html.
- Extending the wilds benchmark for unsupervised adaptation. arXiv preprint arXiv:2112.05090, 2021.
- Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
- Too large; data reduction for vision-language pre-training. arXiv preprint arXiv:2305.20087, 2023.
- Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019.
- Submodularity in data subset selection and active learning. In International conference on machine learning, pp. 1954–1963. PMLR, 2015.
- Cit: Curation in training for effective vision-language data. arXiv preprint arXiv:2301.02241, 2023.
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. doi: 10.1162/tacl˙a˙00166. URL https://aclanthology.org/Q14-1006.
- A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.