
Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity (2403.12267v2)

Published 18 Mar 2024 in cs.CV and cs.LG

Abstract: Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by our method achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip.
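To make the selection criterion concrete, the sketch below illustrates the core idea stated in the abstract: choose a subset of image-caption pairs whose cross-covariance (in some embedding space) stays close to that of the full dataset. This is a toy illustration under stated assumptions, not the paper's actual algorithm; the function names, the greedy loop, and the Frobenius-norm objective are all hypothetical, and the released code at the repository above should be consulted for the real method.

```python
import numpy as np

def cross_covariance(img_emb, txt_emb):
    """Cross-covariance between centered image and caption embeddings (d x d matrix)."""
    img_c = img_emb - img_emb.mean(axis=0, keepdims=True)
    txt_c = txt_emb - txt_emb.mean(axis=0, keepdims=True)
    return img_c.T @ txt_c / img_emb.shape[0]

def select_subset(img_emb, txt_emb, k):
    """Greedily pick k pairs whose cross-covariance best matches the full data's,
    measured in Frobenius norm. A brute-force stand-in for the paper's
    theoretically grounded selection objective."""
    target = cross_covariance(img_emb, txt_emb)
    n = img_emb.shape[0]
    selected = []
    for _ in range(k):
        best_i, best_err = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            err = np.linalg.norm(cross_covariance(img_emb[idx], txt_emb[idx]) - target)
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
    return selected

# Usage with random stand-in embeddings (in practice these would come from
# pretrained image and text encoders applied to the full dataset):
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(200, 32))
txt_emb = img_emb + 0.1 * rng.normal(size=(200, 32))
subset = select_subset(img_emb, txt_emb, k=20)
```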

