CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy (2306.15658v1)

Published 27 Jun 2023 in cs.CV

Abstract: The recent work CLIPA presents an inverse scaling law for CLIP training -- whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding enables us to train high-performance CLIP models with significantly reduced computations. Building upon this work, we hereby present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting -- by only allocating a budget of $10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% and meanwhile reducing the computational cost by ~39X. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.

A Technical Analysis of CLIPA-v2: Cost-Effective Scaling in CLIP Training

In the field of vision-language models, CLIP (Contrastive Language–Image Pre-training) has served as a transformative approach, bridging textual and visual data. The paper "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy" introduces CLIPA-v2, a method for scaling CLIP training efficiently, achieving superior performance while keeping computational cost low.
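
As background, the sketch below illustrates the symmetric contrastive objective that CLIP-style models optimize: matched image-text pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. This is a minimal PyTorch illustration assuming generic encoders that already produce fixed-size embeddings; it is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) outputs of the two encoders.
    """
    # Normalize so the dot product becomes a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```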

Main Contributions

The paper presents two central contributions:

  1. Inverse Scaling Law in Finetuning: The authors show that the inverse scaling law identified in the original CLIPA work also holds during finetuning, so shortened input token sequences can be used at that stage as well, further reducing computational cost without compromising performance (a schematic token-reduction sketch follows this list).
  2. Scaling in Model, Data, and Training Schedule: CLIPA-v2 scales CLIP training across model sizes, datasets, and training schedules, reaching the H/14 model with roughly 13B image-text pairs seen during training. These experiments demonstrate that even large-scale models can be trained effectively within limited budgets by exploiting the inverse scaling principle.
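
To make the token-reduction idea behind both contributions concrete, the sketch below shows two illustrative ways the image token sequence of a ViT-style encoder can be shortened: lowering the input resolution (fewer patches) and randomly masking patch tokens. The helper names, resolutions, and masking ratio are assumptions for illustration rather than the paper's exact settings; the 14-pixel patch size matches the H/14 architecture.

```python
import torch

# A ViT-style encoder sees (resolution / patch_size) ** 2 patch tokens,
# so lowering the input resolution shrinks the sequence quadratically.
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    return (resolution // patch_size) ** 2

for res in (224, 112, 84):
    print(f"{res}px -> {num_image_tokens(res)} tokens")  # 256, 64, 36

# Randomly masking patch tokens is another way to shorten the sequence
# that reaches the transformer blocks (the 0.5 keep ratio is illustrative).
def mask_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (batch, seq_len, dim); keep a random subset per example."""
    batch, seq_len, dim = tokens.shape
    keep = max(1, int(seq_len * keep_ratio))
    # Random permutation of token indices, truncated to the kept length.
    idx = torch.rand(batch, seq_len, device=tokens.device).argsort(dim=1)[:, :keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, dim))
```

CLIPA-v2's key observation is that such shortened inputs remain effective not only during pre-training but also during the finetuning stage, which is what enables the further reduction in compute.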

Empirical Results

The prominent empirical results in the paper include:

  • A zero-shot ImageNet accuracy of 81.1% achieved within a $10,000 budget, surpassing the previous best CLIP model (OpenCLIP, 80.1%) by 1.0% while reducing computational cost by roughly 39x (the evaluation protocol is sketched after this list).
  • With an additional $4,000 of training, accuracy rises further to 81.8%, setting a new state of the art for zero-shot ImageNet performance under such a constrained budget.
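
For context on how these numbers are obtained, zero-shot ImageNet evaluation embeds a text prompt for each of the 1,000 class names and assigns every image to the class whose prompt embedding is most similar. The sketch below is a minimal illustration of that protocol; it assumes a CLIP-style model exposing encode_image/encode_text and a matching tokenizer, and the single prompt template and helper names are placeholders rather than the paper's exact evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, images: torch.Tensor,
                       class_prompts: list[str], tokenizer) -> torch.Tensor:
    """Zero-shot classification with a CLIP-style model.

    Assumes `model` exposes encode_image / encode_text and `tokenizer`
    maps a list of strings to a tensor of token ids.
    Returns predicted class indices for `images` of shape (batch, 3, H, W).
    """
    # One text embedding per class, e.g. "a photo of a golden retriever".
    text_tokens = tokenizer(class_prompts)
    text_emb = F.normalize(model.encode_text(text_tokens), dim=-1)

    # Embed the images and compare against every class embedding.
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    logits = img_emb @ text_emb.t()          # (batch, num_classes)
    return logits.argmax(dim=-1)

# Usage sketch (names are illustrative):
# prompts = [f"a photo of a {name}" for name in imagenet_class_names]
# preds = zero_shot_classify(model, image_batch, prompts, tokenizer)
# accuracy = (preds == labels).float().mean().item()
```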

Implications and Future Directions

Practically, the findings offer a clear pathway for researchers and institutions with limited resources to engage in large-scale pretraining of vision-language models. Theoretically, the paper advances the understanding of model scaling dynamics, showing that larger models can be trained effectively with fewer input tokens as a consequence of the inverse scaling law.

The work invites several avenues for future research. Investigating the full potential of inverse scaling laws across various model architectures could provide universally applicable guidelines for efficient model training. Moreover, fine-grained analysis of the trade-off between model size, data diversity, and token reduction strategies might yield more insights into optimal configurations for specific applications or tasks.

Conclusion

The paper on CLIPA-v2 substantially contributes to the ongoing discourse on efficient AI model training by demonstrating that high-performing vision-language models can be trained within tight budgetary constraints through careful scaling strategies. As AI research continues to emphasize efficiency and accessibility, such methodologies will likely play an increasingly important role in broadening participation in large-scale model training.

References (28)
  1. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
  2. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019.
  3. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
  4. Can foundation models perform zero-shot task specification for robot manipulation? arXiv preprint arXiv:2204.11134, 2022.
  5. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  6. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
  7. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  8. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
  9. Natural adversarial examples. In CVPR, 2021.
  10. OpenCLIP, July 2021.
  11. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
  12. An inverse scaling law for CLIP training. arXiv preprint arXiv:2305.07017, 2023.
  13. Scaling language-image pre-training via masking. In CVPR, 2023.
  14. Microsoft COCO: Common objects in context. In ECCV, 2014.
  15. OpenAI. GPT-4 technical report. 2023.
  16. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
  17. Learning transferable visual models from natural language supervision. In ICML, 2021.
  18. Zero-shot text-to-image generation. In ICML, 2021.
  19. Do ImageNet classifiers generalize to ImageNet? In ICML, 2019.
  20. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  21. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
  22. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  23. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
  24. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
  25. CiT: Curation in training for effective vision-language data. arXiv preprint arXiv:2301.02241, 2023.
  26. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  27. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  28. Multimodal C4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.
Authors (3)
  1. Xianhang Li
  2. Zeyu Wang
  3. Cihang Xie