An Inverse Scaling Law for CLIP Training (2305.07017v2)

Published 11 May 2023 in cs.CV

Abstract: CLIP, one of the pioneering foundation models that connect images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to successfully train CLIP even with limited computational resources. For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling up -- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot accuracy, and meanwhile accelerate the training by ~33x compared to its OpenCLIP counterpart. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.

Overview of "An Inverse Scaling Law for CLIP Training"

The paper "An Inverse Scaling Law for CLIP Training" presents an intriguing finding in the domain of Contrastive Language–Image Pre-training (CLIP) by introducing an inverse scaling law. This paper is significant in the ongoing discourse about the computational demands of training large-scale models, offering potential pathways to mitigate resource constraints without significantly compromising performance.

CLIP has revolutionized the interaction between images and text, enabling advancements in zero-shot learning paradigms. However, the extensive computational requirements associated with such models have been a barrier to broader research endeavors. The investigation into the inverse scaling law provides insights into optimizing training processes to reduce these demands.

Key Findings

  1. Inverse Scaling Law: Larger image/text encoders allow CLIP to be trained with shorter image/text token sequences at minimal cost to performance. This runs counter to the usual scaling intuition, in which larger models are expected to demand longer inputs and more extensive resources.
  2. Token Reduction Strategies: Comprehensive experiments explore different ways of shortening image and text token sequences. Strategies that preserve semantic information, such as image resizing for the image tower and syntax-aware masking for text, yield the best scaling behavior (a minimal illustration follows this list).
  3. Improvements in Training Efficiency: The proposed CLIPA framework leverages the inverse scaling law to train CLIP efficiently even on modest hardware, such as a single node with eight A100 GPUs, reaching notable performance benchmarks in significantly reduced time and easing the resource burden of this line of research.
  4. Significant Results: The CLIPA framework achieves a zero-shot top-1 ImageNet-1k accuracy of 69.3% using eight A100 GPUs over just four days—demonstrating substantial efficiency compared to training regimes that demand hundreds of GPUs over extended periods.
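
The practical lever behind findings 1 and 2 is that attention cost grows with sequence length, so feeding fewer image/text tokens into larger towers saves compute. Below is a minimal, illustrative sketch (not the authors' CLIPA code) of two of the simplest reduction strategies, image resizing and text truncation, assuming a ViT-style image tower with 14-pixel patches and a CLIP-style 77-token text context; all names and constants here are hypothetical stand-ins.

```python
# Minimal sketch of shortening CLIP input token sequences.
# Assumes a ViT-style patch embedding (PATCH=14) and tokenized text batches;
# this is an illustration of the idea, not the paper's implementation.
import torch
import torch.nn.functional as F

PATCH = 14  # hypothetical ViT patch size

def resize_images(images: torch.Tensor, target_res: int) -> torch.Tensor:
    """Resize images so the patchified sequence becomes shorter.

    At PATCH=14, a 224px image yields 16x16 = 256 patch tokens,
    while a 112px image yields 8x8 = 64 tokens (~4x fewer).
    """
    return F.interpolate(images, size=(target_res, target_res),
                         mode="bilinear", align_corners=False)

def truncate_text(token_ids: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep only the first `keep` text tokens.

    The paper reports that semantics-preserving masking (e.g. syntax-aware
    masking) scales best; plain truncation here is just the simplest stand-in.
    """
    return token_ids[:, :keep]

if __name__ == "__main__":
    imgs = torch.randn(8, 3, 224, 224)            # dummy image batch
    small = resize_images(imgs, 112)              # 224px -> 112px
    print((224 // PATCH) ** 2, (112 // PATCH) ** 2)  # 256 vs 64 image tokens

    text = torch.randint(0, 49408, (8, 77))       # dummy 77-token text batch
    short = truncate_text(text, 16)               # keep 16 text tokens
    print(text.shape, short.shape)
```

Because both towers process far shorter sequences per sample, each training step is cheaper, which is what lets larger encoders be trained on a small GPU budget.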

Implications and Future Directions

The findings in this paper not only suggest practical strategies for making foundation-model training more accessible and efficient, but also prompt a re-evaluation of resource allocation in AI research. The inverse scaling law shows that larger models can reach competitive performance with significantly fewer computational resources, helping democratize research and development in this space.

The ability to reduce necessary input token lengths without performance degradation opens avenues for further exploration of adaptive training methodologies. Future research could explore the boundary conditions of this scaling law, extend these findings to other foundation models, or investigate hybrid strategies combining the benefits of various token reduction and resizing techniques.

In a rapidly evolving landscape where computational constraints often limit research, the insights and methods introduced here could catalyze broader participation and innovation. This work potentially paves the way for more sustainable and inclusive advancements in AI.

Authors (3)
  1. Xianhang Li
  2. Zeyu Wang
  3. Cihang Xie
Citations (43)