
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (2404.15653v1)

Published 24 Apr 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at \url{https://github.com/apple/corenet}.

CatLIP: Achieving CLIP-Level Accuracy with Significant Training Speed Improvements

Introduction

The paper introduces CatLIP, an approach designed to speed up the pre-training of vision models on web-scale image-text data while matching the accuracy of CLIP. The method recasts pre-training as a classification task rather than the traditional contrastive learning objective, sidestepping the computational demands of the latter. This shift preserves representation quality while delivering a substantial reduction in training time.

Methodology

Contrastive versus Categorical Learning

CLIP aligns image and text embeddings with a contrastive loss, which requires extensive pairwise similarity computations across each batch, a notable computational burden. CatLIP instead extracts labels from the text captions (specifically, nouns mapped to WordNet synsets) and treats pre-training as a multi-label classification task with binary cross-entropy loss. This reframing removes the need for pairwise comparisons and substantially reduces the computational overhead.
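As a concrete illustration of the label-extraction step, the snippet below pulls nouns from a caption with NLTK and maps each to a WordNet noun synset. This is a minimal sketch of the idea, not the paper's code: the actual vocabulary construction (e.g., any frequency-based pruning) follows the corenet implementation, and the function name caption_to_synsets is ours.

```python
# Minimal sketch: nouns in a caption -> WordNet synset labels (assumed NLTK pipeline).
import nltk
from nltk.corpus import wordnet as wn

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

def caption_to_synsets(caption: str) -> set[str]:
    """Extract nouns from a caption and map each to a WordNet noun synset name."""
    tokens = nltk.word_tokenize(caption.lower())
    tagged = nltk.pos_tag(tokens)
    synsets = set()
    for word, tag in tagged:
        if tag.startswith("NN"):                 # keep nouns only
            matches = wn.synsets(word, pos=wn.NOUN)
            if matches:
                synsets.add(matches[0].name())   # e.g. 'dog.n.01'
    return synsets

print(caption_to_synsets("A dog chasing a ball in the park"))
# e.g. {'dog.n.01', 'ball.n.01', 'park.n.01'} (exact synsets depend on WordNet)
```

Each caption thus yields a set of synset labels over a fixed vocabulary, which become the multi-hot targets for binary cross-entropy.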

Efficiency Demonstrated: Experiments show that CatLIP reduces pre-training time by 2.7x compared to CLIP while preserving downstream task accuracy.
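To see where the speedup comes from, the sketch below contrasts the two objectives on a single batch: the contrastive loss materializes a B x B image-text similarity matrix (typically gathered across devices), whereas the classification reframing only needs per-image logits over a fixed synset vocabulary plus a binary cross-entropy loss. Dimensions, the temperature value, and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

B, D, C = 1024, 768, 24_000   # batch size, embed dim, synset vocabulary size (illustrative)

img = F.normalize(torch.randn(B, D), dim=-1)   # image embeddings
txt = F.normalize(torch.randn(B, D), dim=-1)   # text embeddings

# CLIP-style contrastive loss: a B x B similarity matrix every step,
# usually requiring an all-gather of embeddings; this is what CatLIP avoids.
logits = img @ txt.t() / 0.07
targets = torch.arange(B)
contrastive = (F.cross_entropy(logits, targets) +
               F.cross_entropy(logits.t(), targets)) / 2

# CatLIP-style reframing: per-image logits over the synset vocabulary,
# multi-hot targets derived from the caption, binary cross-entropy.
classifier = torch.nn.Linear(D, C)
multi_hot = torch.zeros(B, C).scatter_(1, torch.randint(0, C, (B, 5)), 1.0)
bce = F.binary_cross_entropy_with_logits(classifier(img), multi_hot)
```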

Data and Model Scaling

The approach was evaluated across various scales of data and model complexities:

  • Data Scaling: Increasing the dataset size improved transfer learning accuracy, highlighting CatLIP's effectiveness across different dataset magnitudes.
  • Model Scaling: Larger models under CatLIP training showed enhanced representation quality, affirming the scalability of this approach.

Transfer Learning: CatLIP aids transfer learning by using the pre-trained classifier's embeddings to initialize the classification layer of the target task, which is particularly effective when the target labels closely match the pre-training synsets.
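A minimal sketch of that initialization is shown below, assuming the pre-trained head is a linear layer over the synset vocabulary and that target labels have already been matched to synset names; the function and argument names are ours, not from the paper's code.

```python
import torch

def init_target_classifier(pretrained_head: torch.nn.Linear,
                           synset_to_idx: dict[str, int],
                           target_labels: list[str]) -> torch.nn.Linear:
    """Initialize a target-task classifier from a pre-trained CatLIP head.

    Rows of the pre-trained head act as class embeddings: target labels that
    match a pre-training synset copy the corresponding row; the rest keep
    their random initialization. (Illustrative; the exact label-to-synset
    matching follows the paper.)
    """
    head = torch.nn.Linear(pretrained_head.in_features, len(target_labels))
    with torch.no_grad():
        for i, label in enumerate(target_labels):
            j = synset_to_idx.get(label)          # e.g. 'goldfish.n.01'
            if j is not None:
                head.weight[i] = pretrained_head.weight[j]
                head.bias[i] = pretrained_head.bias[j]
    return head
```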

Comparative Analysis

CatLIP performs competitively not only against CLIP but also against other state-of-the-art models, achieving comparable or better results on standard benchmarks such as ImageNet-1k and Places365 without the computational expense and time required by traditional contrastive learning.

Task Generalization

Generalization to Complex Tasks

CatLIP's robustness was further tested across more complex visual tasks:

  • Multi-Label Classification, Semantic Segmentation, and Object Detection: CatLIP maintained competitive performance across these tasks, demonstrating that it generalizes well beyond image classification.

Conclusion

CatLIP represents a significant advance in pre-training vision models on image-text data. By reframing pre-training as a classification task with a categorical loss, the method accelerates training while preserving high-quality representations. This work opens new avenues for efficient training on large-scale datasets and holds promising implications for both theoretical and practical advances in machine learning and computer vision.

Authors (8)
  1. Sachin Mehta (48 papers)
  2. Maxwell Horton (18 papers)
  3. Fartash Faghri (32 papers)
  4. Mohammad Hossein Sekhavat (4 papers)
  5. Mahyar Najibi (38 papers)
  6. Mehrdad Farajtabar (56 papers)
  7. Oncel Tuzel (62 papers)
  8. Mohammad Rastegari (57 papers)
Citations (4)