Boosting Visual-Language Models by Exploiting Hard Samples (2305.05208v2)
Abstract: Contrastive Language-Image Pre-training (CLIP) has become the standard for learning cross-modal representations between images and text. Efforts to improve its capabilities typically demand the collection of additional data and retraining with new loss functions. While effective, the added requirements limit their practical use due to the increased resource and time investments needed. In this work, we present HELIP, a cost-effective strategy tailored to enhance the performance of existing CLIP models without the need for training a model from scratch or collecting additional data. Our method allows for effortless integration with existing models' training pipelines, providing an instant boost by training them with selected challenging text-image pairs from their original training datasets. HELIP treats each text-image pair as a single point in the joint vision-language space, identifying those in close proximity as hard pairs. By incorporating the challenging data, pre-trained CLIP models are refined using both the traditional contrastive loss and the newly introduced hard negative margin loss, ensuring the challenging data is fully utilized. On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance. In particular, it improves the zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M and YFCC15M datasets. The improvements are 3.05%, 4.47%, and 10.1% respectively, achieved within two epochs of training. In addition, across fine-grained classification datasets, HELIP improves the zero-shot performance of pre-trained CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%.
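The abstract does not give exact formulas, but its two core ideas (representing each text-image pair as one point in the joint space to mine nearby "hard pairs", and penalizing hard negatives with a margin) can be illustrated with a minimal numpy sketch. All function names, the concatenation-based joint embedding, the top-k neighbor criterion, and the hinge form of the margin loss are assumptions for illustration, not the paper's exact method:

```python
# Hedged sketch of HELIP-style hard-pair mining and a hard negative
# margin loss. Shapes, the joint-embedding construction, and the hinge
# form of the loss are illustrative assumptions.
import numpy as np

def joint_embeddings(img_feats, txt_feats):
    """Represent each text-image pair as a single point by concatenating
    L2-normalized image and text features (one plausible construction)."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    return np.concatenate([l2norm(img_feats), l2norm(txt_feats)], axis=1)

def mine_hard_pairs(pair_embs, k=2):
    """For each pair, return the indices of its k most similar other
    pairs in the joint space; these act as hard pairs."""
    sims = pair_embs @ pair_embs.T
    np.fill_diagonal(sims, -np.inf)          # exclude each pair itself
    return np.argsort(-sims, axis=1)[:, :k]  # top-k nearest neighbors

def hard_negative_margin_loss(pos_sim, hard_neg_sims, margin=0.2):
    """Hinge-style margin loss: each hard negative's similarity should sit
    at least `margin` below the matched pair's similarity (assumed form)."""
    return np.mean(np.maximum(0.0, hard_neg_sims - pos_sim + margin))
```

In a fine-tuning loop, the mined hard pairs would be added to each batch so that both the standard contrastive loss and the margin term see the most confusable data; the sketch above only shows the mining and penalty steps, not the full training pipeline.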
Authors: Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, Tianyang Hu, Xiaodan Liang, Zhenguo Li, Hong Cheng, Kenji Kawaguchi