RankCLIP: Ranking-Consistent Language-Image Pretraining (2404.09387v2)
Abstract: Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise formulation, and leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
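To make the list-wise idea concrete, below is a minimal PyTorch sketch of a ranking-consistent training objective: it keeps CLIP's standard pairwise contrastive term and adds a listwise term that encourages the cross-modal similarity rankings within a batch to agree with the in-modal (image-image and text-text) rankings. This is an illustration under stated assumptions, not the paper's implementation: the listwise term here is a generic ListMLE (Plackett-Luce) loss, and the temperature `tau` and the 0.1 weighting are illustrative placeholders rather than values from RankCLIP.

```python
# Hedged sketch of a ranking-consistency loss in the spirit of the abstract above.
# The listwise component (ListMLE) and the 0.1 weight are illustrative choices,
# not the exact RankCLIP objective.
import torch
import torch.nn.functional as F


def listmle(target_scores: torch.Tensor, pred_scores: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Negative Plackett-Luce log-likelihood of the ordering induced by
    `target_scores` under `pred_scores`. Both tensors have shape (N, N)."""
    order = target_scores.argsort(dim=1, descending=True)        # target ranking per row
    s = torch.gather(pred_scores / tau, 1, order)                 # predictions, reordered
    # logsumexp over each suffix s[:, k:], via flip + logcumsumexp + flip back.
    suffix_lse = torch.flip(torch.logcumsumexp(torch.flip(s, dims=[1]), dim=1), dims=[1])
    return (suffix_lse - s).sum(dim=1).mean()


def ranking_consistent_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (N, D) embeddings of N matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    sim_it = img_emb @ txt_emb.t()   # cross-modal similarities
    sim_ii = img_emb @ img_emb.t()   # in-modal: image-image
    sim_tt = txt_emb @ txt_emb.t()   # in-modal: text-text

    # Standard CLIP pairwise (one-to-one) contrastive term.
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    clip_loss = 0.5 * (F.cross_entropy(sim_it / tau, labels)
                       + F.cross_entropy(sim_it.t() / tau, labels))

    # Listwise consistency: each row of the cross-modal similarity matrix should
    # rank the batch the same way the corresponding in-modal similarities do.
    rank_loss = 0.5 * (listmle(sim_ii, sim_it) + listmle(sim_tt, sim_it.t()))

    return clip_loss + 0.1 * rank_loss  # illustrative weighting of the listwise term
```

In a typical pretraining loop, `ranking_consistent_loss` would be applied to each batch of paired image and text embeddings produced by the two encoders; the pairwise term preserves the usual one-to-one matching signal, while the listwise term injects the many-to-many ranking structure the abstract describes.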
Authors: Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun