
RankCLIP: Ranking-Consistent Language-Image Pretraining (2404.09387v2)

Published 15 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise loss, and leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
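The abstract's core idea — replacing CLIP's pair-wise matching with a list-wise objective that makes cross-modal similarity rankings agree with in-modal ones — can be illustrated with a minimal sketch. This is a hypothetical NumPy implementation using a standard Plackett-Luce (ListMLE) listwise loss, not the paper's exact formulation; the function names and the choice of which in-modal similarity supervises which cross-modal ranking are illustrative assumptions.

```python
import numpy as np

def listmle_loss(scores, teacher):
    # Plackett-Luce / ListMLE negative log-likelihood: penalize `scores`
    # for disagreeing with the descending ranking induced by `teacher`.
    order = np.argsort(-teacher)          # teacher's ranking, best first
    s = scores[order]
    loss = 0.0
    for k in range(len(s)):
        tail = s[k:]
        m = tail.max()                    # log-sum-exp for stability
        loss += (m + np.log(np.exp(tail - m).sum())) - s[k]
    return loss / len(s)

def rank_consistency_loss(img, txt):
    # img, txt: (n, d) L2-normalized embeddings for n image-text pairs.
    cross = img @ txt.T                   # cross-modal similarities
    in_img = img @ img.T                  # in-modal: image-image
    in_txt = txt @ txt.T                  # in-modal: text-text
    n = img.shape[0]
    total = 0.0
    for i in range(n):
        # Each cross-modal row/column should respect the ordering
        # given by the corresponding in-modal similarities.
        total += listmle_loss(cross[i], in_txt[i])      # rank texts for image i
        total += listmle_loss(cross[:, i], in_img[i])   # rank images for text i
    return total / (2 * n)
```

In a real training setup this term would be computed on batch embeddings from the two encoders and combined with (or substituted for) the usual InfoNCE contrastive loss; the sketch above only shows how a list-wise ranking-consistency signal can be extracted from the similarity matrices.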

Authors (6)
  1. Yiming Zhang (128 papers)
  2. Zhuokai Zhao (21 papers)
  3. Zhaorun Chen (28 papers)
  4. Zhili Feng (22 papers)
  5. Zenghui Ding (4 papers)
  6. Yining Sun (8 papers)
Citations (6)