SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger (2303.17561v2)

Published 30 Mar 2023 in cs.CV and cs.AI

Abstract: Over the past two years, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, in which each image is matched exclusively to its own text, remains challenging, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target generated from fine-grained intra-modal self-similarity. This intra-modal guidance indicates that two pairs may share local similarities, allowing the model to capture many-to-many relationships between the two modalities. Moreover, since the positive still dominates the softened target distribution, we disentangle the negatives in the distribution to further boost relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP improves top-1 accuracy by 6.8%/7.2% over the CLIP baseline.
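The abstract's core idea, replacing CLIP's hard one-hot contrastive targets with targets softened by intra-modal self-similarity, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration of that idea, not the authors' implementation: the function name, the use of global (rather than fine-grained) embeddings, the temperature values, and the single-direction loss are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def soft_alignment_loss(img_emb, txt_emb, tau=0.07, tau_intra=0.07):
    """Hypothetical sketch of a softened CLIP-style objective.

    Standard CLIP trains cross-modal logits against one-hot targets
    (strict one-to-one alignment). Here the target distribution is
    instead derived from intra-modal self-similarity, so a caption can
    assign non-zero probability to images that are partially similar
    to its own (a many-to-many relaxation).
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Cross-modal image-to-text logits, as in standard CLIP.
    logits_i2t = img @ txt.t() / tau

    # Softened targets from intra-modal (image-image) self-similarity.
    # SoftCLIP derives these from fine-grained features; global
    # embeddings are used here only to keep the sketch short.
    with torch.no_grad():
        soft_targets = (img @ img.t() / tau_intra).softmax(dim=-1)

    # KL divergence to the softened distribution replaces the usual
    # cross-entropy to a one-hot target. A symmetric text-to-image
    # term would normally be added as well.
    log_probs = logits_i2t.log_softmax(dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

# Example usage with random embeddings (batch of 8, dimension 512):
loss = soft_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```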

References (38)
  1. Robust cross-modal representation learning with progressive self-distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16430–16441, 2022.
  2. Food-101 – Mining discriminative components with random forests. In Proceedings of the European Conference on Computer Vision, 2014.
  3. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  4. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  5. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  7. Evaluation of GIST descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 1–8, 2009.
  8. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
  9. PyramidCLIP: Hierarchical feature alignment for vision-language model pretraining. arXiv preprint arXiv:2204.14095, 2022.
  10. CyCLIP: Cyclic contrastive language-image pretraining. arXiv preprint arXiv:2205.14459, 2022.
  11. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
  12. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  13. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
  14. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
  15. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  16. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision, pages 121–137, 2020.
  17. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
  18. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755, 2014.
  19. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  20. Mixed precision training. In International Conference on Learning Representations, 2018.
  21. SLIP: Self-supervision meets language-image pre-training. In Proceedings of the European Conference on Computer Vision, pages 529–544. Springer, 2022.
  22. Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
  23. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  24. Cats and dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
  25. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  26. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  27. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5706–5715, 2018.
  28. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  29. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
  30. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
  31. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  32. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
  33. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
  34. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  35. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
  36. FILIP: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
  37. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  38. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588, 2021.
Authors (9)
  1. Yuting Gao (25 papers)
  2. Jinfeng Liu (59 papers)
  3. Zihan Xu (31 papers)
  4. Tong Wu
  5. Enwei Zhang
  6. Wei Liu (1135 papers)
  7. Jie Yang (516 papers)
  8. Ke Li (723 papers)
  9. Xing Sun (94 papers)
Citations (35)