Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation (2305.04474v3)

Published 8 May 2023 in cs.CV and cs.AI

Abstract: Cross-modal contrastive learning in vision-language pre-training (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is well understood that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives, while we theoretically prove that MI involving negatives also matters when noise is present. Guided by a more general lower bound for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which optimizes the MI between an image/text anchor and its negative texts/images more accurately instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
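The abstract describes regulating the InfoNCE objective with a progressively refined cross-modal similarity so that (partial) false negatives are not improperly pushed away. As a rough illustration only, the following is a minimal PyTorch sketch of one way such regulation could look: off-diagonal (negative) image-text pairs are down-weighted by an externally estimated similarity, e.g. from a momentum model. The function name, the (1 - similarity) weighting scheme, and the sigmoid-based similarity estimate are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def similarity_regulated_infonce(img_emb, txt_emb, soft_sim, temperature=0.07):
    """InfoNCE-style image-to-text contrastive loss in which off-diagonal
    (negative) pairs are down-weighted by an externally estimated cross-modal
    similarity, so that likely (partial) false negatives are penalized less.

    img_emb : (N, D) image embeddings
    txt_emb : (N, D) text embeddings
    soft_sim: (N, N) estimated cross-modal similarity in [0, 1]; the diagonal
              corresponds to the annotated positive pairs.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    logits = img_emb @ txt_emb.t() / temperature   # (N, N) similarity logits
    n = logits.size(0)
    eye = torch.eye(n, device=logits.device)

    # Weight for each negative: 1 for a clearly unrelated pair, shrinking
    # toward 0 as its estimated cross-modal similarity approaches that of
    # a positive (i.e. a likely partial false negative).
    neg_weight = (1.0 - soft_sim) * (1.0 - eye)

    exp_logits = torch.exp(logits)
    pos = exp_logits.diag()
    denom = pos + (exp_logits * neg_weight).sum(dim=1)
    return -torch.log(pos / denom).mean()


if __name__ == "__main__":
    # Toy usage with random embeddings and a fixed stand-in similarity estimate.
    N, D = 8, 256
    img, txt = torch.randn(N, D), torch.randn(N, D)
    with torch.no_grad():  # e.g. produced by a separate momentum encoder
        soft_sim = torch.sigmoid(
            F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()
        )
    loss = similarity_regulated_infonce(img, txt, soft_sim)
    print(loss.item())
```

In the paper's setting the cross-modal similarity estimate is progressively refined during pre-training, and the symmetric text-to-image direction would be added analogously; here the estimate is a fixed stand-in to keep the sketch self-contained.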
