Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning (2402.17510v2)
Abstract: Vision-language models (VLMs) mainly rely on contrastive training to learn general-purpose representations of images and captions. We focus on the setting in which one image is associated with several captions, each containing information shared among all captions as well as information unique to that caption about the depicted scene. In such cases, it is unclear whether contrastive losses suffice to learn task-optimal representations that capture all the information provided by the captions, or whether the contrastive setup instead encourages learning a simple shortcut that minimizes the loss. We introduce synthetic shortcuts for vision-language: a training and evaluation framework in which we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned on data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient to learn task-optimal representations, i.e., representations that contain all task-relevant information shared between the image and its associated captions. We examine two methods to reduce shortcut learning in our framework: (i) latent target decoding and (ii) implicit feature modification. We show empirically that both methods improve performance on the evaluation task but only partially reduce shortcut learning. Hence, our shortcut learning framework remains a difficult and open challenge for contrastive vision-language representation learning.
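The abstract does not spell out how the synthetic shortcuts are constructed, so the sketch below is only one plausible instantiation, not the authors' released code: a per-pair identifier is written into the image pixels and appended to every caption, giving the model a trivial cue that aligns matched pairs, and a standard symmetric InfoNCE loss (as used by CLIP-style contrastive VLMs) can then be minimized using that cue alone. The names `inject_shortcut`, `pair_id`, and `num_bits` are illustrative assumptions.

```python
# Hedged sketch: inject a shared synthetic shortcut into an image-caption pair,
# plus the standard symmetric InfoNCE loss used in contrastive VLM training.
import torch
import torch.nn.functional as F


def inject_shortcut(image: torch.Tensor, captions: list[str], pair_id: int,
                    num_bits: int = 16) -> tuple[torch.Tensor, list[str]]:
    """Attach a shared shortcut to an image (C, H, W) and its captions.

    Hypothetical construction: the pair id is written as a bit pattern into
    the first pixels of channel 0 and appended as a token to every caption,
    so image and captions can be matched from the shortcut alone.
    """
    bits = [(pair_id >> i) & 1 for i in range(num_bits)]
    image = image.clone()
    image[0, 0, :num_bits] = torch.tensor(bits, dtype=image.dtype,
                                          device=image.device)
    captions = [f"{c} [SHORTCUT_{pair_id}]" for c in captions]
    return image, captions


def info_nce(image_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under this construction, a model that encodes only the injected identifier already drives the InfoNCE loss toward zero, which is the failure mode the framework is designed to expose.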
Authors: Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke