Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models (2303.16133v2)
Abstract: As general-purpose vision models become increasingly effective across a wide set of tasks, it is imperative that they be consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are more difficult to incorporate into larger systems that depend on their outputs. Measuring consistency between heterogeneous tasks whose outputs may lie in different modalities is challenging, since it is difficult to determine whether the predictions agree with one another. As a solution, we introduce a benchmark dataset, CocoCon, in which we create contrast sets by modifying test instances for multiple tasks in small but semantically meaningful ways that change the gold label, and we outline metrics that measure consistency by ranking the original and perturbed instances across tasks. We find that state-of-the-art vision-language models exhibit a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks. To alleviate this issue, we propose a rank-correlation-based auxiliary training objective, computed over large, automatically created cross-task contrast sets, that improves the multi-task consistency of large unified models while retaining their original accuracy on downstream tasks.
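The abstract describes two mechanisms: a ranking-based consistency check over contrast sets, and a rank-correlation auxiliary objective built on differentiable (soft) ranking. The sketch below illustrates both under stated assumptions; it is not the authors' released code. The `score_fn` callable, the task/prompt dictionary layout, and the use of the `torchsort` package for soft ranks are illustrative assumptions.

```python
# Minimal sketch of the ideas in the abstract; not the CocoCon reference implementation.
from typing import Callable, Dict

import torch
import torchsort  # differentiable sorting/ranking; pip install torchsort (assumed dependency)


def is_cross_task_consistent(
    score_fn: Callable[[object, str, str], float],
    image: object,
    contrast_instance: Dict[str, Dict[str, str]],
) -> bool:
    """Count a model as consistent on one contrast-set instance if, for every
    task, it assigns a higher score (e.g. target log-likelihood) to the
    original gold output than to the minimally perturbed output."""
    for task_name, task in contrast_instance.items():
        s_original = score_fn(image, task["prompt"], task["original"])
        s_perturbed = score_fn(image, task["prompt"], task["perturbed"])
        if s_original <= s_perturbed:
            return False  # the model prefers the perturbed output on this task
    return True


def soft_spearman_consistency_loss(
    scores_task_a: torch.Tensor,  # shape (1, k): scores of k contrast candidates under task A
    scores_task_b: torch.Tensor,  # shape (1, k): scores of the same candidates under task B
    regularization_strength: float = 1.0,
) -> torch.Tensor:
    """Differentiable rank-correlation term: convert scores to soft ranks and
    maximize the Spearman correlation between the two tasks' rankings, so the
    gradient pushes the model toward agreeing orderings of the contrast set."""
    rank_a = torchsort.soft_rank(scores_task_a, regularization_strength=regularization_strength)
    rank_b = torchsort.soft_rank(scores_task_b, regularization_strength=regularization_strength)
    rank_a = rank_a - rank_a.mean()
    rank_a = rank_a / rank_a.norm()
    rank_b = rank_b - rank_b.mean()
    rank_b = rank_b / rank_b.norm()
    spearman = (rank_a * rank_b).sum()  # in [-1, 1]
    return 1.0 - spearman               # minimize to encourage agreeing rankings
```

In training, an auxiliary term like this would be weighted and added to the usual task losses; at evaluation time, averaging `is_cross_task_consistent` over the benchmark yields a consistency rate per task pair.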