CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples (2402.13254v4)
Abstract: We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability of both contrastive and generative multimodal models. In particular, we identify two critical, under-explored problems: the neglect of physically grounded reasoning (counting and position understanding) and the untapped potential of highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models such as CLIP and LLaVA on physically grounded compositional reasoning. We then apply simple data augmentation using the grounded image generation model GLIGEN to produce fine-tuning data, yielding significant performance improvements: +33% for CLIP and +37% for LLaVA on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text and image generation models, specifically GPT-4V and DALL-E 3, to curate challenging semantic counterfactuals, further enhancing compositional reasoning on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V. To facilitate future research, we release our code, dataset, benchmark, and checkpoints at https://countercurate.github.io.
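To make the physically grounded failure mode concrete, the snippet below sketches the kind of left/right probe the abstract describes, scored with the Hugging Face `transformers` CLIP API. This is a minimal illustration under our own assumptions: the `make_position_negative` helper, the example image path, and the caption pair are hypothetical and not taken from the released CounterCurate code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Spatial words to flip when constructing a positional counterfactual caption
# (hypothetical helper in the spirit of Flickr30k-Positions-style hard negatives).
SWAPS = {"left": "right", "right": "left", "above": "below", "below": "above"}

def make_position_negative(caption: str) -> str:
    """Flip spatial words to produce a hard-negative caption."""
    return " ".join(SWAPS.get(word.lower(), word) for word in caption.split())

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any image with a clear left/right relation
positive = "a dog to the left of a bicycle"
negative = make_position_negative(positive)  # "a dog to the right of a bicycle"

inputs = processor(text=[positive, negative], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity

# Base CLIP scores the flipped caption about as highly as the true one;
# the near-chance behavior the abstract reports motivates counterfactual fine-tuning.
print(logits.softmax(dim=-1))
```

Fine-tuning on such pairs as hard negatives, with GLIGEN supplying matching counterfactual images at scale (and GPT-4V plus DALL-E 3 supplying semantic counterfactuals), is the mechanism behind the reported gains.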
- Jean-Baptiste Alayrac et al. 2022. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736.
- James Betker et al. 2023. Improving image generation with better captions. OpenAI.
- Mu Cai et al. 2023. Making large multimodal models understand arbitrary visual prompts. arXiv preprint arXiv:2312.00784.
- Xi Chen et al. 2023. PaLI: A jointly-scaled multilingual language-image model. In The Eleventh International Conference on Learning Representations.
- Xinlei Chen et al. 2015. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
- Yen-Chun Chen et al. 2020. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer.
- Anuj Diwan et al. 2022. Why is Winoground hard? Investigating failures in visuolinguistic compositionality. arXiv preprint arXiv:2211.00768.
- Amir Hertz et al. 2022. Prompt-to-Prompt image editing with cross-attention control. arXiv preprint arXiv:2208.01626.
- Cheng-Yu Hsieh et al. 2023. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-seventh Conference on Neural Information Processing Systems.
- Gabriel Ilharco et al. 2021. OpenCLIP. Zenodo.
- Chao Jia et al. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
- Tiep Le et al. 2023. COCO-Counterfactuals: Automatically constructed counterfactual examples for image-text pairs. In Thirty-seventh Conference on Neural Information Processing Systems.
- Yuheng Li et al. 2023. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Tsung-Yi Lin et al. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision.
- Zhiqiu Lin et al. 2023. VisualGPTScore: Visio-linguistic reasoning with multimodal generative pre-training scores. arXiv preprint arXiv:2306.01879.
- Haotian Liu et al. 2023. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
- Haotian Liu et al. 2023. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- Arjun Mani et al. 2020. Point and ask: Incorporating pointing into visual question answering.
- Stephen L. Morgan and Christopher Winship. 2015. Counterfactuals and causal inference. Cambridge University Press.
- OpenAI. 2023a. ChatGPT. https://openai.com/blog/chatgpt/.
- OpenAI. 2023b. GPT-4 technical report.
- OpenAI. 2023c. GPT-4V(ision) system card.
- Bryan A. Plummer et al. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision.
- Alec Radford et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR.
- Christoph Schuhmann et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems.
- Hugo Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Michal Yarom et al. 2023. What you see is what you read? Improving text-image alignment evaluation. In Thirty-seventh Conference on Neural Information Processing Systems.
- Peter Young et al. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
- Mert Yuksekgonul et al. 2023. When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations.
- Pengchuan Zhang et al. 2021. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.
- Jianrui Zhang
- Mu Cai
- Tengyang Xie
- Yong Jae Lee