CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples (2402.13254v4)

Published 20 Feb 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements: +33% and +37% for CLIP and LLaVA, respectively, on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text generation and image generation models, specifically GPT-4V and DALLE-3, to curate challenging semantic counterfactuals, thereby further enhancing compositional reasoning capabilities on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V. To facilitate future research, we release our code, dataset, benchmark, and checkpoints at https://countercurate.github.io.
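The abstract's "physically grounded" counterfactuals (left/right position, counting) lend themselves to a brief illustration. Below is a minimal Python sketch, not the authors' released code, of how a positional hard-negative caption could be constructed by flipping spatial relations in a grounded caption; the helper name `positional_counterfactual` and the swap vocabulary are assumptions made for illustration only.

```python
# Illustrative sketch (not CounterCurate's released implementation): flipping
# spatial relations in a caption yields a counterfactual text that no longer
# matches the original image, which can then serve as a hard negative for
# contrastive (CLIP-style) fine-tuning or as a distractor choice in
# generative (LLaVA-style) instruction data.
import re

# Hypothetical spatial-word pairs; the paper's exact vocabulary may differ.
SWAPS = {"left": "right", "right": "left", "above": "below", "below": "above"}

def positional_counterfactual(caption: str) -> str:
    """Return a caption with spatial relations flipped (e.g. left <-> right)."""
    def flip(match: re.Match) -> str:
        word = match.group(0)
        flipped = SWAPS[word.lower()]
        # Preserve the capitalization of the original token.
        return flipped.capitalize() if word[0].isupper() else flipped

    pattern = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)
    return pattern.sub(flip, caption)

if __name__ == "__main__":
    caption = "A dog sitting to the left of a woman on a bench."
    negative = positional_counterfactual(caption)
    print("positive:", caption)   # ... left of a woman ...
    print("negative:", negative)  # ... right of a woman ...
```

In the paper's pipeline, such flipped captions are presumably paired with GLIGEN-generated counterfactual images to form the positional fine-tuning data; the sketch above covers only the text side.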

References (30)
  1. Flamingo: A visual language model for few-shot learning. NeurIPS, 35:23716–23736.
  2. Improving image generation with better captions.
  3. Making large multimodal models understand arbitrary visual prompts. arXiv preprint arXiv:2312.00784.
  4. PaLI: A jointly-scaled multilingual language-image model. In The Eleventh International Conference on Learning Representations.
  5. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  6. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer.
  7. Why is Winoground hard? Investigating failures in visuolinguistic compositionality. arXiv preprint arXiv:2211.00768.
  8. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.
  9. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality.
  10. OpenCLIP.
  11. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
  12. COCO-Counterfactuals: Automatically constructed counterfactual examples for image-text pairs.
  13. GLIGEN: Open-set grounded text-to-image generation.
  14. Microsoft COCO: Common objects in context.
  15. VisualGPTScore: Visio-linguistic reasoning with multimodal generative pre-training scores. arXiv preprint arXiv:2306.01879.
  16. Improved baselines with visual instruction tuning.
  17. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  18. Point and ask: Incorporating pointing into visual question answering.
  19. Stephen L. Morgan and Christopher Winship. 2015. Counterfactuals and Causal Inference. Cambridge University Press.
  20. OpenAI. 2023a. ChatGPT. https://openai.com/blog/chatgpt/.
  21. OpenAI. 2023b. GPT-4 technical report.
  22. OpenAI. 2023c. GPT-4V(ision) system card.
  23. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.
  24. Learning transferable visual models from natural language supervision.
  25. LAION-5B: An open large-scale dataset for training next generation image-text models.
  26. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  27. What you see is what you read? Improving text-image alignment evaluation. In Thirty-seventh Conference on Neural Information Processing Systems.
  28. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.
  29. When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations.
  30. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.
Authors (4)
  1. Jianrui Zhang (6 papers)
  2. Mu Cai (21 papers)
  3. Tengyang Xie (29 papers)
  4. Yong Jae Lee (88 papers)
Citations (9)