Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition (2406.09388v1)

Published 13 Jun 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment. This paper explores the intricate relationship between compositionality and recognition -- two pivotal aspects of VLM capability. We conduct a comprehensive evaluation of existing VLMs, covering both pre-training approaches aimed at recognition and the fine-tuning methods designed to improve compositionality. Our evaluation employs 12 benchmarks for compositionality, along with 21 zero-shot classification and two retrieval benchmarks for recognition. In our analysis from 274 CLIP model checkpoints, we reveal patterns and trade-offs that emerge between compositional understanding and recognition accuracy. Ultimately, this necessitates strategic efforts towards developing models that improve both capabilities, as well as the meticulous formulation of benchmarks for compositionality. We open our evaluation framework at https://github.com/ytaek-oh/vl_compo.

Authors (7)
  1. Youngtaek Oh (7 papers)
  2. Pyunghwan Ahn (8 papers)
  3. Jinhyung Kim (12 papers)
  4. Gwangmo Song (3 papers)
  5. Soonyoung Lee (10 papers)
  6. In So Kweon (156 papers)
  7. Junmo Kim (90 papers)
Citations (1)