Evaluating Vision-Language Models' Comprehension of Composition and Order
Introduction
Vision-language models (VLMs) have shown remarkable capabilities on a wide range of benchmark tasks, yet their proficiency in comprehending compositional relationships, attributes, and order remains underexplored. Through the Attribution, Relation, and Order (ARO) benchmark, this paper systematically examines these aspects of VLM understanding. ARO comprises more than 50,000 test cases across four tasks, offering a comprehensive evaluation of how well VLMs grasp object properties, relational dynamics, and sequential information in visual scenes. Our findings reveal that despite being trained on extensive datasets rich in compositional detail, current VLMs exhibit substantial deficiencies in all three areas.
The ARO Benchmark
The ARO benchmark is designed to probe VLMs' understanding across three principal domains, covered by four tasks:
- Visual Genome Attribution and Relation: These two tasks assess models' abilities to comprehend the attributes of objects and their relational dynamics within an image, respectively. The challenge lies in distinguishing correct from incorrect attributions or relations, such as identifying "the horse is eating the grass" as correct over "the grass is eating the horse."
- COCO-Order & Flickr30k-Order: These two tasks probe the models' sensitivity to word order, presenting VLMs with a correctly ordered caption alongside permuted versions of it. Models must identify the caption whose word order accurately describes the image (a sketch of both kinds of perturbation follows this list).
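To make the perturbations concrete, here is a minimal sketch of how a relation-style negative and an order-style negative can be derived from a caption. It is illustrative only: ARO's actual perturbation functions (e.g., shuffling nouns and adjectives, or trigrams) differ in detail, and the `swap_entities` / `shuffle_words` helpers are hypothetical names introduced here.

```python
import random

def shuffle_words(caption: str, seed: int = 0) -> str:
    """Order-style negative: shuffle all words in the caption.

    Illustrative only; ARO uses several finer-grained perturbation schemes.
    """
    words = caption.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def swap_entities(caption: str, a: str, b: str) -> str:
    """Relation-style negative: swap two entity mentions, e.g.
    'the horse is eating the grass' -> 'the grass is eating the horse'."""
    placeholder = "\x00"  # temporary token so the two replacements don't collide
    return caption.replace(a, placeholder).replace(b, a).replace(placeholder, b)

print(swap_entities("the horse is eating the grass", "horse", "grass"))
print(shuffle_words("the horse is eating the grass"))
```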
We evaluated several leading VLMs, including CLIP, BLIP, FLAVA, and X-VLM. The results indicate a pervasive struggle across models to represent compositional information accurately, with particular difficulty in relational understanding and order sensitivity; a sketch of the evaluation protocol follows.
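The evaluation reduces to a forced-choice test: given an image, does the model score the correct caption above its perturbed variants? Below is a minimal sketch using the open-source CLIP checkpoint on Hugging Face; the checkpoint name, the `model_prefers_correct` helper, and the `horse.jpg` file are assumptions for illustration, not the paper's exact harness.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def model_prefers_correct(image: Image.Image, correct: str,
                          perturbed: list[str]) -> bool:
    """Return True if the model scores the correct caption above all perturbations."""
    captions = [correct] + perturbed
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
    return logits.argmax(dim=-1).item() == 0      # index 0 holds the correct caption

image = Image.open("horse.jpg")  # hypothetical test image
print(model_prefers_correct(image, "the horse is eating the grass",
                            ["the grass is eating the horse"]))
```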
Limitations of Current Evaluation Protocols
A closer look at standard evaluation metrics and training procedures offers insight into these limitations. Notably, VLMs can achieve high performance on image-text retrieval — a standard evaluation task — without accurately comprehending order or composition. This calls into question whether retrieval alone captures the depth of VLMs' understanding.
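To see why retrieval can be passed without compositional understanding, consider how image-to-text Recall@K is typically computed. The distractors are captions of other images, so a model that treats each caption as a bag of words can still rank the correct caption first. A minimal sketch (the `recall_at_k` helper is hypothetical):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Image-to-text Recall@K from an (N x N) similarity matrix,
    assuming caption i is the ground-truth match for image i.

    Note what this does NOT test: every distractor describes a *different*
    image, so word order within a caption rarely matters for the score.
    """
    order = np.argsort(-sim, axis=1)  # caption indices, best match first
    ranks = np.array([np.where(order[i] == i)[0][0]   # rank of ground truth
                      for i in range(sim.shape[0])])
    return float((ranks < k).mean())
```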
Further examination suggests that the prevailing contrastive pretraining approach aligns closely with retrieval task objectives. In a contrastive batch, the negatives for an image are simply the captions of the other images, which rarely differ from the correct caption by composition or word order alone. Combined with the lack of explicit compositional variation in training datasets, this may inadvertently encourage models to discard richer compositional and sequential details: they have little incentive to encode information that does not affect the objective they are optimized for.
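A minimal sketch of the CLIP-style symmetric contrastive loss makes this explicit: the only negatives for image i are the captions of the other images in the batch, so nothing in the objective rewards encoding order or composition. (The function name and temperature default are illustrative.)

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss of the kind used in CLIP-style pretraining.

    Negatives for image i are rows j != i of the batch: captions of other
    images, which almost never differ from the positive by word order alone.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))            # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2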
Advancing Compositional Understanding
In response to these findings, this paper proposes a modification to the traditional training methodology through composition-aware hard negative mining. By incorporating more challenging negatives that emphasize compositional distinctions and order relevance, we demonstrate that VLMs can indeed develop a more refined understanding of these aspects. Experimental results reveal that this straightforward modification notably improves VLM performance on tasks requiring deep compositional understanding, without compromising their capabilities in other benchmark tasks.
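The core idea can be sketched as a small change to the contrastive objective above: alongside the usual in-batch negatives, each image also competes against a composition-perturbed version of its own caption (e.g., a word-order shuffle). This is a sketch of the idea under those assumptions, not the paper's exact implementation, which also mines additional negatives.

```python
import torch
import torch.nn.functional as F

def loss_with_hard_negatives(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                             hard_neg_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss with composition-aware hard negatives.

    hard_neg_emb holds embeddings of perturbed captions (one per image),
    appended to the text batch so each image must now distinguish its true
    caption from a compositionally scrambled version of it.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    all_txt = F.normalize(torch.cat([txt_emb, hard_neg_emb]), dim=-1)  # (2B, D)
    logits = img_emb @ all_txt.t() / temperature                       # (B, 2B)
    targets = torch.arange(len(img_emb))  # correct caption stays at index i
    return F.cross_entropy(logits, targets)
```

Because the perturbed caption shares the positive's vocabulary and differs only in structure, the loss can no longer be minimized by bag-of-words matching alone.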
Conclusion
Our research presents a critical evaluation of VLMs' ability to comprehend compositional relationships, attributes, and order, using the ARO benchmark. The results expose significant shortcomings in current models and suggest that existing training and evaluation practices need to be rethought. Incorporating composition-aware hard negatives into training offers a viable path toward a richer understanding of complex visual scenes. As the field progresses, fostering deeper comprehension of composition and order in VLMs remains an essential pursuit, promising gains in applicability and performance across a broader array of tasks.