Vision LLMs and Their Limitations in Recognizing Virtual Objects
In their paper, Tran, Khemlani, and Trafton critically evaluate the capability of Vision LLMs (VLMs) to process virtual objects. VLMs are designed to process and comprehend both visual and textual inputs in order to perform multimodal tasks. While these models excel at tasks such as automatic captioning and image tagging, the paper questions their ability to comprehend visuospatial relationships involving non-depicted, imaginary objects.
The paper explores whether VLMs can accurately update their internal scene representations when prompted with virtual objects not directly depicted in images. For instance, when shown an image of a person under a tree and prompted to imagine a kite stuck in the tree, a VLM should be able to infer the spatial relationships among the three entities (the person, the tree, and the kite). The paper systematically evaluated this imagination capability across multiple state-of-the-art VLMs, namely Idefics2, InstructBlip-Vicuna, and Llama 3.2.
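As an illustration of the task, a query of this kind could be posed to a VLM using the chat-style message format common to multimodal model APIs. The image path and prompt wording below are hypothetical, loosely following the kite example above; they are not taken from the paper.

```python
# Hypothetical sketch of an "imagined object" query for a VLM.
# The image path and prompt wording are illustrative assumptions.
image_path = "person_under_tree.jpg"  # assumed local image

prompt = (
    "Imagine a kite is stuck in the tree. "
    "List all of the objects in the scene, including imagined ones."
)

# Chat-style message structure used by many multimodal model APIs:
# one user turn carrying both the image and the text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]
```

A correct response would then have to list the two depicted objects alongside the one imagined object.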
Methodology
The researchers benchmarked virtual object recognition using the TableTest dataset, which comprises synthetic images containing one to three objects. The dataset includes 64 distinct objects from the Objaverse dataset, strategically positioned to examine spatial relations. The evaluation employed a series of prompts categorized along three dimensions: phrasing variations, verb tense, and numerical cues in object lists.
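Crossing those three dimensions yields a grid of prompt variants. A minimal sketch, assuming approximate stand-in wordings for each dimension (the paper's exact phrasings are not reproduced here):

```python
from itertools import product

# Assumed stand-ins for the three prompt dimensions; the paper's
# exact wordings may differ.
PHRASINGS = ["Imagine", "Pretend", "What if"]   # phrasing variation
VERBS = {"present": "is", "past": "was"}        # verb tense
NUMERIC_CUE = [False, True]                     # numerical cue in list query

def make_prompts(virtual_obj: str, n_total: int) -> list[str]:
    """Cross the three dimensions into a grid of prompt variants."""
    prompts = []
    for phrasing, verb, numeric in product(PHRASINGS, VERBS.values(), NUMERIC_CUE):
        count = f"the {n_total} objects" if numeric else "all objects"
        prompts.append(
            f"{phrasing} a {virtual_obj} {verb} placed next to the objects "
            f"in the image. List {count} in the scene."
        )
    return prompts

variants = make_prompts("kite", 3)  # 3 phrasings x 2 tenses x 2 cues = 12 prompts
```

Each image can then be queried once per variant, so effects of phrasing, tense, and numerical cues can be isolated.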
Prompts instructed VLMs to imagine additional objects near those in the images, querying them to list all perceived and imagined objects. Performance metrics focused on whether the models could correctly integrate imaginary virtual objects with depicted ones.
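The integration check described above can be sketched as a simple set-membership test over the model's listed objects. This is a deliberate simplification; the paper's actual scoring procedure may be stricter.

```python
def integrates_virtual_object(response: str,
                              depicted: set[str],
                              imagined: set[str]) -> bool:
    """Count a response as correct only if it names every depicted
    object and every imagined one (naive substring matching)."""
    listed = response.lower()
    return all(obj.lower() in listed for obj in depicted | imagined)

# Example: the model must mention the person, the tree, and the kite.
ok = integrates_virtual_object(
    "I can see a person, a tree, and a kite stuck in the branches.",
    depicted={"person", "tree"},
    imagined={"kite"},
)
```

Omitting any one entity, depicted or imagined, makes the response count as a failure, which is the integration behavior the benchmark probes.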
Findings
The findings reveal significant deficiencies in VLMs' ability to recognize and incorporate virtual objects into scene representations. The paper found marked discrepancies among the tested models: Idefics2 achieved a 63% accuracy rate, Llama 3.2 achieved 57%, and InstructBlip-Vicuna was notably lower at 22%. Analysis indicated that prompt phrasing influenced model interpretations inconsistently, with prompts using "pretend" being the most successful and those using "if" the least.
An evaluation of verb tense revealed a bias: past-tense prompts yielded better performance than present-tense ones. Additionally, the inclusion of numerical cues significantly improved VLM performance, suggesting these cues help the models better process virtual additions to scenes.
Implications and Future Directions
This paper underscores the limitations in the spatial cognition capabilities of current VLMs when handling virtual objects. Their inability to track non-depicted objects calls into question the extent of VLMs' true spatial reasoning abilities. Future AI systems may need to integrate discrete, human-like mental models to support imagination and flexible reasoning. Such advancements could expand VLMs' potential applications in creative and hypothetical scenarios where imagination plays a significant role.
Overall, the results caution against assuming that VLMs comprehend scenes beyond the boundaries of their training data. Further research is required to explore architectures that can robustly model spatial reasoning involving constructs that are not explicitly visualized.