Vision LLMs and Their Limitations in Recognizing Virtual Objects
In their paper, Tran, Khemlani, and Trafton critically evaluate the capability of Vision LLMs (VLMs) to process virtual objects. VLMs are designed to process and comprehend both visual and textual inputs in order to perform multimodal tasks. While these models excel at tasks such as automatic captioning and image tagging, the paper questions their ability to comprehend visuospatial relationships involving non-depicted, imaginary objects.
The paper explores whether VLMs can accurately update their internal scene representations when prompted with virtual objects not directly depicted in images. For instance, when shown an image of a person under a tree and prompted to imagine a kite stuck in the tree, a VLM should be able to infer the spatial relationships among the three entities (the person, the tree, and the kite). The paper systematically evaluated this imagination capability across multiple state-of-the-art VLMs, namely Idefics2, InstructBlip-Vicuna, and Llama 3.2.
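As an illustration of the task, a query of this kind could be posed to a VLM using the chat-style message format common to multimodal model APIs. The image path and prompt wording below are hypothetical, loosely following the kite example above; they are not taken from the paper.

```python
# Hypothetical sketch of an "imagined object" query for a VLM.
# The image path and prompt wording are illustrative assumptions.
image_path = "person_under_tree.jpg"  # assumed local image

prompt = (
    "Imagine a kite is stuck in the tree. "
    "List all of the objects in the scene, including imagined ones."
)

# Chat-style message structure used by many multimodal model APIs:
# one user turn carrying both the image and the text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]
```

A correct response would then have to list the two depicted objects alongside the one imagined object.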
Methodology
The researchers benchmarked virtual object recognition using the TableTest dataset, which comprises synthetic images containing one to three objects. The dataset includes 64 distinct objects from the Objaverse dataset, strategically positioned to examine spatial relations. The evaluation employed a series of prompts categorized along three dimensions: phrasing variations, verb tense, and numerical cues in object lists.
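Crossing those three dimensions yields a grid of prompt variants. A minimal sketch, assuming approximate stand-in wordings for each dimension (the paper's exact phrasings are not reproduced here):

```python
from itertools import product

# Assumed stand-ins for the three prompt dimensions; the paper's
# exact wordings may differ.
PHRASINGS = ["Imagine", "Pretend", "What if"]   # phrasing variation
VERBS = {"present": "is", "past": "was"}        # verb tense
NUMERIC_CUE = [False, True]                     # numerical cue in list query

def make_prompts(virtual_obj: str, n_total: int) -> list[str]:
    """Cross the three dimensions into a grid of prompt variants."""
    prompts = []
    for phrasing, verb, numeric in product(PHRASINGS, VERBS.values(), NUMERIC_CUE):
        count = f"the {n_total} objects" if numeric else "all objects"
        prompts.append(
            f"{phrasing} a {virtual_obj} {verb} placed next to the objects "
            f"in the image. List {count} in the scene."
        )
    return prompts

variants = make_prompts("kite", 3)  # 3 phrasings x 2 tenses x 2 cues = 12 prompts
```

Each image can then be queried once per variant, so effects of phrasing, tense, and numerical cues can be isolated.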
Prompts instructed VLMs to imagine additional objects near those in the images, querying them to list all perceived and imagined objects. Performance metrics focused on whether the models could correctly integrate imaginary virtual objects with depicted ones.
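The integration check described above can be sketched as a simple set-membership test over the model's listed objects. This is a deliberate simplification; the paper's actual scoring procedure may be stricter.

```python
def integrates_virtual_object(response: str,
                              depicted: set[str],
                              imagined: set[str]) -> bool:
    """Count a response as correct only if it names every depicted
    object and every imagined one (naive substring matching)."""
    listed = response.lower()
    return all(obj.lower() in listed for obj in depicted | imagined)

# Example: the model must mention the person, the tree, and the kite.
ok = integrates_virtual_object(
    "I can see a person, a tree, and a kite stuck in the branches.",
    depicted={"person", "tree"},
    imagined={"kite"},
)
```

Omitting any one entity, depicted or imagined, makes the response count as a failure, which is the integration behavior the benchmark probes.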
Findings
The findings reveal significant deficiencies in VLMs' ability to recognize and incorporate virtual objects into scene representations. The paper found marked discrepancies among the tested models: Idefics2 achieved a 63% accuracy rate, Llama 3.2 achieved 57%, and InstructBlip-Vicuna was notably lower at 22%. Analysis indicated that prompt phrasing influenced model interpretations inconsistently, with prompts using "pretend" being the most successful and those using "if" the least.
An evaluation of verb tense revealed a bias: past-tense prompts yielded better performance than present-tense ones. Additionally, the inclusion of numerical cues significantly improved VLM performance, suggesting these cues help the models better process virtual additions to scenes.
Implications and Future Directions
This paper underscores the limitations in the spatial cognition capabilities of current VLMs when handling virtual objects. Their inability to track non-depicted objects calls into question the extent of VLMs' true spatial reasoning abilities. Future AI systems may need to integrate discrete, human-like mental models to support imagination and flexible reasoning. Such advancements could expand VLMs' potential applications in creative and hypothetical scenarios where imagination plays a significant role.
Overall, the results caution against assuming that VLMs comprehend scenes beyond the boundaries of their training data. Further research is required to explore architectures that can robustly model spatial reasoning involving constructs that are not explicitly visualized.