- The paper shows that vision-language (VL) models struggle markedly with spatial reasoning, often scoring only slightly above random chance.
- The study introduces three benchmarks (COCO-spatial, GQA-spatial, and What's "Up") that isolate spatial reasoning from other aspects of visual understanding.
- Analysis reveals that spatial prepositions are sparse and ambiguous in pre-training captions, hindering model understanding and prompting exploration of alternative fine-tuning strategies.
Investigating Vision-Language Models' Struggle with Spatial Reasoning
Recent advances in neural networks, especially vision-language (VL) models, have produced impressive performance on many complex tasks. Spatial reasoning, however, a fundamental cognitive skill, remains a persistent weak point for these models. This paper investigates the limits of VL models' spatial reasoning, highlights the gap between human and machine performance in this domain, and explores potential ways to improve VL models.
Introduction to Spatial Reasoning Challenges
VL models are pre-trained on extensive datasets and generalize well to tasks such as visual question answering (e.g., VQAv2) and captioning, yet they struggle with fundamental spatial reasoning. Discerning simple spatial arrangements such as "left of" or "right of" is a prerequisite for sophisticated visual understanding, but recent evaluations show notable deficiencies in VL models' proficiency in exactly this area.
Figure 1: We propose three tightly controlled benchmarks to assess model capacity for fine-grained spatial reasoning, showing that popular vision-language models fall far behind human performance when asked to select the correct spatial relation between two objects in an image (real examples shown).
Benchmarking Spatial Reasoning
The paper introduces three benchmarks to isolate spatial reasoning from other forms of visual interpretation: two derived from COCO and GQA, and one built on a newly constructed dataset called What's "Up". Each benchmark requires a model to identify the correct spatial preposition relating two objects in an image, a task that humans perform almost perfectly but that remains challenging for models.
Figure 2: Examples from our three proposed benchmarks. Each image is paired with four text options in What's "Up" and two text options in COCO-spatial and GQA-spatial. Given a single image and the corresponding text options, a VL model must select the correct option.
Evaluation of Vision-Language Models
Evaluations of 18 diverse VL models, including architectures like CLIP, BLIP, and CoCa, show consistently poor performance across all benchmarks. Models often score slightly above random chance, highlighting their limited spatial reasoning capabilities even when scaled up or fine-tuned on relevant datasets.
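To make the evaluation protocol concrete, below is a minimal sketch of the standard image-to-text matching setup used with contrastive models such as CLIP, assuming the Hugging Face transformers API. The image path and candidate captions are hypothetical illustrations in the style of What's "Up", not items from the released benchmarks.

```python
# Minimal sketch: zero-shot multiple-choice scoring with a CLIP-style model.
# The image file and caption options below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_option(image_path: str, options: list[str]) -> int:
    """Return the index of the caption the model scores highest for this image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_options)
    return int(logits.argmax(dim=-1))

# Hypothetical What's "Up"-style item: four candidate captions, one per relation.
# Benchmark accuracy is the fraction of images where the top-scoring caption is correct.
options = [
    "a mug on a table",
    "a mug under a table",
    "a mug to the left of a table",
    "a mug to the right of a table",
]
print(options[pick_option("example.jpg", options)])
```

Because the options differ only in the preposition, a model that ignores spatial wording degenerates to chance-level selection, which is what the reported scores indicate.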
Analysis of Pre-training Data
Investigations into the LAION-2B pre-training corpus reveal that spatial prepositions appear only sparsely in captions and, when present, are often used ambiguously, which likely contributes to models' struggles with spatial reasoning. Prepositions frequently occur in non-spatial or underspecified contexts, hindering the models from learning a robust understanding of spatial relations between objects.
Figure 3: Examples of ambiguous uses of spatial prepositions in LAION captions, with accompanying discussion.
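As an illustration of the kind of corpus audit involved, the following is a minimal sketch that counts captions mentioning spatial terms. The term list, example captions, and matching rule are assumptions for illustration, not the paper's exact methodology; note that a surface match (e.g., "left side view") does not guarantee the caption expresses a relation between two depicted objects, which is exactly the ambiguity the analysis highlights.

```python
# Minimal sketch of a caption audit: count how often spatial terms occur in a
# sample of captions. The term list and captions are illustrative assumptions.
import re
from collections import Counter

SPATIAL_TERMS = ["left", "right", "above", "below", "under",
                 "behind", "in front of", "on top of"]

def spatial_term_counts(captions):
    """Count captions containing each spatial term (word-boundary match)."""
    counts = Counter()
    for caption in captions:
        text = caption.lower()
        for term in SPATIAL_TERMS:
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                counts[term] += 1
    return counts

captions = [
    "a lamp to the left of the bed",       # genuinely relational use of "left"
    "left side view of the hotel lobby",   # matches "left" but relates no two objects
    "a cat sitting on top of a red couch",
]
print(spatial_term_counts(captions))
```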
Strategies for Improvement
Various strategies are explored, including re-normalizing caption probabilities, employing alternative prompts, and fine-tuning on spatially focused datasets. While these methods yield slight improvements, they fail to close the substantial gap between human and model performance.

Figure 4: Train loss (left) and negative caption loss (right) when fine-tuning variants of CLIP on LAION-4M-prep with hard negatives targeting prepositions, on either the full dataset or half of the dataset (suffix _2M).
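The sketch below illustrates, under simplifying assumptions, what a hard-negative objective targeting prepositions could look like for a CLIP-style dual encoder. The loss form, function names, and negative-generation rule are illustrative rather than the paper's implementation.

```python
# Minimal sketch, not the authors' training code: a hinge-style term that pushes
# an image embedding toward its true caption and away from a "hard negative"
# caption whose spatial preposition has been swapped. All names are illustrative.
import torch
import torch.nn.functional as F

def preposition_hard_negative_loss(image_emb: torch.Tensor,
                                   pos_text_emb: torch.Tensor,
                                   neg_text_emb: torch.Tensor,
                                   margin: float = 0.2) -> torch.Tensor:
    """Require sim(image, true caption) to exceed sim(image, swapped caption) by a margin."""
    image_emb = F.normalize(image_emb, dim=-1)
    pos_sim = (image_emb * F.normalize(pos_text_emb, dim=-1)).sum(dim=-1)
    neg_sim = (image_emb * F.normalize(neg_text_emb, dim=-1)).sum(dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

def swap_preposition(caption: str) -> str:
    """Toy hard-negative generator: flip 'left'/'right' in a caption."""
    if "left" in caption:
        return caption.replace("left", "right")
    return caption.replace("right", "left")

# Usage (shapes only): embeddings of shape (batch, dim) from the image and text
# encoders; this term would be added to the usual contrastive loss during fine-tuning.
```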
Implications and Future Directions
Findings from this study suggest a need for pre-training data that more faithfully represents spatial relationships in real-world contexts. Future research could explore data generation methods that emphasize spatial relations, or architectural changes that build spatial understanding into VL models more directly.
Conclusion
Despite breakthroughs in many areas of artificial intelligence, VL models show a profound gap in spatial reasoning relative to humans. This paper's benchmarks provide critical insight into these models' limitations and point future research toward overcoming them. Ensuring that vision-language models can reliably understand spatial relations is essential for advancing AI's capacity for real-world visual tasks.