Persistence of Visual Understanding Weakness in Foundation Models

Determine whether contemporary foundation models will continue to exhibit relative weakness in visual understanding and spatial reasoning compared to their textual capabilities, and whether this disparity persists as model development progresses.

Background

The paper surveys benchmarks that exploit current models’ comparatively weaker performance on visual and spatial tasks relative to text-only reasoning, citing examples such as ZeroBench, ARC-AGI, OSWorld, and WebCanvas. The authors caution that this observed imbalance may change, which would affect the relevance and predictive power of such benchmarks for real-world usefulness.

Understanding whether this trend endures is important for designing future benchmarks and for assessing agents’ capabilities in environments that require visual comprehension and spatial reasoning, especially as specialized tooling and model architectures evolve.

References

The paper states: "It is unclear if the trend of relative weakness in visual understanding will continue."

Rein et al., "HCAST: Human-Calibrated Autonomy Software Tasks," arXiv:2503.17354, 21 Mar 2025 (Appendix, Related Work).