Evaluation of Context-Sensitive Text-Rich Visual Reasoning
Introduction
The advent of instruction-tuned large multimodal models (LMMs) has led to markedly stronger capabilities in responding to human instructions over images. Recent datasets have focused on assessing the Optical Character Recognition (OCR) ability of models, but this falls short of testing the full potential of LMMs to jointly reason over the text and visual context in an image. To bridge this gap, the paper introduces the benchmark ConTextual, designed to evaluate the ability of LMMs to perform context-sensitive reasoning over diverse and challenging real-world scenarios.
ConTextual Dataset
ConTextual consists of 506 challenging instructions testing LMMs across eight visual scenarios that represent daily-life natural or digital scenes. The dataset demands joint reasoning over textual and visual cues, something prior datasets do not sufficiently incentivize. The instructions include open-ended questions and imperative tasks, and they require capabilities beyond information extraction, including mathematical reasoning.
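To make the dataset description concrete, the following is a minimal sketch of how a ConTextual-style instance could be represented for evaluation. The field names (image_path, visual_scenario, instruction, reference_response) and the scenario labels listed are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class VisualScenario(Enum):
    """Hypothetical labels for the eight visual scenarios spanning
    daily-life natural and digital scenes (exact names assumed here)."""
    TIME = "time"
    SHOPPING = "shopping"
    NAVIGATION = "navigation"
    ABSTRACT = "abstract"
    APPLICATION = "application"
    WEB = "web"
    INFOGRAPHIC = "infographic"
    MISCELLANEOUS = "miscellaneous"


@dataclass
class ContextualInstance:
    """One instruction-image pair from a ConTextual-style benchmark (schema assumed)."""
    image_path: str                  # path to the text-rich image
    visual_scenario: VisualScenario  # one of the eight scenario categories
    instruction: str                 # open-ended question or imperative task
    reference_response: str          # human-written reference used for scoring


def load_benchmark(records: List[Dict[str, str]]) -> List[ContextualInstance]:
    """Convert raw records (e.g., parsed from a JSON file) into typed instances."""
    return [
        ContextualInstance(
            image_path=r["image_path"],
            visual_scenario=VisualScenario(r["visual_scenario"]),
            instruction=r["instruction"],
            reference_response=r["reference_response"],
        )
        for r in records
    ]
```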
Experimental Setup and Findings
A comprehensive set of experiments was conducted with 13 foundation models, including both proprietary models (e.g., GPT-4V, Gemini-Pro-Vision) and open LMMs (e.g., LLaVA-1.5). The findings show GPT-4V(ision) outperforming the other LMMs, even though it still lags behind human performance by 30.8%. There is a notable performance disparity between open LMMs and proprietary models, pointing to the need for future work that narrows this divide.
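As a rough illustration of how such headline numbers could be aggregated, the sketch below averages per-instance binary acceptance judgments into per-model scores and reports each model's gap to a human baseline. The data layout, the binary scoring, and the example judgments are assumptions for illustration, not the paper's actual evaluation pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_scores(judgments: List[Tuple[str, str, int]]) -> Dict[str, float]:
    """Average binary acceptance judgments (model, category, 0/1)
    into an overall accuracy per model, in percent."""
    per_model: Dict[str, List[int]] = defaultdict(list)
    for model, _category, accepted in judgments:
        per_model[model].append(accepted)
    return {m: 100.0 * sum(v) / len(v) for m, v in per_model.items()}


def human_gap(scores: Dict[str, float], human_key: str = "human") -> Dict[str, float]:
    """Gap (in percentage points) between each model and the human baseline."""
    human = scores[human_key]
    return {m: human - s for m, s in scores.items() if m != human_key}


# Illustrative usage with made-up judgments (not real benchmark results):
judgments = [
    ("human", "abstract", 1), ("human", "time", 1), ("human", "web", 0),
    ("gpt-4v", "abstract", 1), ("gpt-4v", "time", 0), ("gpt-4v", "web", 0),
]
scores = aggregate_scores(judgments)
print(scores)             # human ~66.7, gpt-4v ~33.3
print(human_gap(scores))  # gpt-4v trails the human baseline by ~33.3 points
```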
Model Performance and Analysis
Qualitative analysis reveals a range of performance levels, with GPT-4V and Gemini-Pro-Vision showing superior context-sensitive text-rich visual reasoning, whereas open-source LMMs underperform considerably. The analysis also identifies issues such as hallucination and failure to ground instructions in the image. Interestingly, in certain abstract categories such as memes and quotes, GPT-4V exceeds human performance, indicating the potential of tuning LMMs for better visual context understanding. Overall, the ConTextual benchmark demonstrates the challenging nature of context-sensitive text-rich visual reasoning and the gap that remains in modern LMMs on such tasks.