The paper introduces the concept of cross-modal consistency in multimodal LLMs and presents a quantitative evaluation framework to measure it. The authors argue that existing evaluations of Vision LLMs (VLLMs) often focus on individual modality performance, neglecting the interaction and consistency between modalities. The research focuses on the discrepancies in capability between modalities, especially vision and language.
The authors define cross-modal consistency as the degree to which a multimodal model produces the same output when presented with the same task instance across different modalities, assuming the information necessary for solving the task is preserved during modality conversion. They propose that a model $M$ exhibits consistency between modalities $a$ and $b$ if:

$$M(x_a, q) = M(C_{a \to b}(x_a), q) \quad \text{for all } x_a \in X_a,$$

where:
- $M$ is the multimodal model
- $x_a$ is a data element from the input space $X_a$ corresponding to modality $a$
- $q$ is the abstract query
- $C_{a \to b}$ is an information-preserving converter mapping data elements from modality $a$ to modality $b$
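Read concretely, the definition is a predicate over paired inputs. The following is a minimal Python sketch of that reading; the `model` callable and `convert` function stand in for $M$ and $C_{a \to b}$ and are assumptions for illustration, not code from the paper.

```python
from typing import Any, Callable

def is_consistent(
    model: Callable[[Any, str], str],   # M: takes a data element and a query, returns an answer
    convert: Callable[[Any], Any],      # C_{a->b}: information-preserving modality converter
    x_a: Any,                           # data element in modality a (e.g., a task written as text)
    query: str,                         # abstract query q, shared across both modalities
) -> bool:
    """Return True if the model gives the same answer in both modalities."""
    x_b = convert(x_a)                  # the same task instance, rendered in modality b
    return model(x_a, query) == model(x_b, query)
```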
To evaluate cross-modal consistency, the authors construct a vision-language parallel dataset spanning seven tasks:
- Math Equation Solving (easy and hard)
- Logical Reasoning
- Table Understanding
- State Machine Reasoning
- Reading Comprehension
These datasets are designed such that data instances can be converted between image and text formats while preserving task-related information, using Optical Character Recognition (OCR) and screenshot software as converters.
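The converters are off-the-shelf tools rather than learned components. Below is a rough sketch of both directions, assuming `pytesseract` for OCR and Pillow for rendering text onto an image; the paper only refers to OCR and screenshot software generically, so these specific libraries, the default font, and the fixed layout are assumptions.

```python
from PIL import Image, ImageDraw, ImageFont
import pytesseract

def image_to_text(image_path: str) -> str:
    """Image -> text converter: extract the task text with OCR."""
    return pytesseract.image_to_string(Image.open(image_path)).strip()

def text_to_image(task_text: str, out_path: str, width: int = 800, pad: int = 20) -> str:
    """Text -> image converter: render the task text onto a plain white canvas."""
    font = ImageFont.load_default()
    lines = task_text.splitlines() or [task_text]
    line_h = 14  # rough line height for the default bitmap font
    img = Image.new("RGB", (width, pad * 2 + line_h * len(lines)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((pad, pad + i * line_h), line, fill="black", font=font)
    img.save(out_path)
    return out_path
```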
The core of the evaluation framework involves comparing a model's outputs on paired instances $x_a$ and $x_b = C_{a \to b}(x_a)$. The task consistency score is computed as:

$$\text{Consistency} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ M(x_a^{(i)}, q^{(i)}) = M(x_b^{(i)}, q^{(i)}) \right],$$

where $\mathbb{1}[\cdot]$ is the indicator function and $N$ is the number of paired instances in the task.
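Under exact-match answer comparison, the score is simply the fraction of paired instances on which the two answers agree. A minimal sketch, assuming the answers have already been normalized to comparable strings:

```python
def consistency_score(answers_text: list[str], answers_image: list[str]) -> float:
    """Fraction of paired instances where the text-prompted and image-prompted answers match."""
    assert len(answers_text) == len(answers_image)
    matches = sum(a == b for a, b in zip(answers_text, answers_image))
    return matches / len(answers_text)
```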
The authors conduct experiments with GPT-4V, evaluating its cross-modal consistency on the constructed datasets. The results reveal significant inconsistencies between the vision and language modalities: the model's performance varies depending on which modality the task is presented in. In tasks involving intricate reasoning, such as equation solving, math/logical reasoning, and state machine reasoning, the model exhibits lower accuracy with image inputs than with text inputs. Conversely, in tasks focused on information extraction and comprehension, such as reading comprehension and table understanding, the model shows near-perfect performance with text inputs but a substantial drop in accuracy with image inputs.
To investigate whether the observed performance gap is due to the model's inability to access information from images, the authors conduct an ablation study involving OCR on image inputs. The results indicate that the model can accurately extract information from images, suggesting that the performance gap is primarily attributable to the model's internal reasoning processes for each modality. Conditional consistency scores are also reported for image instances, split by correct versus incorrect OCR results.
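Conditional consistency restricts the score to the subset of image instances whose OCR transcription was judged correct (or incorrect). A hedged sketch of that partitioning, reusing the `consistency_score` helper above; the `ocr_correct` flags are assumed to come from a separate manual or automatic check.

```python
def conditional_consistency(
    answers_text: list[str],
    answers_image: list[str],
    ocr_correct: list[bool],
) -> dict[str, float]:
    """Consistency computed separately over OCR-correct and OCR-incorrect image instances."""
    scores = {}
    for label, keep in (("ocr_correct", True), ("ocr_incorrect", False)):
        idx = [i for i, ok in enumerate(ocr_correct) if ok is keep]
        if idx:
            scores[label] = consistency_score(
                [answers_text[i] for i in idx],
                [answers_image[i] for i in idx],
            )
    return scores
```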
To address the identified cross-modal inconsistency, the authors introduce a method called Vision-Depicting-Prompting (VDP). VDP involves a two-step process (a code sketch follows the list):
- Prompting the model to extract and articulate a textual description of the image task.
- Prompting the model to provide an answer, considering both the textual description and the original image input.
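To make the two steps concrete, the sketch below wires them to the OpenAI chat completions API. The model name, prompt wording, and helper structure are assumptions for illustration; the paper's exact prompts are not reproduced here.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"   # placeholder for a vision-capable model

def _image_content(image_path: str) -> dict:
    """Package a local image as a base64 data URL for the chat API."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def vdp_answer(image_path: str, query: str) -> str:
    # Step 1: ask the model to extract and articulate a textual description of the image task.
    describe = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe, in text, the task shown in this image."},
                _image_content(image_path),
            ],
        }],
    )
    description = describe.choices[0].message.content

    # Step 2: answer the query, conditioning on both the description and the original image.
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Task description: {description}\n\n{query}"},
                _image_content(image_path),
            ],
        }],
    )
    return answer.choices[0].message.content
```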
The experimental results demonstrate that VDP improves accuracy in vision-based tasks compared to naive prompting. In tasks requiring reasoning abilities, VDP yields an average accuracy enhancement of 19%. In tasks centered around understanding, VDP achieves an average accuracy increase of 57%, with performance reaching parity with text-based prompting in some cases. The authors also observe a substantial increase in the consistency score with VDP compared to prompting with plain images.
The authors conclude that multimodal systems like GPT-4V maintain relatively independent internal representations for reasoning over visual and textual signals. They suggest that these findings offer insights into the potential applications of multimodal systems and highlight the need for more integrated system designs. Vision-Depicting-Prompting (VDP) is presented as an effective approach to addressing this cross-modal inconsistency.