
Cross-Modal Consistency in Multimodal Large Language Models (2411.09273v1)

Published 14 Nov 2024 in cs.CL and cs.AI

Abstract: Recent developments in multimodal methodologies have marked the beginning of an exciting era for models adept at processing diverse data types, encompassing text, audio, and visual content. Models like GPT-4V, which merge computer vision with advanced language processing, exhibit extraordinary proficiency in handling intricate tasks that require a simultaneous understanding of both textual and visual information. Prior research efforts have meticulously evaluated the efficacy of these Vision LLMs (VLLMs) in various domains, including object detection, image captioning, and other related fields. However, existing analyses have often suffered from limitations, primarily centering on the isolated evaluation of each modality's performance while neglecting to explore their intricate cross-modal interactions. Specifically, the question of whether these models achieve the same level of accuracy when confronted with identical task instances across different modalities remains unanswered. In this study, we take the initiative to delve into the interaction and comparison among these modalities of interest by introducing a novel concept termed cross-modal consistency. Furthermore, we propose a quantitative evaluation framework founded on this concept. Our experimental findings, drawn from a curated collection of parallel vision-language datasets developed by us, unveil a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our research yields insights into the appropriate utilization of such models and hints at potential avenues for enhancing their design.

The paper introduces the concept of cross-modal consistency in multimodal LLMs and presents a quantitative evaluation framework to measure it. The authors argue that existing evaluations of Vision LLMs (VLLMs) often focus on individual modality performance, neglecting the interaction and consistency between modalities. The research focuses on the discrepancies in capability between modalities, especially vision and language.

The authors define cross-modal consistency as the degree to which a multimodal model produces the same output when presented with the same task instance across different modalities, assuming the information necessary for solving the task is preserved during modality conversion. They propose that a model $M$ exhibits consistency between modalities $a$ and $b$ if (a minimal code sketch of this check follows the symbol definitions below):

$$M(d_a, q) = M(K^{q}_{a,b}(d_a), q), \quad \forall d_a \in \mathcal{D}_a,\ q \in \mathcal{Q}$$

Where:

  • $M$ is the multimodal model
  • $d_a$ is a data element from the input space $\mathcal{D}_a$ corresponding to modality $a$
  • $q$ is the abstract query
  • $K^{q}_{a,b}$ is an information-preserving converter mapping data elements from modality $a$ to modality $b$
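
To make the definition concrete, the following minimal Python sketch checks the condition over a finite set of instances. Here `model` and `convert_a_to_b` are hypothetical stand-ins for $M$ and $K^{q}_{a,b}$, not part of any released code.

```python
# Minimal sketch of the cross-modal consistency condition, assuming a
# hypothetical callable `model(instance, query)` returning an answer string
# and a converter `convert_a_to_b` mapping a modality-a instance (e.g. text)
# to modality b (e.g. an image) while preserving task-relevant information.

def is_consistent(model, convert_a_to_b, instances_a, query):
    """Return True iff the model answers identically for every instance
    in modality a and its converted counterpart in modality b."""
    for d_a in instances_a:
        answer_a = model(d_a, query)                    # M(d_a, q)
        answer_b = model(convert_a_to_b(d_a), query)    # M(K^q_{a,b}(d_a), q)
        if answer_a != answer_b:
            return False
    return True
```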

To evaluate cross-modal consistency, the authors construct a vision-language parallel dataset spanning seven tasks:

  • Math Equation Solving (easy and hard)
  • Logical Reasoning
  • Table Understanding
  • State Machine Reasoning
  • Reading Comprehension

These datasets are designed such that data instances can be converted between image and text formats while preserving task-related information, using Optical Character Recognition (OCR) and screenshot software as converters.
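
As an illustration of such an information-preserving converter, the sketch below renders a textual instance into an image (standing in for screenshots of text instances) and recovers text from an image via OCR. Pillow and pytesseract are assumed stand-ins for the screenshot and OCR tools described in the paper, not necessarily the tools the authors used.

```python
# Sketch of text<->image conversion for building parallel instances.
# Assumes Pillow for rendering and pytesseract (with a local Tesseract
# install) for OCR; both are illustrative choices.
from PIL import Image, ImageDraw
import pytesseract

def text_to_image(text: str, size=(800, 200)) -> Image.Image:
    """Render a task instance as a white-background image (text -> vision)."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((10, 10), text, fill="black")
    return img

def image_to_text(img: Image.Image) -> str:
    """Recover the textual instance from an image via OCR (vision -> text)."""
    return pytesseract.image_to_string(img).strip()
```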

The core of the evaluation framework involves comparing a model's performance on paired instances $(d^{(i)}_a, q)$ and $(d^{(i)}_b, q)$, and the task consistency score $C_t$ is computed as:

$$C_t = \frac{1}{n}\sum_{i=1}^{n} c^i_M$$

where

$$c^i_M = \begin{cases} 1, & \text{if } M(d^{(i)}_a, q) = M(d^{(i)}_b, q) \\ 0, & \text{otherwise} \end{cases}$$
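
In code, the score is simply the fraction of paired instances on which the two answers agree. The sketch below assumes two hypothetical lists of model outputs, one per modality, aligned by instance index.

```python
# Minimal sketch of the task consistency score C_t: the fraction of paired
# instances on which the model's answers agree across the two modalities.
# `answers_a` and `answers_b` are hypothetical aligned lists of model outputs
# for the text-form and image-form versions of the same n instances.

def task_consistency(answers_a, answers_b):
    assert len(answers_a) == len(answers_b) and answers_a
    matches = sum(a == b for a, b in zip(answers_a, answers_b))  # sum of c^i_M
    return matches / len(answers_a)                              # C_t

# Example: 3 of 4 paired instances agree -> C_t = 0.75
print(task_consistency(["4", "7", "yes", "B"], ["4", "7", "no", "B"]))
```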

The authors conduct experiments using GPT-4V, evaluating its cross-modal consistency on the constructed datasets. The results reveal significant inconsistencies between the vision and language modalities. GPT-4V demonstrates varying performance depending on whether the task is prompted in one modality versus the other. In tasks involving intricate reasoning, such as equation solving, math/logical reasoning, and state machine reasoning, the model exhibits lower accuracy with image inputs compared to text inputs. Conversely, in tasks focused on information extraction and comprehension, such as language understanding and table understanding, the model shows near-perfect performance with text inputs but a substantial drop in accuracy with image inputs.

To investigate whether the observed performance gap is due to the model's inability to access information from images, the authors conduct an ablation study in which OCR is applied to the image inputs. The results indicate that the model can accurately extract information from images, suggesting that the performance gap is primarily attributable to the model's internal reasoning processes for each modality. Conditional consistency scores are also reported for image instances, split by whether the OCR result is correct or incorrect.
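
A minimal sketch of such conditional scores is given below; `ocr_correct` is a hypothetical per-instance boolean list indicating whether OCR recovered the image content, and the function simply computes the agreement rate within each OCR outcome group.

```python
# Sketch of conditional consistency: the consistency score restricted to
# image instances with correct vs. incorrect OCR. All three arguments are
# hypothetical aligned lists over the same instances.

def conditional_consistency(answers_text, answers_image, ocr_correct):
    groups = {True: [], False: []}
    for a, b, ok in zip(answers_text, answers_image, ocr_correct):
        groups[bool(ok)].append(a == b)          # c^i_M for this instance
    return {
        "ocr_correct": sum(groups[True]) / max(len(groups[True]), 1),
        "ocr_incorrect": sum(groups[False]) / max(len(groups[False]), 1),
    }
```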

To address the identified cross-modal inconsistency, the authors introduce a method called Vision-Depicting-Prompting (VDP). VDP involves a two-step process (a minimal code sketch follows the list):

  1. Prompting the model to extract and articulate a textual description of the image task.
  2. Prompting the model to provide an answer, considering both the textual description and the original image input.
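
The sketch below illustrates this two-step pipeline. `query_vllm` is a hypothetical helper for calling a vision-LLM, and the prompt wording is illustrative rather than the paper's exact prompts.

```python
# Minimal sketch of Vision-Depicting-Prompting (VDP), assuming a hypothetical
# helper `query_vllm(image, prompt)` that sends an image plus a text prompt
# to a vision-LLM and returns its text response.

def vdp_answer(query_vllm, image, task_question: str) -> str:
    # Step 1: ask the model to depict the image task in text.
    description = query_vllm(
        image=image,
        prompt="Describe in text the task shown in this image, including all "
               "relevant details (equations, tables, or passages).",
    )
    # Step 2: ask for the answer, conditioning on both the description and
    # the original image input.
    return query_vllm(
        image=image,
        prompt=f"Task description: {description}\n\n"
               f"Question: {task_question}\n"
               "Answer using both the description and the image.",
    )
```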

The experimental results demonstrate that VDP improves accuracy in vision-based tasks compared to naive prompting. In tasks requiring reasoning abilities, VDP yields an average accuracy enhancement of 19%. In tasks centered around understanding, VDP achieves an average accuracy increase of 57%, with performance reaching parity with text-based prompting in some cases. The authors also observe a substantial increase in the consistency score with VDP compared to prompting with plain images.

The authors conclude that multimodal systems like GPT-4V maintain relatively independent internal representations for reasoning over visual and textual signals. They suggest that these findings offer insights into the appropriate use of such models and highlight the need for more integrated system designs, and they present VDP as an effective approach to mitigating cross-modal inconsistency.

Authors (8)
  1. Xiang Zhang
  2. Senyu Li
  3. Ning Shi
  4. Bradley Hauer
  5. Zijun Wu
  6. Grzegorz Kondrak
  7. Muhammad Abdul-Mageed
  8. Laks V. S. Lakshmanan